Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

hiveql - Can you explain when and why MapReduce is invoked in Hive?

  1. select * from Table_name limit 5;

  2. select col1_name,col2_name from table_name limit 5;

When I run the first query, no MapReduce job is invoked, while for the second query MapReduce is invoked. Could you please explain the reason?

1 Answer

Take the simple Hive query below:

Describe table;

This reads data from the Hive metastore and is the simplest and fastest query in Hive.

select * from table;

This query only needs to read data from HDFS. So far, neither query requires a map or reduce phase.

select * from table where color in ('RED','WHITE','BLUE')

This query requires a map phase only; there is no reduce phase because there is no aggregation function of any kind. Here we are filtering to collect the records that are RED, WHITE, or BLUE.
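To make the map-only idea concrete, here is a minimal Python sketch. It is a toy model, not Hive's implementation: the `splits` data, the `id` field, and the `map_filter` helper are hypothetical names invented for illustration. Each split is filtered on its own, so nothing has to be shuffled to a reducer.

```python
# Toy model of a map-only job: each input split is filtered independently,
# so no shuffle or reduce step is needed. All names here are illustrative.
WANTED = {"RED", "WHITE", "BLUE"}


def map_filter(split):
    """Keep only the records whose color is in WANTED."""
    return [row for row in split if row["color"] in WANTED]


# Two hypothetical input splits, as if the table were stored in two blocks.
splits = [
    [{"id": 1, "color": "RED"}, {"id": 2, "color": "GREEN"}],
    [{"id": 3, "color": "BLUE"}, {"id": 4, "color": "BLACK"}],
]

# Each mapper produces its own output; the results are simply concatenated.
filtered = [row for split in splits for row in map_filter(split)]
print(filtered)
```

Because no mapper needs to see another mapper's records, the outputs can be written directly without a reduce phase.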

select count(1) from table;

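This query requires a reduce phase to produce the single total. No per-key mapping is required because we are counting all the records in the table (in practice Hive still runs map tasks to read the data and emit partial counts, which a single reducer then sums). If we want to count across elements, then we add a map phase that emits a key prior to the reduce phase. See below: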

Select color
, count(1) as color_count 
  from table  
  group by color;

This query has an aggregation function and a group by clause. We are counting the number of records in the table for each color, for example RED, WHITE, or BLUE. This counting requires both a map phase and a reduce phase.

Essentially, we create key-value pairs in the above job. We map each record to a key, in this case RED, WHITE, or BLUE, and assign it a value of one, so each pair is color:1. Then we sum the values for each color key. This is a map and reduce job.
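The color:1 pairing described above can be sketched in a few lines of Python. This is only a simulation of the idea under simple assumptions (everything fits in memory, one process); the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample `rows` are hypothetical, not Hive APIs.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit a (color, 1) pair for every record."""
    return [(row["color"], 1) for row in rows]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the ones collected for each color."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample data standing in for the table.
rows = [{"color": c} for c in ["RED", "WHITE", "RED", "BLUE", "RED", "BLUE"]]

color_count = reduce_phase(shuffle(map_phase(rows)))
print(color_count)  # {'RED': 3, 'WHITE': 1, 'BLUE': 2}
```

The shuffle step is what makes a reduce phase necessary: records with the same color may come from different mappers and must be brought together before they can be summed.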

Now take the same query and add an order by clause.

Select color
, count(1) as color_count 
  from table  
  group by color
  order by color_count desc;

This adds another reduce phase and forces the whole data set to pass through a single reducer. This is necessary because we want to ensure that global ordering is maintained. Count(distinct color) also forces a single reducer and requires a map and reduce phase.
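A small Python sketch may help show why global ordering pushes everything through one reducer. It is a toy model with made-up data: `pairs`, `sort_partition`, and the way the data is split between two "reducers" are all assumptions for illustration.

```python
def sort_partition(pairs):
    """Sort (color, count) pairs by count, descending, within one reducer."""
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)


# Hypothetical (color, count) pairs produced by an earlier group-by stage.
pairs = [("RED", 3), ("WHITE", 1), ("BLUE", 2), ("GREEN", 5)]

# Two reducers: each one sorts only the partition it received.
part_a, part_b = pairs[:2], pairs[2:]
two_reducers = sort_partition(part_a) + sort_partition(part_b)

# One reducer: it sees every pair, so the result is globally ordered.
one_reducer = sort_partition(pairs)

print(two_reducers)  # [('RED', 3), ('WHITE', 1), ('GREEN', 5), ('BLUE', 2)]
print(one_reducer)   # [('GREEN', 5), ('RED', 3), ('BLUE', 2), ('WHITE', 1)]
```

With two reducers each output is sorted locally, but the concatenation is not sorted overall, which is why a total order is obtained by sending all rows to a single reducer.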

As you add complexity to your Hive query, you similarly add to the map and reduce jobs required to obtain the requested results.

If you want to find out how Hive will execute a query, you can put the explain keyword in front of your query.

 Explain select * from table;

This can give you an idea of how the query is executed under the hood. It shows the dependencies between stages, and which aggregations, if any, result in reduce work and which operators result in map work.

