Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
118 views
in Technique[技术] by (71.8m points)

java - Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me.

Suppose we have a file with 1 million entries(integers) and we want to sort them using MapReduce. The way i understood to go about it is as follows:

Write a mapper function that sorts integers. So the framework will divide the input file into multiple chunks and would give them to different mappers. Each mapper will sort their chunk of data independent of each other. Once all the mappers are done, we will pass each of their results to Reducer and it will combine the result and give me the final output.

My doubt is, if we have one reducer, then how does it leverage the distributed framework, if, eventually, we have to combine the result at one place?. The problem drills down to merging 1 million entries at one place. Is that so or am i missing something?

Thanks, Chander

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Check out merge-sort.

It turns out that sorting partially sorted lists is much more efficient in terms of operations and memory consumption than sorting the complete list.

If the reducer gets 4 sorted lists it only needs to look for the smallest element of the 4 lists and pick that one. If the number of lists is constant this reducing is an O(N) operation.

Also typically the reducers are also "distributed" in something like a tree, so the work can be parrallelized too.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...