Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


sorting - How does Spark achieve sort order?

Assume I have a list of Strings. I filter and sort them, and collect the result to the driver. However, the data is distributed, and each RDD partition holds its own part of the original list. So how does Spark achieve the final sorted order? Does it merge the results?


1 Answer


Sorting in Spark is a multiphase process which requires shuffling:

  1. The input RDD is sampled, and the sample is used to compute the boundaries of each output partition (sample followed by collect).
  2. The input RDD is partitioned with a RangePartitioner, using the boundaries computed in the first step (partitionBy).
  3. Each partition from the second step is sorted locally (mapPartitions).

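Because the range partitioner assigns each partition a contiguous, non-overlapping range of keys, every element in partition i sorts before every element in partition i + 1. When the data is collected, all that is left is to concatenate the partitions in the order defined by the partitioner; no merge step is needed.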
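To make the three steps concrete, here is a minimal sketch in plain Scala collections, with no Spark dependency. It is not Spark's implementation: the names `RangeSortSketch`, `boundaries`, `partitionIndex` and `sortDistributed` are hypothetical, a nested `Seq` stands in for the partitions of an RDD, and a deterministic "every second element" sample replaces Spark's random sampling.

```scala
object RangeSortSketch {
  // Step 1: derive partition boundaries from a sample of the input.
  // Assumption: every `step`-th element is sampled so the result is deterministic.
  def boundaries(input: Seq[Seq[Int]], numPartitions: Int, step: Int = 2): Seq[Int] = {
    val sample = input.flatten.zipWithIndex
      .collect { case (x, i) if i % step == 0 => x }
      .sorted
    if (sample.isEmpty) Seq.empty
    else
      // Up to numPartitions - 1 split points, spaced evenly through the sorted sample.
      (1 until numPartitions).map(i => sample(i * sample.length / numPartitions)).distinct
  }

  // Step 2: the target partition is the number of boundaries strictly below x,
  // so smaller keys always land in lower-numbered partitions.
  def partitionIndex(bounds: Seq[Int], x: Int): Int = bounds.count(_ < x)

  def sortDistributed(input: Seq[Seq[Int]], numPartitions: Int): Seq[Seq[Int]] = {
    val bounds = boundaries(input, numPartitions)
    // "Shuffle": every element moves to the partition that owns its key range.
    val shuffled = input.flatten.groupBy(partitionIndex(bounds, _))
    // Step 3: sort each partition locally, keeping partitions in index order.
    (0 to bounds.length).map(i => shuffled.getOrElse(i, Seq.empty[Int]).sorted)
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Seq(4, 2), Seq(5, 3), Seq(1)) // three input "partitions"
    val parts = sortDistributed(input, numPartitions = 3)
    println(parts)         // locally sorted partitions, in range order
    println(parts.flatten) // "collect": plain concatenation is globally sorted
  }
}
```

Note that the sampled boundaries may split the data unevenly, and some output partitions can end up empty. The result is still correct, because correctness depends only on the key ranges being ordered, not on their sizes.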

The steps above are reflected in the debug string:

scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...

scala> rdd.sortBy(identity).toDebugString
res1: String = 
(6) MapPartitionsRDD[10] at sortBy at <console>:24 [] // Sort partitions
 |  ShuffledRDD[9] at sortBy at <console>:24 [] // Shuffle
 +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 [] // Pre-shuffle steps
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [] // Parallelize
