If you have a huge array that is accessed from Spark closures, for example some reference data, the array is serialized together with the closure and shipped to the cluster with every task. For example, on a 10-node cluster with 100 partitions (10 partitions per node), the array will be sent at least 100 times (10 times to each node).
If you use a broadcast variable instead, the array is distributed only once per node, using an efficient peer-to-peer protocol, and every task running on that node reuses the local copy.
val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)
And given some RDD:
val rdd: RDD[Int] = ???
In this case the array is shipped with the closure each time:
rdd.map(i => array.contains(i))
whereas with the broadcast, each task reads the copy already present on its node, which can give a huge performance benefit:
rdd.map(i => broadcasted.value.contains(i))
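To try this end to end, here is a minimal self-contained sketch. It assumes Spark 2.x or later running in local mode. The object name `BroadcastSketch`, the `run` helper, and the sample data are made up for illustration and are not part of the original snippets.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  // Hypothetical helper: builds a local session, broadcasts a reference
  // array, and checks membership for a few sample values.
  def run(): Array[Boolean] = {
    val spark = SparkSession.builder()
      .appName("broadcast-sketch")
      .master("local[2]") // assumption: local mode, two worker threads
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // Stand-in for the "huge" reference data: all even numbers below 1,000,000.
      val array: Array[Int] = (0 until 1000000 by 2).toArray

      // Register the array as a broadcast variable once, on the driver.
      val broadcasted: Broadcast[Array[Int]] = sc.broadcast(array)

      val rdd: RDD[Int] = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)

      // The closure captures only the small Broadcast handle, not the array;
      // each task reads the data through broadcasted.value.
      rdd.map(i => broadcasted.value.contains(i)).collect()
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", "))
}
```

If the lookup itself is the bottleneck, broadcasting a `Set[Int]` instead of an `Array[Int]` is worth considering, since `contains` on an array is a linear scan.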