broadcast is used to ship an object to every worker node. That object is then shared by all partitions processed on that node, and its value is the same on every node in the cluster. The goal of broadcasting is to save on network costs when many different tasks/partitions on a worker node use the same data: the data is shipped to each worker once instead of with every task.
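As a rough illustration of the broadcast case, here is a minimal sketch that assumes a local Spark session. The lookup table countryNames and the sample codes are made up for the example; only SparkContext.broadcast and Broadcast.value are actual Spark API.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would get its master from spark-submit.
    val spark = SparkSession.builder()
      .appName("broadcast-sketch")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical read-only lookup table that every task needs.
    val countryNames = Map("DE" -> "Germany", "FR" -> "France")

    // Ship the table to the workers once and keep a handle to it.
    val lookup = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("DE", "FR", "DE", "XX"), numSlices = 2)

    // Tasks read the shared copy through lookup.value.
    val named = codes.map(code => lookup.value.getOrElse(code, "unknown"))

    named.collect().foreach(println)
    spark.stop()
  }
}
```

The closure captures only the small broadcast handle, not the map itself, which is where the network saving comes from.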
mapPartitions, in contrast, is a method available on RDDs that works like map, only on whole partitions: the function you pass is called once per partition and receives an iterator over that partition's elements. Yes, you can define new objects in it, such as a JDBC connection, and each partition will then get its own. However, you can't share such an object among different partitions, much less among different nodes.
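The per-partition object pattern might look like the sketch below. FakeConnection is a hypothetical stand-in for something like a JDBC connection, so that the example runs without a database; the random id is only there to show that each partition gets its own instance.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for an expensive resource such as a JDBC connection.
class FakeConnection(val id: String) {
  def lookup(x: Int): String = s"connection $id handled $x"
  def close(): Unit = ()
}

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("mapPartitions-sketch")
      .master("local[2]")
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 6, numSlices = 3)

    val results = numbers.mapPartitions { iter =>
      // Created on the executor, once per partition, not once per element.
      val conn = new FakeConnection(java.util.UUID.randomUUID().toString)
      // The iterator is lazy, so materialize the results before closing the connection.
      val out = iter.map(conn.lookup).toList
      conn.close()
      out.iterator
    }

    results.collect().foreach(println)
    spark.stop()
  }
}
```

Elements from the same partition report the same connection id, while different partitions report different ids, which matches the point that such an object cannot be shared across partitions.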