Some guidance off the top of my head, noting that the title and the body of your question differ to a degree:
Question 1:
A JOIN will do any (hash) partitioning / repartitioning required automatically, if it is needed and a Broadcast JOIN is not being used. You may set the number of partitions used for shuffling via spark.sql.shuffle.partitions, or rely on the default of 200. The same applies when more parties (DataFrames) are involved in the join.
repartition is a transformation, so an up-front repartition may not be executed at all due to Catalyst optimization; check the physical plan generated by .explain. That's the deal with lazy evaluation: whether something is actually necessary is determined only upon Action invocation. See the sketch below.
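A minimal sketch of both points, assuming a local session; the frame and column names are made up, and broadcast joins are disabled here only so the shuffle stays visible on tiny demo frames:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-shuffle-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Override the default of 200 shuffle partitions, if desired.
spark.conf.set("spark.sql.shuffle.partitions", 64)
// Disable broadcast joins so the shuffle is visible even on tiny frames.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
val right = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "r")

// The up-front repartition is just a transformation; whether it survives
// into the executed plan is Catalyst's call.
val joined = left.repartition($"id").join(right, "id")

// Look for "Exchange hashpartitioning(id, 64)" nodes - those are the shuffles.
joined.explain()
```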
Question 2:
If you have a use case where you JOIN certain inputs/outputs regularly, then Spark's bucketBy is a good approach: it obviates shuffling. The Databricks docs show this clearly. Note that a Spark schema using bucketBy is NOT compatible with Hive, so these remain Spark-only tables, unless this has changed recently. A sketch of bucketing both sides of a join follows below.
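A minimal sketch of bucketing both sides on the join key, reusing the session and implicits from the sketch above; the table and column names (orders_bucketed, cust_id, ...) are illustrative only:

```scala
val orders    = Seq((1, "o-100"), (2, "o-101"), (1, "o-102")).toDF("cust_id", "order_id")
val customers = Seq((1, "alice"), (2, "bob")).toDF("cust_id", "name")

// Bucket both sides with the SAME bucket count on the join key.
// Note: bucketBy only works with saveAsTable, not a plain save().
orders.write.bucketBy(8, "cust_id").sortBy("cust_id").saveAsTable("orders_bucketed")
customers.write.bucketBy(8, "cust_id").sortBy("cust_id").saveAsTable("customers_bucketed")

val joined = spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "cust_id")

// With matching bucketing, the physical plan should show no Exchange
// (shuffle) step on either side of the SortMergeJoin.
joined.explain()
```

The one-time cost of writing the bucketed tables pays off when the same join recurs.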
Using Hive partitioning as you state depends on push-down logic, partition pruning, etc. It should work as well; see the pruning sketch below.
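To illustrate partition pruning, a hypothetical layout written with Hive-style partitioning on a dt column (the path and names are made up):

```scala
spark.range(100)
  .selectExpr("id", "cast(id % 4 as string) as dt")
  .write.partitionBy("dt").parquet("/tmp/events")

val events = spark.read.parquet("/tmp/events")

// A filter on the partition column should show up under PartitionFilters
// in the FileScan node, i.e. only the dt=1 directory gets read.
events.filter($"dt" === "1").explain()
```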
However, you may end up with a different number of partitions inside Spark after the read: it's a bit more complicated than saying "I have N partitions on disk, so I will get N partitions on the initial read."
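Continuing the sketch above, a quick way to see this for yourself:

```scala
// The partition count after the read is driven by file sizes and settings
// such as spark.sql.files.maxPartitionBytes, not by how many dt=...
// partition directories exist on disk.
println(events.rdd.getNumPartitions)
```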