I had the exact same problem, and I found a way to do this using `DataFrame.repartition()`. The problem with using `coalesce(1)` is that your parallelism drops to 1, so it can be slow at best and error out at worst. Increasing that number doesn't help either: if you do `coalesce(10)` you get more parallelism, but end up with up to 10 files per output partition.
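For reference, the kind of write that runs into that problem looks roughly like this (a sketch, assuming the same `df` and `location` as in the question):

import org.apache.spark.sql.SaveMode

// Collapses the whole DataFrame into a single partition before writing,
// so one task does all the work: slow at best, failures/OOM at worst.
df.coalesce(1)
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")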
To get one file per partition without using `coalesce()`, use `repartition()` with the same columns you want the output to be partitioned by. So in your case, do this:
import org.apache.spark.sql.SaveMode
import spark.implicits._

df.repartition($"entity", $"year", $"month", $"day", $"status")
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")
Once I do that, I get one parquet file per output partition instead of multiple files.
I tested this in Python, but I assume it should be the same in Scala.