Use .repartition(1) or, as @blackbishop says, coalesce(1) to say "I only want one partition in the output".
- Use a subdirectory: things don't like writing to the root path, as it isn't a normal directory.
- Filenames are chosen by the partitioning code, so it's best to list the directory for the single file and rename it.
It should look something like this:
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

val dest = "s3://" + target_bucket_name + "/subdir"
val destPath = new Path(dest)
val conf = spark.sparkContext.hadoopConfiguration // the hadoop conf from your spark session
val fs = FileSystem.get(destPath.toUri, conf)
fs.delete(destPath, true) // clear out any previous run
file_spark_df.repartition(1).write.parquet(dest)
// at this point there should be exactly one part file in the dest dir,
// plus the _SUCCESS marker the committer writes, which we filter out
val files = fs.listStatus(destPath).filter(_.getPath.getName.startsWith("part-"))
if (files.length != 1) throw new IOException("Wrong number of files in " + destPath)
fs.rename(files(0).getPath, new Path(destPath, "final-filename.parquet"))
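As mentioned above, coalesce(1) is a drop-in alternative on the write line (same file_spark_df and dest as above):

file_spark_df.coalesce(1).write.parquet(dest)

coalesce(1) narrows the existing partitions without a full shuffle, though either way the final write funnels through a single task.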
(Note: code written at the console, not compiled or tested, but you should get the idea.)
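If you want something compilable rather than console scratch, the same steps fold into a small helper. The function name and parameters here are my own naming, a sketch of the steps above rather than anything Spark provides:

import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Write df as a single parquet file named fileName inside destDir.
def writeSingleParquet(spark: SparkSession, df: DataFrame, destDir: String, fileName: String): Unit = {
  val destPath = new Path(destDir)
  val fs = FileSystem.get(destPath.toUri, spark.sparkContext.hadoopConfiguration)
  fs.delete(destPath, true)                // clear out any previous run
  df.repartition(1).write.parquet(destDir) // one partition -> one part file
  val files = fs.listStatus(destPath).filter(_.getPath.getName.startsWith("part-"))
  if (files.length != 1) throw new IOException("Wrong number of files in " + destPath)
  fs.rename(files(0).getPath, new Path(destPath, fileName))
}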
He who fights dragons for too long becomes a dragon himself; stare into the abyss for too long, and the abyss stares back into you…