Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

pyspark - Spark parquet compression and encoding schemes

I need the Parquet files produced by my PySpark script to be encoded with RLE_DICTIONARY (https://www.slideshare.net/databricks/the-parquet-format-and-performance-optimization-opportunities).

Secondly, I need compression to be applied at the row group (split unit) level rather than to the file as a whole, ideally with snappy, so that we can support parallel reads from Redshift Spectrum (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html).
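To make the distinction concrete, here is a toy sketch of file-level versus chunk-level compression. It is not Parquet itself: `zlib` stands in for snappy (which is not in the Python standard library), the "row groups" are plain byte strings, and the `index` list is a hypothetical stand-in for the Parquet footer. In the actual Parquet format, compression is applied to pages inside each column chunk rather than to the finished file, which is what lets readers fetch row groups independently.

```python
import zlib

# Three toy "row groups"; real Parquet row groups hold column chunks, not raw bytes.
row_groups = [f"row-group-{i}:".encode() + bytes([65 + i]) * 1000 for i in range(3)]

# File-level compression: one compressed stream, so reading any part
# means decompressing from the start of the stream.
whole_file = zlib.compress(b"".join(row_groups))

# Chunk-level compression: each row group is compressed on its own, and a small
# index (playing the role of the Parquet footer) records each compressed size.
compressed_chunks = [zlib.compress(rg) for rg in row_groups]
index = [len(chunk) for chunk in compressed_chunks]
container = b"".join(compressed_chunks)


def read_row_group(container: bytes, index: list, n: int) -> bytes:
    """Decompress only row group n, using the index to locate its byte range."""
    start = sum(index[:n])
    return zlib.decompress(container[start:start + index[n]])


# Row group 2 can be read without touching row groups 0 and 1.
last_group = read_row_group(container, index, 2)
```

With the chunked layout, several readers can each take a different row group and decompress it in parallel, whereas the single-stream layout has to be processed sequentially.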

However, looking at the official Spark documentation for Parquet, there are only a few Parquet-related properties that can be set (https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration). This property:

spark.sql.parquet.compression.codec 

defaults to snappy, but does it apply file-level or split-level compression? That is, does Spark first produce the Parquet file and then snappy-compress it, or does it first snappy-compress the row groups (splits) and then assemble the file?
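For reference, a minimal sketch of how this property is typically set from PySpark. The helper name `write_parquet_with_codec`, the app name, and the placeholder DataFrame are assumptions made for illustration; the property name and the `compression` write option are the ones Spark documents. The `pyspark` import sits inside the function so the snippet can be loaded on a machine without Spark installed.

```python
# Writer settings for this sketch; the property name comes from the Spark docs linked above.
PARQUET_CONF = {"spark.sql.parquet.compression.codec": "snappy"}


def write_parquet_with_codec(output_path: str, conf: dict = PARQUET_CONF) -> None:
    """Write a small placeholder DataFrame as Parquet using the codec named in conf."""
    # Imported lazily so this module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("parquet-codec-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    df = spark.range(1000)  # placeholder data with a single `id` column
    codec = conf["spark.sql.parquet.compression.codec"]
    # The per-write `compression` option overrides the session-level property.
    df.write.mode("overwrite").option("compression", codec).parquet(output_path)
```

Calling `write_parquet_with_codec("/tmp/parquet_codec_sketch")` (a hypothetical path) on a machine with Spark available should produce a directory of part files that can then be inspected with a Parquet metadata tool to see which codec each column chunk reports.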

What is the default behavior here? Does it meet my requirement of compressing each split chunk rather than the whole file? Is RLE_DICTIONARY the default encoding used by Spark? I couldn't find an option to set the encoding itself.
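As background on what the encoding does, here is a toy illustration of the idea behind dictionary encoding combined with run-length encoding. It is a simplification: Parquet's RLE_DICTIONARY stores the dictionary indices using a hybrid of run-length encoding and bit-packing, whereas this sketch only emits plain `(index, run_length)` pairs. The function names and the sample column are made up for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with its index in a dictionary of distinct values."""
    dictionary = []   # distinct values in order of first appearance
    positions = {}    # value -> index into `dictionary`
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into (index, run_length) pairs."""
    return [(idx, len(list(run))) for idx, run in groupby(indices)]


def decode(dictionary, runs):
    """Expand the runs and look each index up in the dictionary."""
    return [dictionary[idx] for idx, count in runs for _ in range(count)]


column = ["DE", "DE", "DE", "US", "US", "DE", "FR", "FR"]
dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)
# dictionary == ["DE", "US", "FR"]
# runs == [(0, 3), (1, 2), (0, 1), (2, 2)]
```

The scheme pays off on columns with few distinct values and long runs of repeats, which is why it is usually discussed alongside sorting or clustering the data before writing.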

Question from: https://stackoverflow.com/questions/65844890/spark-parquet-compression-and-encoding-schemes
