
python - How to efficiently write multiple pyarrow tables (>1,000 tables) to a partitioned parquet dataset?

I have some big files (around 7,000 in total, 4 GB each) in another format that I want to store as a partitioned (hive) parquet dataset using pyarrow.parquet.write_to_dataset() for fast querying.

Currently, I am looping over all the files using the following process:

import pyarrow as pa
import pyarrow.parquet as pq

for each_file in file_list:
    # read one source file into a list of arrays
    ndarray_temp = reader(each_file)
    # from_arrays also needs the column names
    table_temp = pa.Table.from_arrays(ndarray_temp, names=[...])
    # one write_to_dataset() call per file
    pq.write_to_dataset(table_temp, root_path='xxx', partition_cols=[...])

This is quite slow, as pq.write_to_dataset() takes about 27 s to write each table to the directory (on an SSD), and it creates many small parquet files under each partition folder.

My questions are:

  1. Is there a better way to do it? Say I have enough memory to hold 100 temp tables, can I write these 100 tables all at once? (Roughly what I have in mind is sketched after this list.)

  2. Will the hundreds of small parquet files under each folder affect reading and filtering performance? Is it better to write many smaller tables one by one or one huge table at once?
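
For question 1, this is roughly what I have in mind. It is only a sketch, assuming all tables share the same schema; BATCH_SIZE, column_names and partition_cols are placeholders for my actual setup, and reader() and file_list are the same as in the loop above:

import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 100          # number of tables to hold in memory before one write

def batches(seq, n):
    # yield successive chunks of n items from seq
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

for chunk in batches(file_list, BATCH_SIZE):
    tables = []
    for each_file in chunk:
        ndarray_temp = reader(each_file)
        tables.append(pa.Table.from_arrays(ndarray_temp, names=column_names))
    # concatenate the batch (requires identical schemas) and write it in a
    # single call, producing fewer, larger parquet files per partition
    pq.write_to_dataset(pa.concat_tables(tables),
                        root_path='xxx',
                        partition_cols=partition_cols)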

Many thanks!

T

question from: https://stackoverflow.com/questions/66057250/how-to-efficiently-write-multiple-pyarrow-tables-1-000-tables-to-a-partitione


1 Answer

To answer your questions:

  1. This is a bit of an opinionated question and not very well suited for Stack Overflow. Your approach isn't bad; for this type of work, the simpler the better. If you want to speed it up by processing several files in parallel, I'd recommend a framework like dask or luigi (a rough sketch follows this list).

  2. Assuming your source files are a random sample with respect to your partition columns, then every file you load and save to parquet adds a new parquet file to each partition, so you may end up with up to 7,000 parquet files per partition. This is because write_to_dataset adds a new file to each partition on every call (instead of appending to an existing file), and this can hurt read performance. If you find this to be a problem, you can "defragment" the dataset: load the partitions one by one and save each one to a new dataset (also sketched after this list). You'll then have a dataset with only one file per partition.
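
To make point 1 concrete, here is a rough sketch of the parallel variant using dask.delayed. It is not tested against your data: reader(), file_list, column_names and partition_cols stand in for your own setup, num_workers is something to tune, and depending on the pyarrow version you should check that concurrently written file names don't collide.

import dask
import pyarrow as pa
import pyarrow.parquet as pq

@dask.delayed
def convert_one(path):
    # same per-file work as your original loop
    ndarray_temp = reader(path)
    table_temp = pa.Table.from_arrays(ndarray_temp, names=column_names)
    # note: concurrent writers to the same dataset rely on unique
    # generated file names; verify this on your pyarrow version
    pq.write_to_dataset(table_temp, root_path='xxx',
                        partition_cols=partition_cols)

tasks = [convert_one(f) for f in file_list]
# run several conversions at once; 'processes' avoids the GIL for the reader
dask.compute(*tasks, scheduler='processes', num_workers=4)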
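
And a rough sketch of the "defragment" pass from point 2. Again only a sketch: 'my_partition_col', partition_values and the 'xxx_compacted' output path are placeholders, while 'xxx' is the original dataset root from the question.

import pyarrow.parquet as pq

for value in partition_values:          # one entry per hive partition
    # read everything that belongs to this partition as a single table
    table = pq.read_table('xxx',
                          filters=[('my_partition_col', '=', value)])
    # rewrite it into a fresh dataset -> one larger file per partition
    pq.write_to_dataset(table,
                        root_path='xxx_compacted',
                        partition_cols=['my_partition_col'])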

