I have about 7,000 big files (around 4 GB each) in other formats that I want to store in a Hive-partitioned directory using pyarrow.parquet.write_to_dataset() for fast querying.
Currently, I am looping over all the files using the following process:
import pyarrow as pa
import pyarrow.parquet as pq
for each_file in file_list:
    # read the source file into arrays (reader() is my own helper)
    ndarray_temp = reader(each_file)
    # build an Arrow table from those arrays (column names/schema omitted here)
    table_temp = pa.Table.from_arrays(ndarray_temp)
    # append the table to the Hive-partitioned dataset
    pq.write_to_dataset(table_temp, root_path='xxx', partition_cols=[...])
This is quite slow: pq.write_to_dataset() takes about 27 seconds to write each table to the directory (on an SSD), and it creates many small Parquet files under each partition folder.
My questions are:
1. Is there a better way to do this? Say I have enough memory to hold 100 temporary tables: can I write those 100 tables all at once? (See the sketch after these questions for roughly what I mean.)
2. Will the hundreds of small Parquet files under each folder hurt reading and filtering performance? Is it better to write many smaller tables one by one, or one huge table at once?
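For concreteness, here is a minimal sketch of what I mean by "write 100 tables at once", assuming all the tables share the same schema and that buffering 100 of them fits in memory (reader(), the root path and the column lists are placeholders, as above):

import pyarrow as pa
import pyarrow.parquet as pq

buffered_tables = []
for each_file in file_list:
    ndarray_temp = reader(each_file)          # placeholder reader, as above
    buffered_tables.append(pa.Table.from_arrays(ndarray_temp))
    if len(buffered_tables) == 100:
        # combine the buffered tables (same schema) and write them in one call
        pq.write_to_dataset(pa.concat_tables(buffered_tables),
                            root_path='xxx', partition_cols=[...])
        buffered_tables = []
if buffered_tables:
    # flush whatever is left over
    pq.write_to_dataset(pa.concat_tables(buffered_tables),
                        root_path='xxx', partition_cols=[...])

Would batching like this actually reduce the per-call overhead and the number of small files per partition, or is there a more idiomatic way to do it?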
Many thanks!
T
question from:
https://stackoverflow.com/questions/66057250/how-to-efficiently-write-multiple-pyarrow-tables-1-000-tables-to-a-partitione