I need to prepare a .parquet file using Python, so this is my code:
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import sys
import mysql.connector
import json

def write_table(databaseServer, databaseDatabase, databaseUser, databasePassword, sql, fileName):
    result = 0
    frame = {}              # column name -> list of values
    convert_to_double = {}  # columns reported as NEWDECIMAL are converted to float

    database = mysql.connector.connect(host=databaseServer, database=databaseDatabase,
                                       user=databaseUser, password=databasePassword)
    try:
        cursor = database.cursor(buffered=False)
        try:
            cursor.execute(sql)

            # Create an empty list per column and remember which columns are DECIMAL.
            for desc in cursor.description:
                name = desc[0]
                type_code = desc[1]
                frame[name] = []
                if mysql.connector.FieldType.get_info(type_code) == 'NEWDECIMAL':
                    convert_to_double[name] = True

            # Pull the whole result set into memory and append it column by column.
            records = cursor.fetchall()
            for record in records:
                for i, name in enumerate(cursor.column_names):
                    if name in convert_to_double:
                        frame[name].append(float(record[i]))
                    else:
                        frame[name].append(record[i])
                result += 1
        finally:
            cursor.close()
    finally:
        database.close()

    if result == 0:
        return result

    df = pd.DataFrame(frame)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)
    return result
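For reference, I call it roughly like this (host, credentials, query and file name below are placeholders; each call covers one monthly partition):

    rows = write_table(
        "db.example.internal",                                  # placeholder host
        "sales",                                                 # placeholder database
        "etl_user",                                              # placeholder user
        "secret",                                                # placeholder password
        "SELECT * FROM orders WHERE order_month = '2021-01'",    # illustrative monthly query
        "orders_2021_01.parquet",
    )
    print(rows, "rows written")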
The problem is that I don't have much memory on the server, but this code loads the entire SQL result set into in-memory columns. I also cannot partition the data any further to load less of it at a time (it is already partitioned by month; I prepare the .parquet files for Amazon Athena).
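What I have been wondering is whether I could fetch the rows in chunks and write each chunk as its own row group with pyarrow.parquet.ParquetWriter, roughly like this (an untested sketch of my idea; chunk_rows is arbitrary, and I am assuming every chunk ends up with the same inferred schema):

    import mysql.connector
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_table_chunked(databaseServer, databaseDatabase, databaseUser,
                            databasePassword, sql, fileName, chunk_rows=10000):
        result = 0
        database = mysql.connector.connect(host=databaseServer, database=databaseDatabase,
                                           user=databaseUser, password=databasePassword)
        try:
            cursor = database.cursor(buffered=False)
            try:
                cursor.execute(sql)
                decimal_columns = {desc[0] for desc in cursor.description
                                   if mysql.connector.FieldType.get_info(desc[1]) == 'NEWDECIMAL'}
                writer = None
                while True:
                    records = cursor.fetchmany(chunk_rows)  # only chunk_rows rows held at a time
                    if not records:
                        break
                    frame = {name: [] for name in cursor.column_names}
                    for record in records:
                        for i, name in enumerate(cursor.column_names):
                            value = record[i]
                            if name in decimal_columns and value is not None:
                                value = float(value)
                            frame[name].append(value)
                        result += 1
                    table = pa.Table.from_pandas(pd.DataFrame(frame), preserve_index=False)
                    if writer is None:
                        # The first chunk defines the parquet schema; later chunks must match it.
                        writer = pq.ParquetWriter(fileName, table.schema)
                    writer.write_table(table)  # each chunk becomes its own row group
                if writer is not None:
                    writer.close()
            finally:
                cursor.close()
        finally:
            database.close()
        return result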
My questions are:
- is there a way to reduce memory usage while preparing .parquet files this way, for example with chunked fetching like the sketch above?
- is there a way to query the data from the database column by column and then write all of those columns into a single .parquet file? (see the first sketch after this list)
- would it help reduce memory usage if I wrote a CSV to disk first and then told pandas to convert that .csv into .parquet? (see the second sketch after this list)
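To make the second question concrete, this is the kind of thing I have in mind (a hypothetical helper; database would be an open mysql.connector connection, and I realize all columns still sit in memory together before the write):

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_table_by_column(database, tableName, columns, fileName):
        # Fetch each column with its own query, then assemble one table and write it once.
        # Note: every column is still held in memory at the same time before pq.write_table.
        arrays = []
        for column in columns:
            cursor = database.cursor(buffered=False)
            try:
                cursor.execute("SELECT `{0}` FROM `{1}`".format(column, tableName))
                arrays.append(pa.array([row[0] for row in cursor.fetchall()]))
            finally:
                cursor.close()
        table = pa.Table.from_arrays(arrays, names=list(columns))
        pq.write_table(table, fileName)

And for the third question, I assume the CSV-to-parquet step itself could at least be done in chunks like this (sketch only; it assumes the CSV is already on disk and that pandas infers compatible dtypes for every chunk):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    def csv_to_parquet(csvFileName, parquetFileName, chunk_rows=50000):
        # Convert an existing CSV to parquet without loading the whole file at once.
        writer = None
        try:
            for chunk in pd.read_csv(csvFileName, chunksize=chunk_rows):
                table = pa.Table.from_pandas(chunk, preserve_index=False)
                if writer is None:
                    # The schema inferred from the first chunk is reused for the whole file.
                    writer = pq.ParquetWriter(parquetFileName, table.schema)
                writer.write_table(table)
        finally:
            if writer is not None:
                writer.close()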
Thanks.
question from:
https://stackoverflow.com/questions/65885183/pyarrow-parquet-write-table-memory-usage