python - pyarrow.parquet.write_table: memory usage

I need to prepare a .parquet file using Python, and this is my code:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import sys
import mysql.connector
import json

def write_table(databaseServer, databaseDatabase, databaseUser, databasePassword, sql, fileName):
    result = 0
    frame = {}
    convert_to_double = {}

    database = mysql.connector.connect(host=databaseServer, database=databaseDatabase, user=databaseUser, password=databasePassword)
    try:
        cursor = database.cursor(buffered=False)
        try:
            cursor.execute(sql)

            for desc in cursor.description:
                name = desc[0]
                type_code = desc[1]

                frame[name] = []

                # DECIMAL columns arrive as decimal.Decimal; mark them for conversion to float
                if mysql.connector.FieldType.get_info(type_code) == 'NEWDECIMAL':
                    convert_to_double[name] = True

            # fetchall() pulls the whole result set into memory at once
            records = cursor.fetchall()

            for record in records:
                i = 0

                for name in cursor.column_names:
                    if name in convert_to_double:
                        frame[name].append(float(record[i]))
                    else:
                        frame[name].append(record[i])

                    i += 1

                result += 1
        finally:
            cursor.close()
    finally:
        database.close()

    if result == 0:
        return result

    df = pd.DataFrame(frame)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)

    return result

The problem is that I don't have much memory on the server, but in this code I load the entire SQL result set into columns held in memory. I also cannot partition the data further to load less at a time (it is already partitioned by month; I prepare the .parquet files for Amazon Athena).

My questions are:

  • Is there a way to reduce memory usage while preparing .parquet files this way?
  • Is there a way to query data from the database column by column and then write all these columns into a single .parquet file?
  • Will it help reduce memory usage if I write a CSV to disk first and then tell pandas to convert the .csv into .parquet?

Thanks.

question from: https://stackoverflow.com/questions/65885183/pyarrow-parquet-write-table-memory-usage


1 Answer


At the moment you are:

  • Loading all the data into memory (into Python lists)
  • Converting the data to a pandas DataFrame
  • Storing the data in Parquet

This strategy only works if all the data can be stored in memory.

You could instead write smaller batches of data to the Parquet file using ParquetWriter, as in the sketch below.
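For illustration, here is a minimal sketch of that pattern. The file name, schema, column names and chunk contents are made up; only the ParquetWriter / write_table calls are the point:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Fixed schema that every chunk must follow (illustrative columns)
    schema = pa.schema([
        ("id", pa.int64()),
        ("amount", pa.float64()),
    ])

    with pq.ParquetWriter("output.parquet", schema) as writer:
        # Each chunk written here becomes one or more row groups in the same file
        for chunk in (
            {"id": [1, 2], "amount": [10.5, 20.0]},
            {"id": [3, 4], "amount": [30.25, 40.75]},
        ):
            table = pa.Table.from_pydict(chunk, schema=schema)
            writer.write_table(table)

Because all chunks go through the same open writer, the result is a single Parquet file while only one chunk needs to be held in memory at a time.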

To answer your questions one by one:

  • You can reduce the memory footprint by using numpy arrays instead of Python lists; they are much more memory-efficient. Also, you should not call fetchall() but stream records from the cursor instead (see the sketch after this list).

  • I don't think there is a way to write a Parquet file column by column. You address the scaling issue by writing smaller groups of rows.

  • A CSV representation of the data is less efficient than Parquet or pandas, so I'm not sure it would help.
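Putting these points together, here is a sketch of how the question's function could be restructured around fetchmany() and ParquetWriter. The name write_table_batched, the batch_size value and the lazy writer initialisation are assumptions of mine, not part of the original code:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import mysql.connector

    def write_table_batched(databaseServer, databaseDatabase, databaseUser,
                            databasePassword, sql, fileName, batch_size=10000):
        result = 0
        writer = None

        database = mysql.connector.connect(host=databaseServer, database=databaseDatabase,
                                           user=databaseUser, password=databasePassword)
        try:
            cursor = database.cursor(buffered=False)
            try:
                cursor.execute(sql)

                # DECIMAL columns arrive as decimal.Decimal; remember which ones to cast to float
                convert_to_double = set()
                for desc in cursor.description:
                    if mysql.connector.FieldType.get_info(desc[1]) == 'NEWDECIMAL':
                        convert_to_double.add(desc[0])

                names = cursor.column_names

                while True:
                    # Stream a limited number of rows instead of fetchall()
                    records = cursor.fetchmany(batch_size)
                    if not records:
                        break

                    columns = {name: [] for name in names}
                    for record in records:
                        for name, value in zip(names, record):
                            columns[name].append(float(value) if name in convert_to_double else value)

                    table = pa.Table.from_pydict(columns)

                    if writer is None:
                        # Open the writer lazily so the schema is inferred from the first chunk
                        writer = pq.ParquetWriter(fileName, table.schema)

                    writer.write_table(table)
                    result += len(records)
            finally:
                cursor.close()
        finally:
            database.close()
            if writer is not None:
                writer.close()

        return result

One caveat with this sketch: the schema is inferred from the first chunk, so a column that is entirely NULL in that chunk may get the wrong type; in that case you would want to build a pa.schema() explicitly from cursor.description instead.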

