
python - Optimizing pd.read_sql

import pandas as pd
import pyodbc

conn1 = pyodbc.connect('DSN=LUDP-Training Presto', uid='*****', pwd='****', autocommit=True)

sql_query = """
    SELECT zsourc_sy, zmsgeo, salesorg, crm_begdat, zmcn, zrtm, crm_obj_id, zcrmprod, prod_hier,
           hier_type, zsoldto, zendcst, crmtgqtycv, currency, zwukrs, netvalord, zgtnper, zsub_4_t
    FROM `prd_updated`.`bw_ms_zocsfs05l_udl`
    WHERE zdcgflag = 'DCG' AND crm_begdat >= '20200101' AND zmsgeo IN ('AP', 'LA', 'EMEA', 'NA')"""

I have to load the above query into a pandas DataFrame, but the pd.read_sql call has been running for more than a couple of hours because the table has over 10 million rows. Is there a way to speed this up?

contract_table = pd.read_sql(sql_query, conn1)
question from: https://stackoverflow.com/questions/65838513/optimizing-pd-read-sql


1 Answer


You can pass a chunksize parameter to the read_sql function (docs), which makes it return an iterator of DataFrames, each containing at most the specified number of rows.

df_iter = pd.read_sql(sql_query, conn1, chunksize=100)

for df in df_iter:                   # each df is a DataFrame with up to 100 rows
    for row in df.itertuples():      # iterate over the rows of the chunk
        ...                          # do work here

Generators are an efficient way to process data that is too large to fit in memory all at once.
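
If the end goal is still a single DataFrame like contract_table, one option is to collect the chunks and concatenate them at the end; note that the full result still has to fit in memory, so the main benefit is being able to filter or aggregate each chunk as it arrives. A minimal sketch, assuming the same sql_query and conn1 as above and an illustrative chunksize of 100000:

chunks = []
for chunk in pd.read_sql(sql_query, conn1, chunksize=100000):
    # optionally trim each chunk here (filter rows, drop columns, aggregate)
    # before keeping it, so less data is held in memory at once
    chunks.append(chunk)

contract_table = pd.concat(chunks, ignore_index=True)  # one DataFrame, as in the question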


