
When using pandas dataframe.to_csv(), with compression='zip', it creates a zip file with two archive files with the EXACT same name

I am trying to save OHLCV (stock pricing) data from a dataframe into a single zipped csv file as follows. My test data is ohlcvData.csv, which I read into a dataframe with

import pandas as pd

df = pd.read_csv('ohlcvData.csv', header=None, names=['datetime', 'open', 'high', 'low', 'close', 'volume'], index_col='datetime')

and when I try to write it to a zip file like so (following stackoverflow.com/questions/55134716):

df.to_csv('ohlcvData.zip', header=False, compression=dict(method='zip', archive_name='ohlcv.csv'))

I get the following warning ...

C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\zipfile.py:1473: UserWarning: Duplicate name: 'ohlcv.csv'
  return self._open_to_write(zinfo, force_zip64=force_zip64)

and the resultant ohlcvData.zip file contains two files, both named ohlcv.csv, each containing a portion of the results.
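
This can be confirmed by listing the archive members with the standard-library zipfile module (a quick sketch, assuming the output file name from the snippet above):

import zipfile

# With the duplicate-entry problem present, this prints the same member
# name twice, e.g. ['ohlcv.csv', 'ohlcv.csv'].
with zipfile.ZipFile('ohlcvData.zip') as zf:
    print(zf.namelist())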

When I try to read the zip file back into a dataframe ...

dfRead = pd.read_csv('ohlcvData.zip', header=None, names=['datetime', 'open', 'high', 'low', 'close', 'volume'], index_col='datetime')

... I get the following error...

 *File "C:UsersjeffmAppDataRoamingPythonPython37site-packagespandasiocommon.py", line 618, in get_handle
    "Multiple files found in ZIP file. "
ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['ohlcv.csv', 'ohlcv.csv']*

However, when I reduce the number of rows in the input file from 200 to around 175 (the exact threshold varies slightly depending on the data), it works: the resulting zip file contains a single csv file that can be loaded back into a dataframe without error. I have tried many different files, with different data and formats, and I get the same result -- any file with more than roughly 175 lines fails, and any file with fewer works fine. So it looks like it's splitting the output into a new archive member after a certain size, but the docs don't mention any such setting. Any help on this would be appreciated. Thanks.



1 Answer


This appears to be a bug introduced in pandas 1.2.0. I created a minimal reproducing example and posted an issue: https://github.com/pandas-dev/pandas/issues/39190

import pandas as pd

# enough data to cause chunking into multiple files
n_data = 100000
df = pd.DataFrame(
    {'name': ["Raphael"] * n_data,
     'mask': ["red"] * n_data,
     'weapon': ["sai"] * n_data,
     }
)

compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('out.csv.zip', index=False, compression=compression_opts)

# reading back the data produces an error
r_df = pd.read_csv("out.csv.zip")

# passing in compression_opts doesn't work either
r_df = pd.read_csv("out.csv.zip", compression=compression_opts)
