Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - 'pd.read_csv' using variable size chunks: How to stop without try/except?

pd.read_csv(iterator=True) returns an iterator of type TextFileReader. I need to call TextFileReader.get_chunk in order to specify the number of rows to return for each call.

import random
import pandas as pd

chunks = pd.read_csv('file.csv', iterator=True)

try:
    while True:
        chunk = chunks.get_chunk(random.randint(1,3))
        print(chunk)
except StopIteration:
    pass

Question: Is there a way to get rid of the try construct in this code? In other words, is there a condition I can put in the while statement to indicate that the iterator has no more rows to deliver?

Here is some csv content for tests:

"Year", "Score", "Title"
1968, 86, "Greetings"
1970, 17, "Bloody Mama"
1971, 40, "Born to Win"
1973, 98, "Mean Streets"
1973, 88, "Bang the Drum Slowly"
1976, 41, "The Last Tycoon"
1976, 99, "Taxi Driver"


Notes

I know the for loop is designed to catch the StopIteration signal, and there is a way to iterate over the TextFileReader returned by pd.read_csv. In that case, however, I don't think I can vary the number of rows returned; it must be fixed:

chunks = pd.read_csv('file.csv',chunksize=3)
for chunk in chunks:
    print(chunk)
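One possible workaround, sketched here as an illustration and not taken from the original post, is to keep a fixed chunksize of 1 and group the one-row chunks with itertools.islice. The helper name variable_chunks and its next_size parameter are hypothetical, and the sketch assumes a pandas version in which the reader returned by pd.read_csv(..., chunksize=1) can be consumed with next(). Reading one row at a time is slow on large files, so treat this as a demonstration of the idea only.

```python
import io
import itertools
import random

import pandas as pd


def variable_chunks(reader, next_size):
    """Yield DataFrames whose row counts are given by next_size().

    `reader` is assumed to yield one-row DataFrames, for example the
    object returned by pd.read_csv(..., chunksize=1). `next_size` is a
    hypothetical callable returning the number of rows wanted next.
    """
    while True:
        size = next_size()
        # islice stops quietly when the reader runs out of rows.
        rows = list(itertools.islice(reader, size))
        if rows:
            yield pd.concat(rows)
        if len(rows) < size:
            # Fewer rows than requested: the reader is exhausted.
            return


csv_text = '''"Year", "Score", "Title"
1968, 86, "Greetings"
1970, 17, "Bloody Mama"
1971, 40, "Born to Win"
1973, 98, "Mean Streets"
'''

# skipinitialspace=True handles the blank after each comma in the sample.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=1, skipinitialspace=True)
for chunk in variable_chunks(reader, lambda: random.randint(1, 3)):
    print(chunk)
```

The loop above contains no try/except: the short final batch is what signals the end of the data.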

Difficulties with the documentation:

For some reason the pandas documentation doesn't include an entry for pandas.io.parsers.TextFileReader. The only pseudo-documentation I found is on the Kite site, and it is mostly an empty shell.

It also seems that TextFileReader was a context manager at some point, which could have offered another solution. However, this no longer seems to be the case, even though the documentation still says it is one and provides examples that don't work, such as:

with pd.read_csv("tmp.sv", sep="|", iterator=True) as reader:
    reader.get_chunk(5)
Question from: https://stackoverflow.com/questions/65850627/pd-read-csv-using-variable-size-chunks-how-to-stop-without-try-except


1 Answer


Hopefully this piece of code resolves your issue.

Generators are useful when you read data in chunks.

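def read():
    chunksize = 10000  # number of characters per chunk, not rows
    with open('Sample.csv', 'r') as f:
        while True:
            read_data = f.read(chunksize)
            if not read_data:
                # An empty string means the end of the file was reached.
                break
            yield read_data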

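When you print the result of calling the function, you get a generator object such as <generator object read_chunks.<locals>.read at 0x0000029DEE9F20C8>. You can iterate over it with a for loop to get each chunk of text (up to chunksize characters, so not necessarily whole rows), and then convert the data to a DataFrame.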
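As a rough illustration of the same generator idea applied to whole rows, here is a minimal sketch that uses only the standard library. The name read_row_batches and its next_size parameter are hypothetical and do not come from the answer above. It relies on the two-argument form of iter(), which keeps calling a function until it returns a sentinel value (here an empty list), so no try/except or explicit end-of-data check is needed. It assumes next_size() always returns at least 1; a size of 0 would produce an empty batch and end the iteration early.

```python
import csv
import io
import itertools
import random


def read_row_batches(lines, next_size):
    """Yield lists of CSV rows (as dicts), each up to next_size() rows long.

    `lines` is any iterable of text lines, such as an open file.
    `next_size` is a hypothetical callable returning a row count >= 1.
    """
    rows = csv.DictReader(lines, skipinitialspace=True)

    def take():
        # Returns [] once the underlying reader is exhausted.
        return list(itertools.islice(rows, next_size()))

    # iter(callable, sentinel) stops as soon as take() returns [].
    yield from iter(take, [])


sample = io.StringIO('''"Year", "Score", "Title"
1968, 86, "Greetings"
1970, 17, "Bloody Mama"
1971, 40, "Born to Win"
1973, 98, "Mean Streets"
''')

for batch in read_row_batches(sample, lambda: random.randint(1, 3)):
    print(batch)
```

Each batch is a list of dictionaries keyed by column name, so it can be passed to pd.DataFrame(batch) if a DataFrame is needed. Note that the values stay strings because the csv module does no type conversion.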

