Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



python - pandas.read_csv() can apply different date formats within the same column! Is it a known bug? How to fix it?

I have realised that, unless the format of a date column is declared explicitly or semi-explicitly (with dayfirst), pandas can apply different date formats to the same column, when reading a csv file! One row could be dd/mm/yyyy and another row in the same column mm/dd/yyyy! Insane doesn't even come close to describing it! Is it a known bug?

To demonstrate: the script below creates a very simple table with the dates from January 1st to the 31st, in the dd/mm/yyyy format, saves it to a csv file, then reads back the csv.

I then use pandas.DatetimeIndex to extract the day. Well, the extracted day is 1 for the first 12 rows (where day and month are both < 13, so the string can also be read as mm/dd/yyyy), and 13, 14, etc. afterwards. How on earth is this possible?
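To make the ambiguity concrete, here is a minimal sketch using only the standard library, not pandas itself. The helper name possible_days is hypothetical and the two format strings are assumptions chosen to mirror the month-first and day-first readings. It shows that a string such as 05/01/2018 is valid under both readings, while 13/01/2018 is valid only as day-first, which is why only the first 12 rows can be misread.

```python
from datetime import datetime

def possible_days(text):
    """Return the day-of-month under each format that accepts the string."""
    results = {}
    for label, fmt in (("month_first", "%m/%d/%Y"), ("day_first", "%d/%m/%Y")):
        try:
            results[label] = datetime.strptime(text, fmt).day
        except ValueError:
            pass  # this format cannot represent the string (e.g. month 13)
    return results

# Both readings are valid, so a guessing parser can pick either one.
print(possible_days("05/01/2018"))

# Only the day-first reading is valid, so there is nothing to misread.
print(possible_days("13/01/2018"))
```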

The only way I have found to fix this is to declare the date format, either explicitly or just with dayfirst=True. But it's a pain because it means I must declare the date format even when I import csv with the best-formatted dates ever! Is there a simpler way?

This happens to me with pandas 0.23.4 and Python 3.7.1 on Windows 10

import numpy as np
import pandas as pd

# Build the dates 01/01/2018 to 31/01/2018 as dd/mm/yyyy strings
df = pd.DataFrame()
df['day'] = np.arange(1, 32)
df['day'] = df['day'].apply(lambda x: "{:0>2d}".format(x))
df['month'] = '01'
df['year'] = '2018'
df['date'] = df['day'] + '/' + df['month'] + '/' + df['year']
df.to_csv('mydates.csv', index=False)

# Same results whether you use parse_dates or not
imp = pd.read_csv('mydates.csv', parse_dates=['date'])
imp['day extracted'] = pd.DatetimeIndex(imp['date']).day
print(imp['day extracted'])


1 Answer


In pandas versions like the one in the question (0.23.4), the parser first tries the American (month-first) date format for each value and, if that fails, falls back to day-first for that value, so it can indeed switch mid-column without raising an error. That breaks the Zen of Python by letting an error pass silently; then again, "Explicit is better than implicit". So if you know your data uses a day-first (international) format, you can use dayfirst:

imp = pd.read_csv('mydates.csv', parse_dates=['date'], dayfirst=True)
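The question also mentions declaring the format explicitly. A minimal sketch of that route follows, assuming the column is first read as plain text and then converted with pd.to_datetime and a format string. The in-memory CSV and the variable names are illustrative stand-ins for mydates.csv, not part of the original script.

```python
import io
import pandas as pd

# Illustrative stand-in for mydates.csv: 01/01/2018 .. 31/01/2018, dd/mm/yyyy
csv_text = "date\n" + "\n".join("{:02d}/01/2018".format(d) for d in range(1, 32)) + "\n"

# Read the column as plain text so read_csv does no date guessing at all.
raw = pd.read_csv(io.StringIO(csv_text), dtype={"date": str})

# Parse every value with one explicit format; a value that does not match
# raises an error instead of being reinterpreted.
raw["date"] = pd.to_datetime(raw["date"], format="%d/%m/%Y")

days = raw["date"].dt.day.tolist()
months = raw["date"].dt.month.unique().tolist()
print(days)
print(months)
```

Compared with dayfirst=True, which is only a hint to the parser, an explicit format is stricter: it is more typing, but a malformed row fails loudly rather than being silently misread.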

With files you produce, be unambiguous by using an ISO 8601 format with a timezone designator.
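One way to follow that advice, sketched under the assumption that the timestamps are timezone-aware (UTC here) and are written with Timestamp.isoformat(); the column name and the in-memory buffer are illustrative only.

```python
import io
import pandas as pd

# Timezone-aware timestamps for January 2018 (UTC chosen as an assumption).
stamps = pd.date_range("2018-01-01", periods=31, freq="D", tz="UTC")

# isoformat() yields ISO 8601 text with an explicit UTC offset.
out = pd.DataFrame({"date": [ts.isoformat() for ts in stamps]})
first_text = out["date"].iloc[0]

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

# Year-month-day order leaves no day/month ambiguity when reading back.
back = pd.read_csv(buffer, dtype={"date": str})
back["date"] = pd.to_datetime(back["date"], utc=True)
days_back = back["date"].dt.day.tolist()
print(first_text)
print(days_back)
```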

