Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Fill in missing hours in a pandas dataframe

I have a dataframe that contains hourly data:

area     date         hour      output
H1       2018-07-01   07:00:00  150
H1       2018-07-01   08:00:00  150
H1       2018-07-01   09:00:00  100
H1       2018-07-01   11:00:00  150
H2       2018-07-01   09:00:00  100
H2       2018-07-01   10:00:00   50
H2       2018-07-01   11:00:00   50
H2       2018-07-01   12:00:00  150

However, the data only contains rows for the hours when there was output. How can I fill in the missing hours for each area with an output of 0? For example, add two rows for H1:

area     date         hour      output
H1       2018-07-01   10:00:00  0
H1       2018-07-01   12:00:00  0

I can assume that the min and max hour across all areas are the beginning and end of the sample period (in this case 07:00:00 and 12:00:00).

Right now, I'm creating a dataframe containing all the hours from 7:00 to 12:00 for each area and then doing a merge of my data with that dataframe, and then filling the NaN with 0s. This is very slow as my data set can have millions of rows.

Is there any better way of doing this?



1 Answer


You can build an hourly date range from the minimum to the maximum timestamp, outer-merge it with your existing dataframe, and fill the resulting NaN values with 0.

df

  area                 date      hour  output
0   H1  2018-07-01 07:00:00  07:00:00     150
1   H1  2018-07-01 08:00:00  08:00:00     150
2   H1  2018-07-01 09:00:00  09:00:00     100
6   H2  2018-07-01 11:00:00  11:00:00      50
7   H2  2018-07-01 12:00:00  12:00:00     150

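import pandas as pd
# 'date' holds the full timestamp (date and hour), as in the frame shown above
df['date'] = pd.to_datetime(df['date'])
hours = pd.DataFrame({'date': pd.date_range(df['date'].min(), df['date'].max(), freq=pd.Timedelta(hours=1))})
# The outer merge adds the missing hours; their NaN values are filled with 0
df = hours.merge(df, on='date', how='outer').fillna(0)
df['hour'] = df['date'].dt.strftime('%H:%M:%S')
df['date'] = df['date'].dt.strftime('%d-%m-%Y')
df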

Out:

         date area      hour  output
0  01-07-2018   H1  07:00:00   150.0
1  01-07-2018   H1  08:00:00   150.0
2  01-07-2018   H1  09:00:00   100.0
3  01-07-2018    0  10:00:00     0.0
4  01-07-2018   H2  11:00:00    50.0
5  01-07-2018   H2  12:00:00   150.0
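
Note that in the output above the filled row has area 0, because the merge fills gaps in the overall timeline rather than per area. If one row per area and hour is needed, as the question asks, one possible alternative is to reindex against the product of all areas and all hours. The following is only a sketch, not the answer's method: the function name fill_missing_hours and the intermediate timestamp column are made up for illustration, and it assumes each (area, date, hour) combination occurs at most once. Whether it is faster than a merge on millions of rows has not been measured here.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "area": ["H1", "H1", "H1", "H1", "H2", "H2", "H2", "H2"],
    "date": ["2018-07-01"] * 8,
    "hour": ["07:00:00", "08:00:00", "09:00:00", "11:00:00",
             "09:00:00", "10:00:00", "11:00:00", "12:00:00"],
    "output": [150, 150, 100, 150, 100, 50, 50, 150],
})


def fill_missing_hours(df):
    """Return one row per (area, hour) over the whole sample period (hypothetical helper)."""
    # Combine the separate date and hour strings into one timestamp
    ts = pd.to_datetime(df["date"] + " " + df["hour"])
    # Every hour between the overall earliest and latest timestamp
    all_hours = pd.date_range(ts.min(), ts.max(), freq=pd.Timedelta(hours=1))
    # Target index: every area combined with every hour
    full_index = pd.MultiIndex.from_product(
        [df["area"].unique(), all_hours], names=["area", "timestamp"]
    )
    out = (
        df.assign(timestamp=ts)
          .set_index(["area", "timestamp"])["output"]
          .reindex(full_index, fill_value=0)  # missing (area, hour) pairs get output 0
          .reset_index()
    )
    # Split the timestamp back into the original date and hour string columns
    out["date"] = out["timestamp"].dt.strftime("%Y-%m-%d")
    out["hour"] = out["timestamp"].dt.strftime("%H:%M:%S")
    return out[["area", "date", "hour", "output"]]


filled = fill_missing_hours(df)
```

With the sample data this adds the 10:00:00 and 12:00:00 rows for H1, and also the 07:00:00 and 08:00:00 rows for H2, since the question assumes the sample period is the same for all areas.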
