Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Sum set of values from pandas dataframe within certain time frame

I have a fairly complicated question. I need to select rows from a data frame within a certain set of start and end dates, and then sum those values and put them in a new dataframe.

So I start off with a data frame, df:

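import random
import pandas as pd

dates = pd.date_range('20150101 020000', periods=1000)
# Note: random.choice returns a single scalar, so each column below
# holds one value repeated on every row.
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2, 60))
                  })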

and define some start and end dates:

import pandas as pd
start_date = ["2-13-16", "2-23-16", "3-17-16", "3-24-16", "3-26-16", "5-17-16", "5-25-16", "10-10-16", "10-18-16", "10-23-16", "10-31-16", "11-7-16", "11-14-16", "11-22-16", "1-23-17", "1-29-17", "2-06-17", "3-11-17", "3-23-17", "6-21-17", "6-28-17"]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)

Then I need to create a new data frame with a weekly_sum column that sums the value column of df for the rows that fall between each start_date and end_date.

So for example, the first row of the new data frame would return the sum of the values between 2-13-16 and 2-20-16. I imagine I'd use groupby.sum() or something similar.

It might look like this:

id      start_date   end_date    weekly_sum
65      2016-02-13   2016-02-20  100

Any direction is greatly appreciated!

P.S. I know my use of random.choice is a little wonky so if you have a better way of generating random numbers, I'd love to see it!


1 Answer


You can apply a helper function to each row of a data frame holding the start and end dates:

import numpy as np
import pandas as pd

def get_dates(x):
    # Select the df rows strictly between this window's start and end.
    n = df[(df['time_stamp'] > x['start']) & (df['time_stamp'] < x['end'])]
    # Return the first id in the window and the sum of its values
    # (assumes the window contains at least one row).
    return n['id'].values[0], n['value'].sum()

dates = pd.date_range('20150101 020000', periods=1000)

df = pd.DataFrame({'id': np.random.randint(0, 1000, size=(1000,)),
                   'time_stamp': dates,
                   'value': np.random.randint(2, 60, size=(1000,))
                  })

# start_date and end_date are the ones defined in the question.
ndf = pd.DataFrame({'start': pd.to_datetime(start_date), 'end': end_date})
# Unpack the returned tuples into the id and value columns.
ndf[['id', 'value']] = ndf.apply(get_dates, axis=1).apply(pd.Series)

print(df.head(5))
   id          time_stamp  value
0  770 2015-01-01 02:00:00     59
1  781 2015-01-02 02:00:00     32
2  761 2015-01-03 02:00:00     40
3  317 2015-01-04 02:00:00     16
4  538 2015-01-05 02:00:00     20
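
On the P.S. in the question: a seeded NumPy generator is one way to draw a different random value for every row while keeping the data reproducible. This is only a minimal sketch; the seed 0 and the names rng, n_rows and sample are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sample data is reproducible (seed 0 is arbitrary).
rng = np.random.default_rng(0)

n_rows = 1000
sample = pd.DataFrame({
    "id": rng.integers(0, 1000, size=n_rows),   # one draw per row, 0..999
    "time_stamp": pd.date_range("2015-01-01 02:00:00", periods=n_rows),
    "value": rng.integers(2, 60, size=n_rows),  # one draw per row, 2..59
})
```

The upper bound of rng.integers is exclusive by default, which matches range(0, 1000) and range(2, 60) in the question.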

print(ndf.head(5))

         end      start   id  value
0 2016-02-20 2016-02-13  569    221
1 2016-03-01 2016-02-23   28    216
2 2016-03-24 2016-03-17  152    258
3 2016-03-31 2016-03-24  892    265
4 2016-04-02 2016-03-26  606    244
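
If calling apply once per window becomes slow, the same sums could probably be computed in one pass with NumPy broadcasting. The sketch below is an alternative to the approach above, not a replacement for it: window_sums, demo, starts and ends are hypothetical names, and it assumes the same strict "start < time_stamp < end" rule. It also works when windows overlap (as 3-24-16 and 3-26-16 do in the question) and returns 0 for a window with no rows.

```python
import numpy as np
import pandas as pd

def window_sums(frame, starts, ends):
    """Sum frame['value'] over rows with start < time_stamp < end, per window."""
    ts = frame["time_stamp"].to_numpy()    # shape (n_rows,)
    vals = frame["value"].to_numpy()
    s = pd.DatetimeIndex(starts).to_numpy()  # shape (n_windows,)
    e = pd.DatetimeIndex(ends).to_numpy()
    # inside[i, j] is True when row i falls strictly inside window j.
    inside = (ts[:, None] > s[None, :]) & (ts[:, None] < e[None, :])
    # Zero out values outside each window, then add up each column.
    sums = (vals[:, None] * inside).sum(axis=0)
    return pd.DataFrame({"start": s, "end": e, "weekly_sum": sums})

# Small illustrative data set: daily rows at 02:00 with values 1..20.
demo = pd.DataFrame({
    "time_stamp": pd.date_range("2016-02-10 02:00:00", periods=20),
    "value": np.arange(1, 21),
})
starts = pd.to_datetime(["2016-02-13", "2016-02-15"])
ends = starts + pd.Timedelta(days=7)

result = window_sums(demo, starts, ends)
print(result)
```

The intermediate boolean matrix has one entry per (row, window) pair, so this trades memory for speed; for 1000 rows and 21 windows that is small.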
