I have a DataFrame of several thousand rows with three columns: geography, response_date, and a True/False in_compliance flag.
import pandas as pd

df = pd.DataFrame({
    "geography": ["Baltimore", "Frederick", "Annapolis", "Hagerstown", "Rockville", "Salisbury", "Towson", "Bowie"],
    "response_date": ["2018-03-31", "2018-03-30", "2018-03-28", "2018-03-28", "2018-04-02", "2018-03-30", "2018-04-07", "2018-04-02"],
    "in_compliance": [True, True, False, True, False, True, False, True],
})
I want to add a column that counts the True values of in_compliance over the four most recent dates in the response_date column within each geography, including the response_date of the row itself.
An example of the desired output:
geography  response_date  in_compliance  Past_4_dates_sum_of_true
Baltimore  2018-03-24     True           1
Baltimore  2018-03-25     False          1
Baltimore  2018-03-26     False          1
Baltimore  2018-03-27     False          1
Baltimore  2018-03-30     False          0
Baltimore  2018-03-31     True           1
Baltimore  2018-04-01     True           2
Baltimore  2018-04-02     True           3
Baltimore  2018-04-03     False          3
Baltimore  2018-04-06     True           3
Baltimore  2018-04-07     True           3
Baltimore  2018-04-08     False          2
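Reading the table above, the expected values are consistent with a window over each geography's four most recent recorded rows (not four calendar days): for 2018-04-06 the value is 3, which only works out if the window reaches back to 2018-04-01. The following is a minimal sketch of that interpretation, not a definitive solution. It assumes one row per geography per date, and the helper column name flag is made up for illustration.

```python
import pandas as pd

# Rows copied from the desired-output table in the question.
df = pd.DataFrame({
    "geography": ["Baltimore"] * 12,
    "response_date": [
        "2018-03-24", "2018-03-25", "2018-03-26", "2018-03-27",
        "2018-03-30", "2018-03-31", "2018-04-01", "2018-04-02",
        "2018-04-03", "2018-04-06", "2018-04-07", "2018-04-08",
    ],
    "in_compliance": [True, False, False, False, False, True,
                      True, True, False, True, True, False],
})

# Parse the dates and sort so that "most recent" follows row order
# within each geography.
df["response_date"] = pd.to_datetime(df["response_date"])
df = df.sort_values(["geography", "response_date"]).reset_index(drop=True)

# Cast True/False to 1/0, then sum over a window of up to four rows
# per geography. min_periods=1 lets the first rows use a shorter window.
df["Past_4_dates_sum_of_true"] = (
    df.assign(flag=df["in_compliance"].astype(int))  # "flag" is a hypothetical helper column
      .groupby("geography")["flag"]
      .transform(lambda s: s.rolling(window=4, min_periods=1).sum())
      .astype(int)
)

print(df)
```

Because transform returns a result aligned to the original index, the new column can be assigned directly without resetting or merging indexes. If several rows can share a date within one geography, they would need to be aggregated per date first.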
I've tried different approaches using groupby and rolling, but the results are not what I expect.
df.groupby('geography').resample('d').sum().fillna(0).groupby('geography').rolling(4, min_periods=1).sum()
This was another approach I took:
df1 = df.groupby(['geography']).apply(lambda x: x.set_index('response_date').resample('1D').first())
df2 = (
    df1.groupby(level=0)['in_compliance']
    .apply(lambda x: x.shift().rolling(min_periods=1, window=4).count())
    .reset_index(name='Past_4_dates_sum_of_true')
)
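One possible reason the attempts above diverge from the desired output, offered as a guess rather than a diagnosis: resampling to daily frequency inserts the missing calendar days, so a window of 4 then spans four calendar days instead of four recorded dates, and count() tallies non-null entries rather than True values. The small sketch below illustrates both effects on four of the Baltimore dates; the variable names are illustrative only.

```python
import pandas as pd

# in_compliance as 1/0 for four recorded dates (04-04 and 04-05 are absent).
s = pd.Series(
    [1, 0, 1, 1],
    index=pd.to_datetime(["2018-04-02", "2018-04-03", "2018-04-06", "2018-04-07"]),
)

# Window over the four most recent recorded dates.
by_rows = s.rolling(window=4, min_periods=1).sum()

# Window over four calendar days: resampling adds 04-04 and 04-05 with sum 0,
# which pushes 04-02 and 04-03 out of the window for 04-07.
daily = s.resample("D").sum()
by_days = daily.rolling(window=4, min_periods=1).sum()

# count() reports how many non-null observations are in the window,
# regardless of whether they are 1 or 0.
counted = s.rolling(window=4, min_periods=1).count()

print(by_rows.iloc[-1], by_days.iloc[-1], counted.iloc[-1])
```

For 2018-04-07 the row-based window sums to 3, the calendar-day window sums to 2, and count() reports 4. The shift() call in the second attempt would additionally exclude the current row from its own window.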
Asked by JamesMiller; translated from Stack Overflow.