Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
175 views
in Technique[技术] by (71.8m points)

python - How to slice pd.dataframe several times without using for_loop?

Question

  • I would like to slice the dataframe (customer_df) based on other two dataframes (start_time_df and end_time_df) without using for loop.

    • The goal is to know which customers are handled by the certain employee.
    • customer_df: showing the names of customers, locations and the time where and when they bought products.
    • start_time_df & end_time_df: A time sheet of the certain employee's start and end time of work.
  • I have managed to do this using for loop, but if the dataframe gets large, it would take much time using for loop. As such I am looking for how to do this without using the for loop.

  • Example and my solution(for loop ver.) is provided below.

Example

import pandas as pd
from datetime import datetime

customer_df = pd.DataFrame({'customer': ['A','B','C','A','D','E','K','A','D','F','P','J'],
                            'location': ['NY','TX','NY','UT','MA','NV','NY','TX','NY','UT','MA','NV']},
                           index = [datetime(2020,5,1,9), datetime(2020,5,1,11), datetime(2020,5,1,12),
                                    datetime(2020,5,1,18), datetime(2020,5,2,5), datetime(2020,5,2,10), 
                                    datetime(2020,5,2,19), datetime(2020,5,3,2), datetime(2020,5,3,10),
                                    datetime(2020,5,3,18), datetime(2020,5,4,20), datetime(2020,5,4,22)])

start_time_df = pd.DataFrame({'start_time':[datetime(2020,5,1,8), datetime(2020,5,2,8), datetime(2020,5,3,5)]})

end_time_df = pd.DataFrame({'end_time':[datetime(2020,5,1,17), datetime(2020,5,2,17), datetime(2020,5,3,20)]})

customer_df
>>>                  customer  location
2020-05-01 09:00:00      A        NY
2020-05-01 11:00:00      B        TX
2020-05-01 12:00:00      C        NY
2020-05-01 17:00:00      A        UT
2020-05-02 05:00:00      D        MA
2020-05-02 10:00:00      E        NV
2020-05-02 19:00:00      K        NY
2020-05-03 02:00:00      A        TX
2020-05-03 08:00:00      D        NY
2020-05-03 15:00:00      F        UT
2020-05-04 20:00:00      P        MA
2020-05-04 22:00:00      J        NV
  • I have managed to slice the customer_df using for loop, but is there a way to do this without using for loop?
sliced_df_list = []  # to store slices

start_time_series = start_time_df.loc[:,'start_time']
end_time_series = end_time_df.loc[:,'end_time']

for start_time, end_time in zip(start_time_series, end_time_series):
    sliced_df_list.append(customer_df.loc[start_time:end_time,:])
    
pd.concat(sliced_df_list)
>>>                  customer   location
2020-05-01 09:00:00      A         NY
2020-05-01 11:00:00      B         TX
2020-05-01 12:00:00      C         NY
2020-05-02 10:00:00      E         NV
2020-05-03 10:00:00      D         NY
2020-05-03 18:00:00      F         UT
question from:https://stackoverflow.com/questions/65642366/how-to-slice-pd-dataframe-several-times-without-using-for-loop

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

As a preparatory step, compute:

indTpl = pd.IntervalIndex.from_arrays(start_time_df.start_time,
    end_time_df.end_time, closed='neither').to_tuples()

It is a counterpart of:

zip(start_time_series, end_time_series)

but you avoid explicit creation of start_time_series and end_time_series.

Then, to get the result, run:

result = pd.concat([ customer_df.loc[slice(*it)] for it in indTpl ])

This code runs about 25 % faster than yours. For bigger source DataFrames the difference can be more apparent. Check on your own.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...