Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
351 views
in Technique[技术] by (71.8m points)

python - How to merge two dataframe with column in overlapping datetime range

Following on from this question How to join two dataframes for which column values are within a certain range?. But my datetime range column may overlap with each other.

For example:

>>>df_1

  timestamp              A           B
0 2019-07-14 05:31:00    0.020228    0.026572
1 2019-07-14 06:32:00    0.057780    0.175499
2 2019-07-14 07:02:00    0.076623    0.875499

>>>df_2

  start                  end                    event
0 2019-07-14 05:30:00    2019-07-14 06:30:00    E1
1 2019-07-14 06:00:00    2019-07-14 07:00:00    E2
2 2019-07-14 06:30:01    2019-07-14 07:30:00    E3
3 2019-07-14 07:30:01    2019-07-14 08:30:00    E4

I want to find A of df_1 in its interval for df_2. The result I expect are as follows:

  start                  end                    event timestamp             A
0 2019-07-14 05:30:00    2019-07-14 06:30:00    E1    2019-07-14 05:31:00    0.020228
1 2019-07-14 06:00:00    2019-07-14 07:00:00    E2    2019-07-14 06:32:00    0.057780
2 2019-07-14 06:30:01    2019-07-14 07:30:00    E3    2019-07-14 06:32:00    0.057780
3 2019-07-14 06:30:01    2019-07-14 07:30:00    E3    2019-07-14 07:02:00    0.076623

I followed the answers in the links above, but I didn't the pandas method to achieve my goal. The following mistakes occurred when I tried to use the answer with the highest votes.

KeyError: 'indexer does not intersect a unique set of intervals'

Can anybody help me? Thanks in advance.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

it can also be done using numpy broadcast and boolean indexing like below

## load sample data
df1 = pd.DataFrame([('0', '2019-07-14 05:31:00', '0.020228', '0.026572'), ('1', '2019-07-14 06:32:00', '0.057780', '0.175499'), ('2', '2019-07-14 07:02:00', '0.076623', '0.875499')], columns=('id', 'timestamp', 'A', 'B'))
df2 = pd.DataFrame([('0', '2019-07-14 05:30:00', '2019-07-14 06:30:00', 'E1'), ('1', '2019-07-14 06:00:00', '2019-07-14 07:00:00', 'E2'), ('2', '2019-07-14 06:30:01', '2019-07-14 07:30:00', 'E3'), ('3', '2019-07-14 07:30:01', '2019-07-14 08:30:00', 'E4')], columns=('id', 'start', 'end', 'event'))

df1["timestamp"] = pd.to_datetime(df1["timestamp"])
df2["start"] = pd.to_datetime(df2["start"])
df2["end"] = pd.to_datetime(df2["end"])

Solution

## df2[["start"]] is a column vector of size m and df1.timestamp.values is row 
## vector of size n then broad cast will result matrix of shape m,n which is 
## result of comparing each pair of m and n
compare = (df2[["start"]].values<df1.timestamp.values) & (df2[["end"]].values>df1.timestamp.values)

## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(df1)*len(df2))[compare.ravel()]


## calculate row and column index from cell number
pd.concat([df2.iloc[ind//len(df1)].reset_index(drop=True), df1.iloc[ind%len(df1)].reset_index(drop=True)], axis=1, sort=False)

Result

    start               end                event    timestamp             A  B
0   2019-07-14 05:30:00 2019-07-14 06:30:00 E1  2019-07-14 05:31:00 0.020228    0.026572
1   2019-07-14 06:00:00 2019-07-14 07:00:00 E2  2019-07-14 06:32:00 0.057780    0.175499
2   2019-07-14 06:30:01 2019-07-14 07:30:00 E3  2019-07-14 06:32:00 0.057780    0.175499
3   2019-07-14 06:30:01 2019-07-14 07:30:00 E3  2019-07-14 07:02:00 0.076623    0.875499

Edit
in response to comment from @baccandr here is some more explanation how the indexing working.

After comparison we get the compare matrix with boolean values like below

array([[ True, False, False],
       [False,  True, False],
       [False,  True,  True],
       [False, False, False]])
  • you can think of this matrix as table with column representing index of df1 which is (0,1,2) and rows representing index of df2 which is (0,1,2,3)
  • The value in cell is True if corresponding row in df1 and df2 mets the condition e.g row 0 of df1 and row 0 of df2; row 2 of df1 and row 1 of df2 meets the condition
  • To find rows meeting the condition from df1 and df2 separately we can use compare as index directly like df1[compare.T] and df2[compare] but it will not give rows in correct order for pairing. This will give row in increasing order of index for both which will not always be true.
  • Now our aim here is to get indxes of both dataframes which meets the condition, in correct order. so what we need is index of df1 [0, 1, 2, 2] and index of df2 [0,1,1,2]. Using those indexes in df1 and df2 we get the match in correct order from both
  • so what we did here is count cells of matrix from left to right then down then again left two right giving a unique number to each cell. Then filter the cells where conditions meets and convert that to index for df1 and df2.

As the matrix represents index for 2 dataframes in matrix format I think np.where may not work. Another way we can do it is use compare as index for df2 and only find index of df1 like below

this will repeat index of df1 for each row of df2 and find the index of df1 in order with df2

ind_df1 = np.tile(np.arange(len(df1)), len(df2))[compare.ravel()]
pd.concat([df2[compare].reset_index(drop=True), df1.iloc[ind_df1].reset_index(drop=True)], axis=1, sort=False)

I hope this makes it clear, if you have some other idea then I would love to see it in comment or as answer


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...