Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - DataFrame algebra in Pandas

Say I have two dataframes, df1 and df2, that I can join on df1_keys and df2_keys.

I would like to do:

  1. (A-B)
  2. (A-B) U (B-A)

with A=df1 and B=df2.

From what I read in the documentation, the how argument for pd.merge supports the following options:

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
        left: use only keys from left frame (SQL: left outer join)
        right: use only keys from right frame (SQL: right outer join)
        outer: use union of keys from both frames (SQL: full outer join)
        inner: use intersection of keys from both frames (SQL: inner join)

but none of them directly gives us set operations 1 and 2 above.

For reference, below is the corresponding reference for SQL (from this thread):

[Image: SQL join reference]



1 Answer


Although these aren't supported directly, they can be achieved by tweaking the indexes before attempting the join...

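You can do set minus with Index.difference. Older pandas versions also accepted the - operator for this, but on current versions - is element-wise subtraction. (The outputs shown here come from an older pandas version; newer versions print Index rather than Int64Index.)

In [11]: ind = pd.Index([1, 2, 3])

In [12]: ind2 = pd.Index([3, 4, 5])

In [13]: ind.difference(ind2)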
Out[13]: Int64Index([1, 2], dtype='int64')

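and set union with Index.union and intersection with Index.intersection (older pandas versions used the | and & operators for these; on current versions they are no longer set operations):

In [14]: ind.union(ind2)
Out[14]: Int64Index([1, 2, 3, 4, 5], dtype='int64')

In [15]: ind.intersection(ind2)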
Out[15]: Int64Index([3], dtype='int64')
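The question's second operation, (A-B) U (B-A), is the symmetric difference of the two indexes. Here is a minimal sketch of how that might look, assuming a pandas version that provides Index.symmetric_difference; the variable names a_minus_b, b_minus_a and sym are only illustrative:

```python
import pandas as pd

ind = pd.Index([1, 2, 3])
ind2 = pd.Index([3, 4, 5])

# (A - B): labels in ind that are not in ind2
a_minus_b = ind.difference(ind2)

# (B - A): labels in ind2 that are not in ind
b_minus_a = ind2.difference(ind)

# (A - B) U (B - A): labels that appear in exactly one of the two indexes
sym = ind.symmetric_difference(ind2)

print(list(a_minus_b))  # [1, 2]
print(list(sym))        # [1, 2, 4, 5]
```

Taking the union of the two differences should give the same labels as symmetric_difference, so either form can be passed to reindex.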

So if you have some DataFrames with these indexes, you can reindex before you join:

In [21]: df = pd.DataFrame(np.random.randn(3), index=ind, columns=['a'])  # ind = df.index

In [22]: df2 = pd.DataFrame(np.random.randn(3), index=ind2, columns=['b'])  # ind2 = df2.index

In [23]: df.reindex(ind.intersection(ind2))
Out[23]:
          a
3  1.368518

So now you can build up whatever join you want:

In [24]: df.reindex(ind.intersection(ind2)).join(df2.reindex(ind.intersection(ind2)))  # equivalent to inner
Out[24]:
          a         b
3  1.368518 -1.335534

In [25]: df.reindex(ind.difference(ind2)).join(df2.reindex(ind.difference(ind2)))  # join on A set minus B
Out[25]:
          a   b
1  1.193652 NaN
2  0.064467 NaN
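If the keys live in ordinary columns rather than in the index, another option is an outer pd.merge with indicator=True, filtering on the resulting _merge column. This is a different technique from the reindexing above, shown only as a sketch; the column names key, a and b and the sample values are made up for illustration:

```python
import pandas as pd

# Hypothetical frames sharing a key column named "key"
df1 = pd.DataFrame({"key": [1, 2, 3], "a": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"key": [3, 4, 5], "b": [0.3, 0.4, 0.5]})

# The outer join keeps every key; indicator=True adds a "_merge" column
# whose value is "left_only", "right_only" or "both" for each row.
merged = df1.merge(df2, on="key", how="outer", indicator=True)

# 1. (A - B): rows whose key appears only in df1
a_minus_b = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# 2. (A - B) U (B - A): rows whose key appears in exactly one frame
sym_diff = merged[merged["_merge"] != "both"].drop(columns="_merge")

print(a_minus_b)
print(sym_diff)
```

Rows that come from only one frame carry NaN in the other frame's columns, just as in the reindex-and-join output above.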
