
(pandas) Drop duplicates based on subset where order doesn't matter

What is the proper way to go from this df:

>>> df=pd.DataFrame({'a':['jeff','bob','jill'], 'b':['bob','jeff','mike']})
>>> df
      a     b
0  jeff   bob
1   bob  jeff
2  jill  mike

To this:

>>> df2
      a     b
0  jeff   bob
2  jill  mike

where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to which column each item appears in.

I can hack together a solution that uses a lambda expression to build a canonical key column and then drops duplicates based on it, but I'm thinking there has to be a simpler way than this:

>>> df['c'] = df[['a', 'b']].apply(
...     lambda x: ''.join(sorted((x[0], x[1]), key=lambda x: x[0])
...                       + sorted((x[0], x[1]), key=lambda x: x[1])),
...     axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:, :-1]

1 Answer


I think you can sort each row independently and then use duplicated to see which ones to drop.

# sort the values within each row (result_type='expand' turns the sorted
# rows back into a DataFrame, which newer pandas versions need here),
# then flag rows whose contents have been seen before
dupes = df.apply(lambda x: x.sort_values().values, axis=1,
                 result_type='expand').duplicated()
df[~dupes]
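
Applied to the example frame above, dupes is [False, True, False], so df[~dupes] keeps rows 0 and 2, as desired.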

A faster way to compute dupes, thanks to @DSM:

# transpose so each original row becomes a column, sort each column with
# the built-in sorted, transpose back, and flag duplicated rows
dupes = df.T.apply(sorted).T.duplicated()
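
For larger frames, a fully vectorized variant of the same idea (a sketch of my own, not part of the original answer) sorts the underlying numpy array directly instead of going through apply:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})

# np.sort along axis=1 canonicalizes each row, so ('jeff', 'bob') and
# ('bob', 'jeff') become identical; duplicated() then flags the repeats
dupes = pd.DataFrame(np.sort(df.values, axis=1), index=df.index).duplicated()
df2 = df[~dupes]

This assumes all compared columns hold mutually comparable values (strings here); sorting the raw array avoids the per-row Python overhead of apply.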
