
(pandas) Drop duplicates based on subset where order doesn't matter

What is the proper way to go from this df:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})
>>> df
      a     b
0  jeff   bob
1   bob  jeff
2  jill  mike

To this:

>>> df2
      a     b
0  jeff   bob
2  jill  mike

where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to which column each item sits in.

I can hack together a solution using a lambda expression to build an order-insensitive key column and then drop duplicates based on it, but I'm thinking there has to be a simpler way than this:

>>> df['c'] = df[['a', 'b']].apply(lambda x: ''.join(sorted((x.iloc[0], x.iloc[1]), key=lambda s: s[0]) +
...                                                  sorted((x.iloc[0], x.iloc[1]), key=lambda s: s[1])), axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:, :-1]
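For what it's worth, the key column the hack builds (inspected before the drop) comes out identical for rows 0 and 1, which is why drop_duplicates catches them:

>>> df['c']
0      bobjeffjeffbob
1      bobjeffjeffbob
2    jillmikejillmike
Name: c, dtype: object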

1 Answer


I think you can sort each row independently and then use duplicated to see which ones to drop.

# Sort each row's values and turn them into a hashable tuple, then flag repeats.
dupes = df.apply(lambda x: tuple(sorted(x)), axis=1).duplicated()
df[~dupes]
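Applied to the example frame, row 1 is flagged as a duplicate and the result matches the desired df2:

>>> df[~dupes]
      a     b
0  jeff   bob
2  jill  mike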

A faster way to get dupes, thanks to @DSM: transpose first so that sorted is applied to each column of df.T (i.e., each row of df), and wrap the result in a tuple so it can be hashed.

dupes = df.T.apply(lambda col: tuple(sorted(col))).duplicated()
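If speed matters on larger frames, here is a fully vectorized sketch of my own (not part of the original answer; it assumes the columns hold mutually comparable values such as strings): sort the underlying NumPy array row-wise in one call, rebuild a frame, and hash whole rows.

import numpy as np
import pandas as pd

# Sort every row of the raw array, then deduplicate on the sorted rows.
sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)
dupes = sorted_rows.duplicated()
result = df[~dupes]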
