What is the proper way to go from this df:
>>> df=pd.DataFrame({'a':['jeff','bob','jill'], 'b':['bob','jeff','mike']})
>>> df
      a     b
0  jeff   bob
1   bob  jeff
2  jill  mike
To this:
>>> df2
      a     b
0  jeff   bob
2  jill  mike
where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to their specific column. Rows 0 and 1 hold the same pair of names, just swapped between the columns, so only the first of them should survive.
I can hack together a solution using a lambda expression to build a helper key column and then drop duplicates based on that column, but I'm thinking there has to be a simpler way than this:
>>> df['c'] = df[['a', 'b']].apply(
...     lambda x: ''.join(sorted((x['a'], x['b']), key=lambda s: s[0])
...                       + sorted((x['a'], x['b']), key=lambda s: s[1])),
...     axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:,:-1]
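For comparison, here is a minimal sketch of one possible simpler route, not necessarily the canonical one: sort the values within each row with NumPy so that swapped pairs produce the same key, then use `duplicated` on those keys to filter the original frame. The name `keys` is made up for illustration, and the sketch assumes both columns hold mutually comparable values (plain strings here, no NaN).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'],
                   'b': ['bob', 'jeff', 'mike']})

# Sort the two values inside each row, so ('bob', 'jeff') and
# ('jeff', 'bob') end up as the same row in the helper frame.
# `keys` is a hypothetical helper name, not part of the original post.
keys = pd.DataFrame(np.sort(df[['a', 'b']].to_numpy(), axis=1),
                    index=df.index)

# duplicated() flags every repeat of a key after its first occurrence;
# negating the mask keeps only the first row of each unordered pair.
df2 = df[~keys.duplicated(keep='first')]
print(df2)
```

This leaves `df` untouched and never adds a temporary column to it, so nothing has to be sliced off afterwards.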