Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
119 views
in Technique[技术] by (71.8m points)

python - Replacing Duplicates in a column of list type in Pandas

Background Info: I have a dataframe df which has various columns out of which the focus is the column called 'genres'

Goal:

The problem can be seen in this image, there are entries where duplicates are found for example '[Drama, Romance]' and '[Romance, Drama]' are the same thing

  1. Now the objective is to remove/replace duplicates programmatically such that the variation is replaced with its equivalent form such as this.

Example:

'[Drama, Romance]' and '[Romance, Drama]'

Now [Romance, Drama] is replaced with [Drama, Romance] or vice versa instead of complete removal we just replace the list's content

Output - Before Replacing Duplicates '[Drama, Romance]' and '[Romance, Drama]'

Expected Output - After Replacing Duplicates '[Drama, Romance]'

  1. Filter df column 'genres' to only include genres with list entries not exceeding 3 genres for example remove any rows with more than 3 genres. Example of acceptable result in 'genres' column:
  • [Romance, Drama, Comedy]
  • [Romance, Drama]
  • [Drama]

I have tried the following:

#to delist the 'genres' column
df['genres'] = df.genres.apply(', '.join)

# code sample of manually replaced duplicated content in genres column
df['genres'] = df['genres'].str.replace("Romance, Drama","Drama, Romance")
df['genres'] = df['genres'].str.replace("Drama, Comedy","Comedy, Drama")

The above code works but it is done manually for individual duplicates so I want to find a way to code this for all of the duplicates found in 'genres' column of df


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Assuming list data type for each row within the column:

  1. You can first sort the list per row with sorted

  2. Then filter the rows of the dataframe with loc and get the value_counts()

    df['genres'] = df['genres'].apply(lambda x: sorted(x))
    df.loc[df['genres'].apply(lambda x: len(x) <= 3), 'genres'].value_counts() 
    

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...