Background Info: I have a dataframe df
which has various columns out of which the focus is the column called 'genres'
Goal:
The problem can be seen in this image, there are entries where duplicates are found for example '[Drama, Romance]' and '[Romance, Drama]' are the same thing
- Now the objective is to remove/replace duplicates programmatically such that the variation is replaced with its equivalent form such as this.
Example:
'[Drama, Romance]' and '[Romance, Drama]'
Now [Romance, Drama] is replaced with [Drama, Romance] or vice versa instead of complete removal we just replace the list's content
Output - Before Replacing Duplicates '[Drama, Romance]' and '[Romance, Drama]'
Expected Output - After Replacing Duplicates '[Drama, Romance]'
- Filter
df
column 'genres' to only include genres with list entries not exceeding 3 genres for example remove any rows with more than 3 genres. Example of acceptable result in 'genres' column:
- [Romance, Drama, Comedy]
- [Romance, Drama]
- [Drama]
I have tried the following:
#to delist the 'genres' column
df['genres'] = df.genres.apply(', '.join)
# code sample of manually replaced duplicated content in genres column
df['genres'] = df['genres'].str.replace("Romance, Drama","Drama, Romance")
df['genres'] = df['genres'].str.replace("Drama, Comedy","Comedy, Drama")
The above code works but it is done manually for individual duplicates so I want to find a way to code this for all of the duplicates found in 'genres' column of df
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…