Assume that you have this dataset:
import pandas as pd
data = pd.DataFrame({'product_code': ['1', '2', '3', '4'],
'technology_type': ['4G, 4G LAA, 5G NR',
'4G,4G CBRS,5G FIXED',
'4G, 5G, NR',
'4G, NR']},
columns=['product_code', 'technology_type'])
Output:
product_code technology_type
1 4G, 4G LAA, 5G NR
2 4G,4G CBRS,5G FIXED
3 4G, 5G, NR
4 4G, NR
First, reshape the data so that each row contains a single technology_type category.
cleaned = data.set_index('product_code').technology_type.str.split(',', expand=True).stack()
Output:
product_code
1             0    4G
              1    4G LAA
              2    5G NR
2             0    4G
              1    4G CBRS
              2    5G FIXED
3             0    4G
              1    5G
              2    NR
4             0    4G
              1    NR
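Note that splitting on ',' keeps the space that follows each comma, so a value such as ' NR' is a different string from 'NR'. If both spellings appeared in the data, they would become separate dummy columns, and stripping the column names afterwards would leave duplicate names. A minimal sketch of one way to avoid that, assuming you want such values treated as the same category, is to strip the values before encoding. The name `cleaned_stripped` is only illustrative, and the `dropna()` call is a precaution for pandas versions where `stack()` keeps missing values.

```python
import pandas as pd

data = pd.DataFrame({
    'product_code': ['1', '2', '3', '4'],
    'technology_type': ['4G, 4G LAA, 5G NR',
                        '4G,4G CBRS,5G FIXED',
                        '4G, 5G, NR',
                        '4G, NR'],
})

# Split into one value per row, drop the padding left by shorter lists,
# then strip the whitespace around every value.
cleaned_stripped = (data.set_index('product_code')
                        .technology_type
                        .str.split(',', expand=True)
                        .stack()
                        .dropna()
                        .str.strip())

print(cleaned_stripped)
```

If you use this variant in the next step, the later strip on the column names becomes unnecessary.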
Then you can apply get_dummies() and merge the result back into your data.
technology_type_dummies = pd.get_dummies(cleaned).groupby(level=0).sum()
newData = data.merge(technology_type_dummies, left_on='product_code', right_index=True)
Output:
product_code  technology_type       4G LAA  5G  5G NR  NR  4G  4G CBRS  5G FIXED
1             4G, 4G LAA, 5G NR     1       0   1      0   1   0        0
2             4G,4G CBRS,5G FIXED   0       0   0      0   1   1        1
3             4G, 5G, NR            0       1   0      1   1   0        0
4             4G, NR                0       0   0      1   1   0        0
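As an alternative to the split, stack, and groupby steps, pandas also provides Series.str.get_dummies(), which splits on a separator and encodes in one call. The sketch below is a different technique from the one above, not a replacement for it. It assumes that normalizing the spacing around commas first is acceptable for your data, and the names `normalized`, `dummies`, and `alt` are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'product_code': ['1', '2', '3', '4'],
    'technology_type': ['4G, 4G LAA, 5G NR',
                        '4G,4G CBRS,5G FIXED',
                        '4G, 5G, NR',
                        '4G, NR'],
})

# Remove whitespace around the commas so '4G, NR' becomes '4G,NR'.
normalized = (data['technology_type']
              .str.replace(r'\s*,\s*', ',', regex=True)
              .str.strip())

# One indicator column per distinct value; the index matches data's index.
dummies = normalized.str.get_dummies(sep=',')

# Join on the shared index instead of merging on product_code.
alt = data.join(dummies)
print(alt)
```

Because the dummy frame keeps the original row index, a plain join is enough here and no groupby is needed.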
Remember to strip leading and trailing whitespace from the column names, because values that followed ", " in the original strings (for example ' NR') keep their leading space.
newData.columns = newData.columns.str.strip()
Then you can drop the technology_type column. The dummy columns have an integer dtype, so they will not end up in features_to_encode in your code.
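Putting the steps together, here is a minimal sketch of the final drop and a dtype check. The names `final` and `dummy_cols` are only illustrative, and the `dropna()` call is a precaution for pandas versions where `stack()` keeps missing values. How features_to_encode is built depends on your own code, so the check below only confirms that the dummy columns are integers.

```python
import pandas as pd

data = pd.DataFrame({
    'product_code': ['1', '2', '3', '4'],
    'technology_type': ['4G, 4G LAA, 5G NR',
                        '4G,4G CBRS,5G FIXED',
                        '4G, 5G, NR',
                        '4G, NR'],
})

# One technology_type value per row.
cleaned = (data.set_index('product_code')
               .technology_type
               .str.split(',', expand=True)
               .stack()
               .dropna())

# Encode, collapse back to one row per product_code, and merge.
technology_type_dummies = pd.get_dummies(cleaned).groupby(level=0).sum()
newData = data.merge(technology_type_dummies,
                     left_on='product_code', right_index=True)
newData.columns = newData.columns.str.strip()

# Drop the original text column now that it is encoded.
final = newData.drop(columns='technology_type')

# Every column except product_code is a dummy column.
dummy_cols = [c for c in final.columns if c != 'product_code']
print(final[dummy_cols].dtypes)
```

Note that product_code itself is still a string column, so whether it is encoded depends on how your code selects features.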