Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - RandomForestClassifier is throwing error: one field contains comma-separated values

I am trying to fit a RandomForestClassifier, like this.

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(col_trans, rf_classifier)
pipe.fit(X_train, y_train)

I'm getting this error:

ValueError: Found unknown categories ['4G, 4G LAA, 5G NR', '4G,4G CBRS,5G FIXED'] in column 3 during transform

The field named 'technology_type' contains comma-separated values, like this: 4G, 5G, NR

How can I handle these comma-separated values? I suppose I could drop that field, but I really want to include it as an independent variable in X.

Here is all of my code.

import pandas as pd

# conn_connection is an existing database connection (not shown here)
df_fuze = pd.read_sql("""select * from fuze""", conn_connection)

# copy features to new DF
fuze = df_fuze[['territory',
        'submarket',
        'local_market',
        'technology_type',
        'project_type',
        'modification_type',
        'objective',
        'construction_completed_days']]

fuze.head()

# set dependent variable
y = fuze['construction_completed_days']

# set the independent variables
# pass the column by keyword; the positional axis argument was removed in pandas 2.0
X = fuze.drop(columns='construction_completed_days')

seed = 50  # so that the result is reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.333, random_state = seed)


X_train = X_train.fillna('na')
X_test = X_test.fillna('na')

features_to_encode = list(X_train.select_dtypes(include = ['object']).columns) 
# Or alternatively, 
# features_to_encode = X_train.columns[X_train.dtypes==object].tolist()

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
col_trans = make_column_transformer(
                        (OneHotEncoder(),features_to_encode),
                        remainder = "passthrough"
                        )

from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(
                      min_samples_leaf=50,
                      n_estimators=150,
                      bootstrap=True,
                      oob_score=True,
                      n_jobs=-1,
                      random_state=seed,
                      max_features='sqrt')  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(col_trans, rf_classifier)
pipe.fit(X_train, y_train)

The error occurs when I try to fit the pipeline on X_train and y_train.

I am following the example from here.

https://towardsdatascience.com/my-random-forest-classifier-cheat-sheet-in-python-fedb84f8cf4f

Question from: https://stackoverflow.com/questions/65930597/randomforestclassifier-is-throwing-error-one-field-contains-comma-separated-val


1 Answer

Assume that you have this dataset:

import pandas as pd

data = pd.DataFrame({'product_code': ['1', '2', '3', '4'],
                 'technology_type': ['4G, 4G LAA, 5G NR',
                            '4G,4G CBRS,5G FIXED',
                            '4G, 5G, NR',
                            '4G, NR']},
                columns=['product_code', 'technology_type'])

Output:

product_code    technology_type
1               4G, 4G LAA, 5G NR
2               4G,4G CBRS,5G FIXED
3               4G, 5G, NR
4               4G, NR

First, reshape the data so that each row holds a single technology_type value instead of a comma-separated list.

cleaned = data.set_index('product_code').technology_type.str.split(',', expand=True).stack()

Output:

product_code   
1             0          4G
              1      4G LAA
              2       5G NR
2             0          4G
              1     4G CBRS
              2    5G FIXED
3             0          4G
              1          5G
              2          NR
4             0          4G
              1          NR
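
One detail the right-aligned output above hides: splitting on ',' alone keeps the space that follows each comma, so some values are ' 4G LAA' rather than '4G LAA'. A quick check, reusing the sample data from this answer, makes that visible:

```python
import pandas as pd

data = pd.DataFrame({'product_code': ['1', '2', '3', '4'],
                     'technology_type': ['4G, 4G LAA, 5G NR',
                                         '4G,4G CBRS,5G FIXED',
                                         '4G, 5G, NR',
                                         '4G, NR']})

# Same split-and-stack step as above
cleaned = data.set_index('product_code').technology_type.str.split(',', expand=True).stack()

# Printing the values as a list shows the leading spaces
print(cleaned.loc['1'].tolist())  # ['4G', ' 4G LAA', ' 5G NR']
```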

Then you can apply get_dummies(), sum the indicators per product_code, and merge the result back into your data.

technology_type_dummies = pd.get_dummies(cleaned).groupby(level=0).sum()
newData = data.merge(technology_type_dummies, left_on='product_code', right_index=True)

Output:

product_code  technology_type       4G LAA  5G  5G NR  NR  4G  4G CBRS  5G FIXED
1             4G, 4G LAA, 5G NR     1       0   1      0   1   0        0
2             4G,4G CBRS,5G FIXED   0       0   0      0   1   1        1
3             4G, 5G, NR            0       1   0      1   1   0        0
4             4G, NR                0       0   0      1   1   0        0

Splitting on ',' leaves a leading space on some values (for example ' 4G LAA'), so strip whitespace from the start and end of the column names:

newData.columns = newData.columns.str.strip()
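
If the same technology can appear both with and without a leading space (for example ' NR' and 'NR'), stripping only the column names would leave two columns with the same name. Here is a minimal sketch of one way around that, assuming it is acceptable to strip each value before encoding. The names tokens, dummies and new_data are illustrative and not part of the original answer:

```python
import pandas as pd

data = pd.DataFrame({'product_code': ['1', '2', '3', '4'],
                     'technology_type': ['4G, 4G LAA, 5G NR',
                                         '4G,4G CBRS,5G FIXED',
                                         '4G, 5G, NR',
                                         '4G, NR']})

# Split each cell into a list, give every token its own row,
# then strip the whitespace left over from ", " separators.
tokens = (data.set_index('product_code')['technology_type']
              .str.split(',')
              .explode()
              .str.strip())

# One indicator column per distinct token, summed back to one row per product_code
dummies = pd.get_dummies(tokens).groupby(level=0).sum()

new_data = data.merge(dummies, left_on='product_code', right_index=True)
print(list(dummies.columns))
```

With this variant the column names are already clean, so the columns.str.strip() step is no longer needed.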

Then you can drop the original technology_type column. The dummy columns are integers, so select_dtypes(include=['object']) in your code will not add them to features_to_encode, and they pass through the column transformer unchanged.
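
As a side note on the original error: OneHotEncoder raises "Found unknown categories" when transform sees a value that was absent during fit, which happens easily when every distinct comma-separated combination counts as its own category. Beyond splitting the column, a second safeguard you could consider is handle_unknown='ignore', which encodes unseen values as all zeros. The sketch below uses made-up train/test frames for illustration, and sparse_threshold=0 is only there to force a dense array that is easy to inspect:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values): 'south' appears only in the test frame
train = pd.DataFrame({'territory': ['east', 'west', 'east'],
                      '4G': [1, 1, 1],
                      'NR': [1, 0, 1]})
test = pd.DataFrame({'territory': ['south'], '4G': [1], 'NR': [0]})

col_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['territory']),
    remainder='passthrough',   # the integer dummy columns are passed through as-is
    sparse_threshold=0,        # always return a dense array
)

train_encoded = col_trans.fit_transform(train)
test_encoded = col_trans.transform(test)   # no ValueError for the unseen 'south'

# Columns are [east, west, 4G, NR]; the unseen territory is encoded as [0, 0]
print(test_encoded.tolist())
```

It also helps to build the dummy columns before calling train_test_split, so that the training and test sets end up with the same set of columns.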

