I am trying to use RandomForestRegressor in Python. I know that numerical columns don't need scaling for tree-based models, since each split simply picks a threshold on a single feature that yields the best information gain, and that choice is invariant to rescaling. However, it seems we still need to convert categorical values to numbers so the model can consume them.
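As a quick sanity check of that scaling claim (on toy data, not the air-quality set): rescaling the features leaves a random forest's predictions unchanged, because splits only depend on the ordering of values within each feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

scale = np.array([1000.0, 0.001])  # positive per-feature rescaling

# same random_state => same bootstraps and feature subsampling in both fits
rf_a = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
rf_b = RandomForestRegressor(n_estimators=50, random_state=0).fit(X * scale, y)

# split thresholds move with the scale, but the partitions (and hence
# the predictions) are identical
print(np.allclose(rf_a.predict(X), rf_b.predict(X * scale)))
```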
I want to compare LabelEncoder and one-hot encoding, and understand why one would be preferred over the other.
I am using the dataset from https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data and trying to predict the PM2.5 value.
My dataframe looks like this:
year month day hour PM2.5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN wd WSPM station
0 2013 3 1 0 4.0 4.0 4.0 7.0 300.0 77.0 -0.7 1023.0 -18.8 0.0 NNW 4.4 Aotizhongxin
1 2013 3 1 1 8.0 8.0 4.0 7.0 300.0 77.0 -1.1 1023.2 -18.2 0.0 N 4.7 Aotizhongxin
2 2013 3 1 2 7.0 7.0 5.0 10.0 300.0 73.0 -1.1 1023.5 -18.2 0.0 NNW 5.6 Aotizhongxin
3 2013 3 1 3 6.0 6.0 11.0 11.0 300.0 72.0 -1.4 1024.5 -19.4 0.0 NW 3.1 Aotizhongxin
4 2013 3 1 4 3.0 3.0 12.0 12.0 300.0 72.0 -2.0 1025.2 -19.5 0.0 N 2.0 Aotizhongxin
First I use one-hot encoding:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error

ohe_df = pd.get_dummies(data=df, columns=["wd", "station"])
y = ohe_df["PM2.5"].values
X = ohe_df.drop(columns=["PM2.5"]).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf_reg = RandomForestRegressor(n_estimators=100,
                               criterion="squared_error",  # spelled "mse" in scikit-learn < 1.0
                               n_jobs=-1,
                               random_state=42)
rf_reg.fit(X_train, y_train)
train_pred_y = rf_reg.predict(X_train)
test_pred_y = rf_reg.predict(X_test)
print(f"train_MAE = {mean_absolute_error(y_train, train_pred_y)}")
print(f"test_MAE = {mean_absolute_error(y_test, test_pred_y)}")
>>>train_MAE = 3.7268903322031877
>>>test_MAE = 10.108332295400825
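For context, here is what each encoding does to the column count, on a toy frame mirroring the real one (the wd values are from the sample above; the station names other than Aotizhongxin are just placeholders). One-hot expands every category into its own 0/1 column, while label encoding keeps one integer column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "PM2.5": [4.0, 8.0, 7.0],
    "wd": ["NNW", "N", "NW"],
    "station": ["Aotizhongxin", "Changping", "Dingling"],
})

# one-hot: each of the 3 wd values and 3 station values gets its own column
ohe_toy = pd.get_dummies(toy, columns=["wd", "station"])
print(ohe_toy.shape[1])  # 1 numeric column + 3 + 3 dummies = 7

# label encoding: the two categorical columns stay as single integer columns
le_toy = toy.copy()
le_toy["wd"] = LabelEncoder().fit_transform(toy["wd"])
le_toy["station"] = LabelEncoder().fit_transform(toy["station"])
print(le_toy.shape[1])  # still 3 columns
```

On the full dataset the one-hot frame has one column per wind direction and per station, so the forest's per-split feature subsampling sees the categories very differently in the two setups.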
Then, reusing the same rf_reg, I train and predict after using LabelEncoder:
le_df = df.copy()
# fit a separate encoder per column so each keeps its own classes_
le_df["wd"] = LabelEncoder().fit_transform(df["wd"])
le_df["station"] = LabelEncoder().fit_transform(df["station"])
y = le_df["PM2.5"].values
X = le_df.drop(columns=["PM2.5"]).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf_reg.fit(X_train, y_train)
train_pred_y = rf_reg.predict(X_train)
test_pred_y = rf_reg.predict(X_test)
print(f"train_MAE = {mean_absolute_error(y_train, train_pred_y)}")
print(f"test_MAE = {mean_absolute_error(y_test, test_pred_y)}")
>>>train_MAE = 3.765413599883373
>>>test_MAE = 10.189870188659498
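To make the difference concrete, a tiny sketch with wd values from the sample above: LabelEncoder maps the directions to integers in alphabetical order, which imposes an ordering (and distances) the categories don't actually have, while get_dummies produces order-free indicator columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

wd = pd.Series(["NNW", "N", "NNW", "NW", "N"], name="wd")

# integer codes follow alphabetical order: 'N' < 'NNW' < 'NW'
codes = LabelEncoder().fit_transform(wd)
print(codes)  # [1 0 1 2 0]

# one 0/1 indicator column per direction, no implied order
print(pd.get_dummies(wd))
```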
From this comparison one-hot encoding seems to perform better, but my question is: is this the right way to compare different encoding methods? And if so, why does label encoding perform worse (even if only slightly) than one-hot encoding?