I am trying out machine learning in Python to predict future values. My data (X1, ..., X8, Y) can be seen in the attached figure.
[Figure: Description of my data]
For testing, I started with sklearn's RandomForestRegressor because the value I am trying to predict is a float. My input data is originally a mix of data types (strings, floats, integers and True/False values), all of which I have converted to numbers: each string is represented by a unique integer, and each True/False value by 1 or 0.
The examples I find online usually use either numbers (regression problems?) or strings (classification problems?).
Is this the correct approach for mixed input data types?
I am grateful for any input.
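To make the conversion described above concrete, here is a minimal sketch of one way it could be done with pandas. The frame `raw`, its column names and its values are hypothetical and only stand in for my real data; `pd.factorize` is one possible way to give each distinct string its own integer, not necessarily the only or best one.

```python
import pandas as pd

# Hypothetical raw rows; column names and values are made up for illustration.
raw = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],   # string feature
    "is_active": [True, False, True, True],     # True/False feature
    "size": [1.5, 2.0, 0.7, 3.1],               # float feature, left unchanged
})

encoded = raw.copy()
# Each distinct string gets one integer code, in order of first appearance.
encoded["color"], color_labels = pd.factorize(raw["color"])
# True/False becomes 1/0.
encoded["is_active"] = raw["is_active"].astype(int)

print(encoded)
print(list(color_labels))  # the strings behind codes 0, 1, 2, ...
```

Keeping `color_labels` around makes it possible to map the integer codes back to the original strings later.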
'''
X1,X2,X3,X4,X5,X6,X7,X8,Y
93,150,18,10,63,641.1024566,9,0,49.87777112
93,371,19,3,62,641.1024566,1,0,48.85200719
93,150,19,4,62,641.1024566,12,1,41.67165968
93,196,19,6,62,641.1024566,11,1,47.1851408
93,416,19,9,414,641.1024566,5,1,46.67225884
93,196,19,9,375,647.0940683,7,0,43.35530258
93,416,19,10,428,641.1024566,1,1,46.80047933
93,196,19,10,430,641.1024566,6,0,50.19832235
93,196,19,11,579,629.1192331,4,1,46.55482325
93,416,20,2,422,641.1024566,3,1,48.21090473
93,196,20,3,429,641.1024566,10,1,47.95446375
93,150,20,3,429,641.1024566,11,1,48.08268424
93,196,20,4,430,641.1024566,12,1,47.69802277
93,196,20,5,427,641.1024566,11,1,46.99281007
93,196,20,5,424,641.1024566,10,1,47.31336129
93,206,20,6,6,641.1024566,2,1,47.1851408
93,196,20,6,427,491.312163,11,1,35.66926303
93,196,20,9,430,641.1024566,4,1,47.24925105
93,416,20,8,362,641.1024566,8,1,48.08268424
'''
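The code that follows uses `df1` and `target_column` without showing how they are created. As a sketch only, the first rows of the sample above could be loaded like this, assuming the data is available as comma-separated text; the name `sample_csv` and the choice of `['Y']` as a one-element list are assumptions for illustration.

```python
import io

import pandas as pd

# First four rows of the sample, pasted as CSV text (assumption: the real
# data would come from a file with the same header).
sample_csv = """X1,X2,X3,X4,X5,X6,X7,X8,Y
93,150,18,10,63,641.1024566,9,0,49.87777112
93,371,19,3,62,641.1024566,1,0,48.85200719
93,150,19,4,62,641.1024566,12,1,41.67165968
93,196,19,6,62,641.1024566,11,1,47.1851408
"""

df1 = pd.read_csv(io.StringIO(sample_csv))
target_column = ["Y"]  # assumption: the target is kept as a one-element list

print(df1.shape)
print(df1.dtypes)
```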
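# Imports needed by the snippet below
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Assumed to exist already: df1 (data with known Y), df2 (data to predict for)
# and target_column, a one-element list such as ['Y']
# Normalize input values by the column-wise maximum of df1
predictors = [col for col in df1.columns if col not in target_column]  # keeps column order
maximumPredictor = df1[predictors].max()
df1[predictors] = df1[predictors] / maximumPredictor
print(df1.describe().transpose())
df2[predictors] = df2[predictors] / maximumPredictor
print(df2.describe().transpose())
X = df1[predictors].values
y = df1[target_column].values
X_predict = df2[predictors].values
# Split data to evaluate the model with a portion of input data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
# This is the regressor
regressor = RandomForestRegressor(n_estimators=50)
# Train regressor (ravel() flattens the (n, 1) target array to 1-D)
regressor.fit(X_train, y_train.ravel())
# Make a prediction from test data
y_pred = regressor.predict(X_test)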
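The split above is meant for evaluating the model, so here is a minimal, self-contained sketch of how the held-out predictions could be scored. The data is synthetic (random numbers standing in for my eight predictors, which is an assumption made only so the example runs on its own), and the choice of mean absolute error and R² as metrics is also mine.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the real data set):
# 200 rows, 8 predictors, and a float target that depends on the first one.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = 40 + 10 * X[:, 0] + rng.normal(0, 0.5, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Compare predictions with the held-out true values.
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}, R^2: {r2:.3f}")
```

With the real data, `X` and `y` would come from `df1` as in the code above, and the same two metric calls would apply to `y_test` and `y_pred`.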
question from:
https://stackoverflow.com/questions/66063190/machine-learning-structure-of-input-data