I have a pickle file with all the features extracted from the raw dataset, and now I am trying to train an XGBoost model on it.
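For context, the dataframe is loaded roughly like this (the file name is a placeholder, not my actual path):

import pandas as pd

# Placeholder path for the pickled feature dataframe.
df = pd.read_pickle("features.pkl")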
In [4]: df.shape
Out[4]: (8474661, 70)
import xgboost as xgb
from sklearn.model_selection import train_test_split

X = df[x_cols]
y = df[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
Out[8]: (6355995, 69) (6355995,)
(2118666, 69) (2118666,)
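To put a number on the footprint before any conversion, the size of the training slice can be checked directly in pandas (assuming the 69 feature columns are float64):

print(X_train.memory_usage(deep=True).sum() / 1024 ** 3, "GiB")  # ≈ 3.3 GiB with float64 columns (plus the index)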
y_train.value_counts()
Out[9]:
0.0    5734377
1.0     621618
dtype: int64
y_test.value_counts()
Out[10]:
0.0    1911460
1.0     207206
dtype: int64
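The split is stratified, and the counts show roughly a 9:1 negative-to-positive imbalance. That is not what causes the error below, but if it matters during training, XGBoost's scale_pos_weight parameter can be derived straight from these counts (a sketch):

counts = y_train.value_counts()
scale_pos_weight = counts[0.0] / counts[1.0]  # 5734377 / 621618 ≈ 9.22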
D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)
Here I am getting a MemoryError:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-11-0a198674af60> in <module>
----> 1 D_train = xgb.DMatrix(X_train, label=y_train)
2 D_test = xgb.DMatrix(X_test, label=y_test)
~\anaconda3\envs\Py-37\lib\site-packages\xgboost\core.py in __init__(self, data, label, missing, weight, silent, feature_names, feature_types, nthread)
378 data, feature_names, feature_types = _maybe_pandas_data(data,
379 feature_names,
--> 380 feature_types)
381
382 data, feature_names, feature_types = _maybe_dt_data(data,
~\anaconda3\envs\Py-37\lib\site-packages\xgboost\core.py in _maybe_pandas_data(data, feature_names, feature_types)
251 feature_types = [PANDAS_DTYPE_MAPPER[dtype.name] for dtype in data_dtypes]
252
--> 253 data = data.values.astype('float')
254
255 return data, feature_names, feature_types
MemoryError: Unable to allocate 3.27 GiB for an array with shape (6355995, 69) and data type float64
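The reported allocation checks out: the pandas path in this version (data.values.astype('float') in the traceback) materialises the entire training set as a dense float64 copy:

6355995 * 69 * 8 / 1024 ** 3  # rows × cols × 8 bytes (float64) ≈ 3.27 GiB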
How should I train a model on this data without running out of memory?

I am using xgboost version 0.90.
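One workaround I am considering (a sketch, not yet verified on this dataset): DMatrix stores values as float32 internally, so building a float32 NumPy array myself and passing that instead of the DataFrame should avoid the float64 copy made by the pandas code path, roughly halving the allocation. Freeing the intermediate frames first should also help. Note that to_numpy needs pandas >= 0.24, and x_cols is assumed to be a list of strings:

import gc
import numpy as np

# Convert to float32 once, instead of letting xgboost cast to float64.
X_train_np = X_train.to_numpy(dtype=np.float32)
X_test_np = X_test.to_numpy(dtype=np.float32)

# Drop the float64 frames before allocating the DMatrix copies.
del df, X, X_train, X_test
gc.collect()

D_train = xgb.DMatrix(X_train_np, label=y_train, feature_names=x_cols)
D_test = xgb.DMatrix(X_test_np, label=y_test, feature_names=x_cols)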