Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - SettingWithCopy Warning when Standardizing Only Numeric Columns in Pandas DataFrame with Sklearn

I am getting a SettingWithCopyWarning from pandas when performing the operation below. I understand what the warning means and I know I can turn it off, but I am curious whether I am performing this type of standardization incorrectly on a pandas dataframe (I have mixed data with categorical and numeric columns). My numbers look fine after checking, but I would like to clean up my syntax to make sure I am using pandas correctly.

I am also curious whether there is a better workflow for this type of operation when dealing with data sets that have mixed data types like this.

My process is as follows with some toy data:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from typing import List

# toy data with categorical and numeric data
df: pd.DataFrame = pd.DataFrame([['0',100,'A', 10],
                                ['1',125,'A',15],
                                ['2',134,'A',20],
                                ['3',112,'A',25],
                                ['4',107,'B',35],
                                ['5',68,'B',50],
                                ['6',321,'B',10],
                                ['7',26,'B',27],
                                ['8',115,'C',64],
                                ['9',100,'C',72],
                                ['10',74,'C',18],
                                ['11',63,'C',18]], columns = ['id', 'weight','type','age'])
df.dtypes
id        object
weight     int64
type      object
age        int64
dtype: object

# select categorical data for later operations
cat_cols: List = df.select_dtypes(include=['object']).columns.values.tolist()
# select numeric columns for later operations
numeric_cols: List = df.columns[df.dtypes.apply(lambda x: np.issubdtype(x, np.number))].values.tolist()

# prepare data for modeling by splitting into train and test
# compute the standardization means/standard deviations from the TRAINING SET only
# and apply them to the testing set, to avoid information leakage from the testing set into training
X: pd.DataFrame = df.copy()
y: pd.Series = df.pop('type')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# perform standardization of numeric variables using the mean and standard deviations of the training set only
X_train_numeric_tmp: np.ndarray = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train_numeric_tmp)
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])


<ipython-input-15-74f3f6c70f6a>:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
question from:https://stackoverflow.com/questions/65838015/settingwithcopy-warning-when-standardizing-only-numeric-columns-in-pandas-datafr

1 Answer

Your X_train and X_test are still treated by pandas as slices of the original dataframe. Assigning to a slice triggers the warning, and the assignment may not end up where you expect it.

You can either transform before train_test_split, or call X_train = X_train.copy() (and the same for X_test) after the split and then transform.

The second approach prevents the information leak you mention in your code comments, because the scaler is fit on the training set only. So something like this:

# these two lines don't look right to me:
# X: pd.DataFrame = df.copy()    # X still contains the 'type' label column
# y: pd.Series = df.pop('type')  # pop() mutates df; y = df['type'] is enough

# pass them directly instead
features = [c for c in df if c!='type']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['type'], 
                                                    test_size = 0.2, 
                                                    random_state = 0)

# now copy what we want to transform
X_train = X_train.copy()
X_test = X_test.copy()

## the code below should run without the warning
############
# standardize the numeric columns using the mean and standard deviation of the training set only
# there is no need to copy the data into a temporary array just to fit the scaler
# X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train[numeric_cols])

X_train[numeric_cols] = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = X_train_scaler.transform(X_test[numeric_cols])
