Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Obtaining test scores separately for each group after running nested-cross validation with LeavePGroupsOut

I am using sklearn.model_selection.LeavePGroupsOut to train a classifier on each of the sites in my dataset and test it on all other sites. My problem: after running the analysis I only obtain a single 'global' test score across all p sites that were used for testing. What I am looking for instead is a way to obtain a test score separately for each site.

Here's an example where I use the breast_cancer data set and create three dummy sites to which the subjects are assigned (note that I gave the groups different sample sizes; see the end of the question for why):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import LeavePGroupsOut
from sklearn.model_selection import cross_validate
from sklearn.datasets import load_breast_cancer

# create a random number generator
rng = np.random.RandomState(42)

# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)

# for this example, only take the first 300 subjects
X = X[0:300,:]
y = y[0:300]

# define dummy sites, let's assume all subjects came from three different sites
# Let's also assume the three sites have different numbers of subjects
groups = np.concatenate((np.repeat('site_1',150),
                         np.repeat('site_2',100),
                         np.repeat('site_3',50)))

# optimize classifier on one site and leave two sites out for testing
n_groups = 2

# z-standardize features
scaler = StandardScaler()

# use linear L2-regularized Logistic Regression as classifier
lr = LogisticRegression(random_state=rng)

# define parameter grid to optimize over (optimize C)
lr_c = np.linspace(start=0.015625,stop=16,num=11,endpoint=True)
p_grid = {'lr__C':lr_c}

# create pipeline
lr_pipe = Pipeline([
    ('scaler',scaler),
    ('lr',lr)
    ])

# define inner and outer folds (use LeavePGroupsOut)
skf_inner = StratifiedKFold(shuffle=True,random_state=rng)
lpgo_outer = LeavePGroupsOut(n_groups=n_groups)

# implement GridSearch (inner cross validation)
grid = GridSearchCV(lr_pipe,
                    param_grid=p_grid,
                    cv=skf_inner,
                    verbose=1,
                    )

# implement cross_validate (outer cross validation)
nested_cv_scores = cross_validate(grid,
                                  X,
                                  y,
                                  groups=groups,
                                  cv=lpgo_outer,
                                  return_train_score=True,
                                  return_estimator=True,
                                  verbose=1
                                  )

Looking at nested_cv_scores['test_score'] gives three test scores: 0.915, 0.945, 0.96. Instead I want to obtain six scores (each of the three sites is used once for training and the two others are used for testing).
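
To make the "three scores versus six scores" point concrete, here is a minimal sketch that only lists which sites end up in the training and test part of each outer split. It reuses the dummy site labels from the example; X_dummy and split_summary are names introduced just for this illustration.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# same dummy site labels as in the example above
groups = np.concatenate((np.repeat('site_1', 150),
                         np.repeat('site_2', 100),
                         np.repeat('site_3', 50)))

# the splitter only needs the number of samples and the group labels,
# so a placeholder feature matrix is enough here (illustration only)
X_dummy = np.zeros((len(groups), 1))

lpgo = LeavePGroupsOut(n_groups=2)

split_summary = []
for train_index, test_index in lpgo.split(X_dummy, groups=groups):
    train_sites = sorted(np.unique(groups[train_index]).tolist())
    test_sites = sorted(np.unique(groups[test_index]).tolist())
    split_summary.append((train_sites, test_sites))
    print('train on', train_sites, '-> test on', test_sites)
```

Each split trains on one site and pools the two remaining sites into a single test set, which is why cross_validate reports one score per split (three in total) rather than one per train/test site pair (six).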

What I already came up with:

My idea so far is to obtain the pipeline object from each of the three final estimators (nested_cv_scores['estimator'][idx].best_estimator_) and to iterate over the LeavePGroupsOut splits again using

for train_index, test_index in lpgo_outer.split(X, y, groups):
    ...

With that, I guess one could recalculate the test scores separately for each site (by calling the predict method and then calculating the test score from y_pred and y_true).

Still, I wonder whether there is a more elegant way to solve the problem. Maybe I have overlooked an alternative to LeavePGroupsOut? Also note that I can't use sklearn.model_selection.cross_val_predict here, which I assume is because the three sites have different sample sizes (when using cross_val_predict instead of cross_validate one gets ValueError: cross_val_predict only works for partitions).

question from:https://stackoverflow.com/questions/65916969/obtaining-test-scores-separately-for-each-group-after-running-nested-cross-valid


1 Answer


This should do the trick for now:

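# imports needed in addition to those in the question
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

site_scores = []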

for idx,(train_index,test_index) in enumerate(lpgo_outer.split(X, y, groups)):
    
    # obtain name of the site that was used for training the classifier
    train_site_name = str(np.unique(groups[train_index])[0])
    
    # obtain the final estimator object for this training site
    train_site_estimator = nested_cv_scores['estimator'][idx].best_estimator_
    
    # obtain the train score for this estimator
    train_site_train_score = nested_cv_scores['train_score'][idx]
    
    # get the features and labels for all the other sites
    X_test,y_test = X[test_index],y[test_index]
    
    # obtain predictions
    y_pred = train_site_estimator.predict(X_test)
    
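    # sanity check: re-score the pooled test fold and make sure it matches
    # nested_cv_scores['test_score']. No scoring argument was passed to
    # cross_validate in the question, so the fold score is the estimator's
    # default score (plain accuracy), not balanced accuracy.
    sanity_check_score = nested_cv_scores['estimator'][idx].score(X_test,y_test)
    
    if not np.isclose(sanity_check_score,nested_cv_scores['test_score'][idx]):
        raise ValueError('Manually calculated test score does not match test score in nested_cv_scores')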
    
    # get an array for the test sites
    test_sites = groups[test_index]
    
    # create a dataframe from y_true,y_pred and names of test sites
    test_sites_df = pd.DataFrame({'y_true':y_test,
                                  'y_pred':y_pred,
                                  'site':test_sites})
    
    # calculate BAC separately for each site
    for name,group in test_sites_df.groupby('site'):
        
        bac = balanced_accuracy_score(group['y_true'],group['y_pred'])
        site_scores.append((train_site_name,train_site_train_score,name,bac))

df = pd.DataFrame(site_scores,columns=['train_site','train_site_score',
                                         'test_site','test_site_score'])
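
If pandas is not otherwise needed, the per-site step of the loop above can also be factored into a small helper that uses boolean masks instead of groupby. This is only a sketch: scores_by_site is a hypothetical name, and the toy labels below are made up purely to show the call.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score


def scores_by_site(y_true, y_pred, sites):
    """Return {site name: balanced accuracy} computed per site (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sites = np.asarray(sites)
    scores = {}
    for site in np.unique(sites):
        mask = sites == site  # select the samples belonging to this site
        scores[str(site)] = balanced_accuracy_score(y_true[mask], y_pred[mask])
    return scores


# made-up toy labels, only to demonstrate the call
toy_scores = scores_by_site(y_true=[0, 1, 0, 1, 1, 0],
                            y_pred=[0, 1, 1, 1, 0, 0],
                            sites=['a', 'a', 'a', 'b', 'b', 'b'])
print(toy_scores)
```

Inside the loop, the equivalent call would be scores_by_site(y_test, y_pred, groups[test_index]), which yields one score per test site for the current training site.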
    
