Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
python - Sklearn metrics for regression differ depending on the evaluation method. How to get similar scores

I'm trying to get the "same" metrics using an RFECV and a cross_val_score method. The second method comes in because it's really important for me to get metrics together with their standard deviation (uncertainties are valuable).

This is the regression model:

from sklearn.linear_model import Lasso

regression = Lasso(alpha=0.1,
                   selection="random",
                   max_iter=10000,
                   random_state=42)

The RFECV method:

from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

min_number_features = df.shape[0] // 10
rfecv = RFECV(estimator=regression,
              step=1,
              min_features_to_select=min_number_features,
              cv=KFold(n_splits=10,
                       shuffle=True,
                       random_state=42),
              scoring='neg_mean_squared_error')

rfecv.fit(X_train, target_train)
score = rfecv.score(X_train, target_train)

On average, it gives an RMSE of 0.84. The cross_val_score method is the following:

from sklearn.model_selection import KFold, cross_val_score

mean = target_train.mean()  # normalization constant (the target mean)
metrics = {}
metrics_cross_val_score = [
    "neg_root_mean_squared_error",
    "neg_mean_squared_error",
    "r2",
    "explained_variance",
    "neg_mean_absolute_error",
    "max_error",
    "neg_median_absolute_error",
]
for m in metrics_cross_val_score:
    score = cross_val_score(regression,
                            X_train,
                            target_train,
                            cv=KFold(n_splits=10,
                                     shuffle=True,
                                     random_state=42),
                            scoring=m)
    # note: the sign flip below is only appropriate for the "neg_*" scorers;
    # it also inverts the sign of "r2" and "explained_variance"
    score = [-score.mean() / mean, score.std() / mean]

    metrics[m] = round(score[0], 2)
    dev = "std_" + m
    metrics[dev] = round(score[1], 2)

For the second method, I normalize every metric by the mean of the target (in an attempt to get a from-0-to-1 score). The results tend not to be exactly the same as with the first method, although the RFECV RMSE falls within the interval of the cross_val_score RMSE ± its standard deviation (which is quite big, and not good).
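The normalization choices under discussion can be sketched with a hypothetical normalized_rmse helper (the function name and the example values are illustrative, not from the original post):

```python
import numpy as np

def normalized_rmse(y_true, y_pred, method="mean"):
    """RMSE divided by a scale statistic of y_true (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    scale = {
        "mean": y_true.mean(),                         # NRMSE by the mean
        "range": y_true.max() - y_true.min(),          # by y_max - y_min
        "iqr": np.percentile(y_true, 75) - np.percentile(y_true, 25),  # by quantiles
    }[method]
    return rmse / scale

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
# every residual is 0.5, so RMSE == 0.5 and, e.g.,
# normalized_rmse(y_true, y_pred, "mean") == 0.5 / 6.0
```

The range- and IQR-based variants are less sensitive than the mean to targets centered near zero, which is one common argument for preferring them.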

So, here come the questions:

  • I have read about many ways of normalizing the RMSE (by the mean, by y_max - y_min, by quantiles...), and I don't yet know the best approach for my data. Does anyone have a bright recommendation for that?

  • The RFECV works with the selected features, while cross_val_score works with all the features. If cross_val_score is run on the very same columns that RFECV selects, its RMSE gets dramatically worse, and that really puzzles me.
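The second point can be reproduced with a sketch that feeds cross_val_score exactly the columns RFECV kept, via the support_ mask (toy data; the Lasso and KFold settings are assumptions mirroring the post):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=120, n_features=10, n_informative=4,
                       noise=0.1, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
reg = Lasso(alpha=0.1, max_iter=10000, random_state=42)

rfecv = RFECV(estimator=reg, step=1, cv=cv, scoring="neg_mean_squared_error")
rfecv.fit(X, y)

# Keep only the columns RFECV selected (support_ is a boolean mask)
X_selected = X[:, rfecv.support_]          # equivalently: rfecv.transform(X)
scores = cross_val_score(reg, X_selected, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse_mean, rmse_std = -scores.mean(), scores.std()
print(f"RMSE {rmse_mean:.3f} +/- {rmse_std:.3f} on {X_selected.shape[1]} features")
```

Even with identical columns the numbers need not match: RFECV's reported score comes from folds that also drove the feature selection, which tends to be optimistic compared to a fresh cross_val_score evaluation on the already-selected columns.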

Here is a comparison between the RFECV RMSE (alg_score) and the cross_val_score metrics with their standard deviations (everything else). [Image: metric comparison chart omitted.]

I hope I've made myself clear. If you're curious, here is a dashboard with everything related to this: https://datastudio.google.com/s/gUKsAyZfI5I



1 Answer

Waiting for an expert to answer this one.
