python - Specific number of test/train size for each class in sklearn

asked Nov 6, 2021 in Technique[技术] by 深蓝 (71.8m points)

Data:

import pandas as pd
data = pd.DataFrame({'classes':[1,1,1,2,2,2,2],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})

My code:

import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X = np.array(data[['b','c']])
y = np.array(data['classes'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=4)

Question:

train_test_split chooses the test set at random from all classes. Is there a way to get the same number of test samples for each class? (For example, two samples from class 1 and two from class 2. Note that the classes do not all have the same number of samples.)

Expected result:

y_test
array([1, 2, 2, 1], dtype=int64)

1 Answer

深蓝 · Answer 1 · 2021-11-06T03:17:47+0000

There is no sklearn function or parameter that does this directly. The stratify parameter of train_test_split samples each class in proportion to its size, which is not what you want, as you indicated in your comment.
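To illustrate the proportional behaviour, here is a minimal sketch. The class sizes of 10, 20 and 30 and the placeholder feature column are made up for illustration. With stratify=y and test_size=12, the test set should contain 2, 4 and 6 samples from the three classes rather than an equal number from each.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (assumed): three classes with 10, 20 and 30 samples.
y = np.repeat([1, 2, 3], [10, 20, 30])
X = np.arange(60).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps each class's share of the data in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=12, stratify=y, random_state=0)

# Count how many test samples each class received.
labels, counts = np.unique(y_test, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)
```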

You can build a custom function instead. It is slower than a built-in splitter would be, but not slow in absolute terms. Note that it is written for pandas objects (X as a DataFrame, y as a Series sharing its index).

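def train_test_eq_split(X, y, n_per_class, random_state=None):
    # A local RandomState leaves NumPy's global seed untouched and also
    # honours random_state=0, which `if random_state:` would have skipped.
    rng = np.random.RandomState(random_state)
    # Draw n_per_class rows from each class; the result is indexed by
    # (class label, original row label).
    sampled = X.groupby(y, sort=False).apply(
        lambda frame: frame.sample(n_per_class, random_state=rng))
    # Level 1 of that index holds the original labels of the test rows.
    mask = sampled.index.get_level_values(1)

    X_train = X.drop(mask)
    X_test = X.loc[mask]
    y_train = y.drop(mask)
    y_test = y.loc[mask]

    return X_train, X_test, y_train, y_test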

Example case:

data = pd.DataFrame({'classes': np.repeat([1, 2, 3], [10, 20, 30]),
                     'b': np.random.randn(60),
                     'c': np.random.randn(60)})
y = data.pop('classes')

X_train, X_test, y_train, y_test = train_test_eq_split(
    data, y, n_per_class=5, random_state=123)

y_test.value_counts()
# 3    5
# 2    5
# 1    5
# Name: classes, dtype: int64

How it works:

Perform a groupby on X, keyed by y, and sample n_per_class rows from each group.
Get the inner index level of the result. These labels are the index of the test set, and dropping them from the original data leaves the train set.
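Since the question works with NumPy arrays rather than pandas objects, the same idea can be sketched on positional indices. This is only one possible variant, not part of sklearn: the helper name equal_per_class_test_indices is hypothetical, and it assumes every class has at least n_per_class samples.

```python
import numpy as np

def equal_per_class_test_indices(y, n_per_class, random_state=None):
    """Return (train_idx, test_idx) with n_per_class test samples per class."""
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    test_parts = []
    for label in np.unique(y):
        # Positions of all samples that belong to this class.
        positions = np.flatnonzero(y == label)
        # Draw without replacement; raises ValueError if the class is too small.
        test_parts.append(rng.choice(positions, size=n_per_class, replace=False))
    test_idx = np.concatenate(test_parts)
    # Every position not drawn for the test set goes to the train set.
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Data from the question.
X = np.array([[3, 10], [4, 11], [5, 12], [6, 13], [7, 14], [8, 15], [9, 16]])
y = np.array([1, 1, 1, 2, 2, 2, 2])

train_idx, test_idx = equal_per_class_test_indices(y, n_per_class=2, random_state=0)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Here y_test holds two samples of class 1 and two of class 2, and the remaining three rows form the train set. Which rows are drawn depends on random_state.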
