Welcome To Ask or Share your Answers For Others

Data cleaning - Fill missing values by a ratio of other values in pandas

1 Answer

answered Oct 24, 2021 by 深蓝 (71.8m points)

Starting with this DataFrame (only to create something similar to yours):

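import numpy as np
import pandas as pd

# Build a sample column with roughly the same category proportions
df = pd.DataFrame({'C1': np.random.choice(['SC', 'ST', 'GEN'], p=[0.16, 0.08, 0.76],
                                          size=1000)})
# Set a random 22% of the rows to NaN
df.loc[df.sample(frac=0.22).index] = np.nan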

It yields a column with 22% NaN and the remaining proportions are similar to yours:

df['C1'].value_counts(normalize=True, dropna=False)
Out: 
GEN    0.583
NaN    0.220
SC     0.132
ST     0.065
Name: C1, dtype: float64

df['C1'].value_counts(normalize=True)
Out: 
GEN    0.747436
SC     0.169231
ST     0.083333
Name: C1, dtype: float64
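
As a quick check on the two outputs above: value_counts(normalize=True) drops NaN by default, so the second set of figures is the first one divided by the non-missing share (1 - 0.22 = 0.78). A minimal sketch using the numbers printed above (the variable names are only for illustration):

```python
import pandas as pd

# Proportions over all rows, NaN excluded from the listing (from the first output)
with_nan = pd.Series({'GEN': 0.583, 'SC': 0.132, 'ST': 0.065})

# Renormalize over the non-missing rows only; with_nan.sum() is about 0.78
observed = with_nan / with_nan.sum()
print(observed.round(6))
```

These renormalized ratios are the ones the missing values should be drawn from.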

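Now you can use fillna with a Series of random draws from np.random.choice. fillna aligns on the index, so only the NaN positions take a drawn value and the existing values are left untouched: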

df['C1'] = df['C1'].fillna(pd.Series(np.random.choice(['SC', 'ST', 'GEN'], 
                                                      p=[0.16, 0.08, 0.76], size=len(df))))

The resulting column will have roughly these proportions (the exact figures vary from run to run because the draws are random):

df['C1'].value_counts(normalize=True, dropna=False)
Out: 
GEN    0.748
SC     0.165
ST     0.087
Name: C1, dtype: float64
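
If you would rather not hard-code the probabilities, here is a minimal sketch that takes them from the observed values instead. The helper name fill_by_observed_ratio and its seed parameter are my own, not part of pandas, and it assumes the column has at least one non-missing value:

```python
import numpy as np
import pandas as pd

def fill_by_observed_ratio(s: pd.Series, seed=None) -> pd.Series:
    """Fill NaN in `s` by sampling from the frequencies of its non-missing values."""
    rng = np.random.default_rng(seed)
    # value_counts drops NaN by default, so these ratios sum to 1
    probs = s.value_counts(normalize=True)
    missing = s.isna()
    out = s.copy()
    if missing.any():
        # Draw one category per missing row, weighted by the observed ratios
        out.loc[missing] = rng.choice(probs.index.to_numpy(),
                                      size=int(missing.sum()),
                                      p=probs.to_numpy())
    return out
```

You would call it as df['C1'] = fill_by_observed_ratio(df['C1'], seed=0). Passing a seed makes the fill reproducible; leave it out to get a different draw each time.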


...