python - Vectorization Pandas DataFrame

Question

Welcome To Ask or Share your Answers For Others

python - Vectorization Pandas DataFrame

asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Vectorization Pandas DataFrame

Assume the following simplified framework:

I have a 3D Pandas dataframe of parameters composed of 100 rows, 4 classes and 4 features for each instance:

iterables = [list(range(100)), [0,1,2,3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))), index=index, columns=columns)

parameters

instances classes   a     b     c     d                     
0         0        1127  1460   861  1295
          1        1131  1096  1725  1045
          2        1639   122   467  1239
          3         331  1483    88  1397
1         0        1124   872  1688   131
...                 ...   ...   ...   ...
98        3        1321  1750   779  1431
99        0        1793   814  1637  1429
          1        1370  1646   420  1206
          2         983   825  1025  1855
          3        1974   567   371   936

Let df be a dataframe that for each instance and each feature (column), report the observed class.

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))), index=list(range(100)), 
                  columns=columns)
    a  b  c  d
0   2  0  2  2
1   0  0  2  1
2   2  2  2  2
3   0  2  1  0
4   1  1  1  1
.. .. .. .. ..
95  1  2  0  1
96  2  1  2  1
97  0  0  1  2
98  0  0  0  1
99  1  2  2  2

I would like to create a third dataframe (let's call it new_df) of shape (100, 4) containing the parameters in the dataframe parameters based on the observed classes on the dataframe df.

For example, in the first row of df for the first column (a) i observe the class 2, so the value I am interested in is the second class in the first instance of the parameters dataframe, namely 1127 that will populate the first row and column of new df. Following this method, the first observation for the column "b" is class 0, so in the first row, column b of the new_df I would like to observe 1460 and so on.

With a for loop I can obtain the desired result:

new_df = pd.DataFrame(0, index=list(range(100)), columns=columns) # initialize the df
for i in range(len(df)):
    for c in df.columns:
        new_df.iloc[i][c] = parameters.loc[i][c][df.iloc[i][c]]

new_df

    a     b      c    d
0   1639  1460   467  1239
1   1124   872   806   344
2   1083   511  1706  1500
3    958  1155  1268   563
4     14   242   777  1370
..   ...   ...   ...   ...
95  1435  1316  1709   755
96   346   712   363   815
97  1234   985   683  1348
98   127  1130  1009  1014
99  1370   825  1025  1855

However, the original dataset contains millions of rows and hundreds of columns, and proceeding with for loop is unfeasible.

Is there a way to vectorize such a problem in order to avoid for loops? (at least over 1 dimension)

question from:https://stackoverflow.com/questions/65889255/vectorization-pandas-dataframe

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-06T19:19:32+0000

Reshape both DataFrames, using stack, into a long format, then perform the merge and reshape, with unstack, back to the wide format. There's a bunch of renaming just so we can reference and align the columns in the merge.

(df.rename_axis(index='instances', columns='cols').stack().to_frame('classes')
   .merge(parameters.rename_axis(columns='cols').stack().rename('vals'),
          on=['instances', 'classes', 'cols'])
   .unstack(-1)['vals']
   .rename_axis(index=None, columns=None)
)

       a     b     c     d
0   1639  1460   467  1239
1   1124   872   806   344
2   1083   511  1706  1500
3    958  1155  1268   563
4     14   242   777  1370
..   ...   ...   ...   ...
95  1435  1316  1709   755
96   346   712   363   815
97  1234   985   683  1348
98   127  1130  1009  1014
99  1370   825  1025  1855

Categories

python - Vectorization Pandas DataFrame

python - Vectorization Pandas DataFrame

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags