Assume the following simplified framework:
I have a 3D Pandas dataframe of parameters composed of 100 rows, 4 classes and 4 features for each instance:
iterables = [list(range(100)), [0,1,2,3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))), index=index, columns=columns)
parameters
instances classes a b c d
0 0 1127 1460 861 1295
1 1131 1096 1725 1045
2 1639 122 467 1239
3 331 1483 88 1397
1 0 1124 872 1688 131
... ... ... ... ...
98 3 1321 1750 779 1431
99 0 1793 814 1637 1429
1 1370 1646 420 1206
2 983 825 1025 1855
3 1974 567 371 936
Let df
be a dataframe that for each instance and each feature (column), report the observed class.
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))), index=list(range(100)),
columns=columns)
a b c d
0 2 0 2 2
1 0 0 2 1
2 2 2 2 2
3 0 2 1 0
4 1 1 1 1
.. .. .. .. ..
95 1 2 0 1
96 2 1 2 1
97 0 0 1 2
98 0 0 0 1
99 1 2 2 2
I would like to create a third dataframe (let's call it new_df
) of shape (100, 4) containing the parameters in the dataframe parameters
based on the observed classes on the dataframe df
.
For example, in the first row of df
for the first column (a) i observe the class 2, so the value I am interested in is the second class in the first instance of the parameters
dataframe, namely 1127 that will populate the first row and column of new df
. Following this method, the first observation for the column "b" is class 0, so in the first row, column b of the new_df
I would like to observe 1460 and so on.
With a for loop I can obtain the desired result:
new_df = pd.DataFrame(0, index=list(range(100)), columns=columns) # initialize the df
for i in range(len(df)):
for c in df.columns:
new_df.iloc[i][c] = parameters.loc[i][c][df.iloc[i][c]]
new_df
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
However, the original dataset contains millions of rows and hundreds of columns, and proceeding with for loop is unfeasible.
Is there a way to vectorize such a problem in order to avoid for loops? (at least over 1 dimension)
question from:
https://stackoverflow.com/questions/65889255/vectorization-pandas-dataframe