python - Pandas: peculiar performance drop for inplace rename after dropna

Question

Welcome To Ask or Share your Answers For Others

python - Pandas: peculiar performance drop for inplace rename after dropna

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Pandas: peculiar performance drop for inplace rename after dropna

I have reported this as an issue on pandas issues. In the meanwhile I post this here hoping to save others time, in case they encounter similar issues.

Upon profiling a process which needed to be optimized I found that renaming columns NOT inplace improves performance (execution time) by x120. Profiling indicates this is related to garbage collection (see below).

Furthermore, the expected performance is recovered by avoiding the dropna method.

The following short example demonstrates a factor x12:

import pandas as pd
import numpy as np

inplace=True

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r),r/3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

100 loops, best of 3: 15.6 ms per loop

first output line of %%prun:

ncalls tottime percall cumtime percall filename:lineno(function)
1  0.018 0.018 0.018 0.018 {gc.collect}

inplace=False

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r),r/3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 1.24 ms per loop

avoid dropna

The expected performance is recovered by avoiding the dropna method:

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r),r/3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
#no dropna:
df = (df1-df2)#.dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

1000 loops, best of 3: 865 μs per loop

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r),r/3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
## no dropna
df = (df1-df2)#.dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 902 μs per loop

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-16T22:11:49+0000

This is a copy of the explanation on github.

There is no guarantee that an inplace operation is actually faster. Often they are actually the same operation that works on a copy, but the top-level reference is reassigned.

The reason for the difference in performance in this case is as follows.

The (df1-df2).dropna() call creates a slice of the dataframe. When you apply a new operation, this triggers a SettingWithCopy check because it could be a copy (but often is not).

This check must perform a garbage collection to wipe out some cache references to see if it's a copy. Unfortunately python syntax makes this unavoidable.

You can not have this happen, by simply making a copy first.

df = (df1-df2).dropna().copy()

followed by an inplace operation will be as performant as before.

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.

Categories

python - Pandas: peculiar performance drop for inplace rename after dropna

python - Pandas: peculiar performance drop for inplace rename after dropna

inplace=True

inplace=False

avoid dropna

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags