python - Why is numpy select slower than a custom function via the apply method?

asked Feb 17, 2021 in Technique by 深蓝 (71.8m points)

Say, I have the following dataframe:

df = pd.DataFrame({'a':['a','b','c (not a)', 'this is (random)']*10000})

I want to produce the following output:

array(['same as column', 'b', 'c', 'this is']*10000, dtype=object)

Towards that end, I defined the function below and passed it via the pandas apply method.

def fn(x):
    if ' (' in x:
        return x.split(' (')[0]
    elif x=='a':
        return 'same as column'
    else:
        return x

df['a'] = df['a'].apply(fn)

Then, others advised me to use vectorization, so I used the code below to produce my desired output.

df = np.select([df['a'].str.contains(r' \('), df['a']=='a'],
               [df['a'].str.split(r' \(').str[0], 'same as column'],
               default=df['a'])

Instead of running faster, this vectorized version ran noticeably slower.

21.4 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) for the apply method

116 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) for the vectorization

What's going on here? Is this normal (I thought vectorization was the fastest option available)? Or is there a problem with my code?

1 Answer

深蓝 · Answer 1 · 2021-02-16T20:56:08+0000

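Your benchmark compares two different implementations, so the conclusion that vectorization is slower does not follow. The relevant factor is that the pandas .str methods are not vectorized in the NumPy sense: on an object-dtype column they are implicit Python-level loops over the elements, with extra overhead on top.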
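To make the "implicit loop" point concrete, here is a minimal sketch that runs the same split once through the .str accessor and once as an explicit list comprehension. The Series s and the names via_accessor and via_loop are illustrative, not from the question, and the column is forced to object dtype on the assumption that this is the storage the benchmark used.

```python
import pandas as pd

# Illustrative data: the same four values as in the question, as object dtype.
s = pd.Series(['a', 'b', 'c (not a)', 'this is (random)'], dtype=object)

# pandas .str accessor: a single call, but each element is still processed
# one at a time. The pattern is a regex, so the parenthesis is escaped.
via_accessor = s.str.split(r' \(').str[0].to_list()

# The same work written as an explicit Python loop (literal separator).
via_loop = [x.split(' (')[0] for x in s.to_list()]

print(via_accessor == via_loop)
```

Both paths do the same per-element work, so the accessor is a convenience rather than a speed-up for this kind of data.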

On a Colab instance, these are the results for your benchmark:

%%timeit
df['a'].apply(fn)

100 loops, best of 3: 8.79 ms per loop

%%timeit
np.select([df['a'].str.contains(r' \('), df['a']=='a'],
    [df['a'].str.split(r' \(').str[0], 'same as column'],
    default=df['a'])

10 loops, best of 3: 51.3 ms per loop
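Outside a notebook, the same comparison can be reproduced with the standard timeit module. The following is only a sketch: the wrapper functions with_apply and with_select are hypothetical names, the repeat counts are arbitrary, and the absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c (not a)', 'this is (random)'] * 10000})

def fn(x):
    # Same logic as the function in the question.
    if ' (' in x:
        return x.split(' (')[0]
    elif x == 'a':
        return 'same as column'
    else:
        return x

def with_apply():
    return df['a'].apply(fn).to_numpy()

def with_select():
    s = df['a']
    return np.select(
        [s.str.contains(r' \('), s == 'a'],
        [s.str.split(r' \(').str[0], 'same as column'],
        default=s,
    )

# Check that both approaches agree before comparing their timings.
assert list(with_apply()) == list(with_select())

# Best of 3 repeats, 3 calls each, reported as seconds per call.
t_apply = min(timeit.repeat(with_apply, number=3, repeat=3)) / 3
t_select = min(timeit.repeat(with_select, number=3, repeat=3)) / 3
print(f"apply:  {t_apply * 1e3:.2f} ms per call")
print(f"select: {t_select * 1e3:.2f} ms per call")
```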

To see where the time is spent, time the two .str calls on their own:

%%timeit
df['a'].str.contains(r' \(')
df['a'].str.split(r' \(').str[0]

10 loops, best of 3: 48.2 ms per loop

So nearly all of the 51.3 ms is spent in the two .str calls, not in np.select itself. Finally, compare Python's built-in str.split with pandas' str.split:

%timeit df['a'].str.split(r' \(').str[0]
%timeit [x.split(' (')[0] for x in df['a'].to_list()]

10 loops, best of 3: 36.3 ms per loop
100 loops, best of 3: 6.59 ms per loop
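Putting these measurements together, one possible way to keep the array-style selection while avoiding the slow .str calls is to do the string work in a plain list comprehension and leave only the equality test to NumPy. This is a sketch under that assumption, not something proposed in the answer above; the names stripped and result are illustrative, and whether it actually beats apply on your data should be measured rather than assumed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c (not a)', 'this is (random)'] * 10000})

def fn(x):
    # Reference implementation from the question, used to check the result.
    if ' (' in x:
        return x.split(' (')[0]
    elif x == 'a':
        return 'same as column'
    else:
        return x

values = df['a'].to_list()

# Plain Python split: returns the string unchanged when ' (' is absent,
# so no separate "contains" check is needed.
stripped = np.array([x.split(' (')[0] for x in values], dtype=object)

# Elementwise comparison on the original values, then pick per element.
is_a = df['a'].to_numpy() == 'a'
result = np.where(is_a, 'same as column', stripped)

# Sanity check against the apply-based version.
print(result.tolist() == df['a'].apply(fn).tolist())
```

The split and the comparison each touch every element once, which is the same amount of per-element Python work as fn but without the per-call overhead of the .str accessor.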
