Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Why is numpy select slower than a custom function applied via the apply method?

Say, I have the following dataframe:

import pandas as pd
df = pd.DataFrame({'a': ['a', 'b', 'c (not a)', 'this is (random)'] * 10000})

I want to produce the following output:

array(['same as column', 'b', 'c', 'this is']*10000, dtype=object)

Towards that end, I defined the function below and passed it via the pandas apply method.

def fn(x):
    if ' (' in x:
        return x.split(' (')[0]
    elif x=='a':
        return 'same as column'
    else:
        return x

df['a'] = df['a'].apply(fn)

Then, others advised me to use vectorization, so I used the code below to produce my desired output.

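import numpy as np
# The pattern is treated as a regex here, so the literal parenthesis is escaped
df['a'] = np.select([df['a'].str.contains(r' \('), df['a'] == 'a'],
                    [df['a'].str.split(r' \(').str[0], 'same as column'],
                    default=df['a'])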

Instead of running faster, this vectorized version ran noticeably slower.

21.4 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) for the apply method

116 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) for the vectorization

What's going on here? Is this normal (I thought vectorization was the fastest option available)? Or is there a problem with my code?



1 Answer

Your benchmark compares two different implementations, which leads to the wrong conclusion. The relevant factor is that pandas .str methods are not vectorized in the NumPy sense: on object-dtype strings they run an implicit Python-level loop over the elements.
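To make the "implicit loop" point concrete, here is a minimal sketch comparing the pandas string accessor with a hand-written loop. It assumes object-dtype strings and an escaped regex pattern (`r' \('`); the names `via_pandas` and `via_loop` are only illustrative, and the loop is a conceptual equivalent, not the actual pandas source.

```python
import re

import pandas as pd

s = pd.Series(['a', 'b', 'c (not a)', 'this is (random)'])

# pandas string accessor; the pattern is a regex, so the parenthesis is escaped
via_pandas = s.str.split(r' \(').str[0]

# Conceptually the same work: a Python-level loop that calls a
# function once per element (no NumPy-style vectorization involved)
pattern = re.compile(r' \(')
via_loop = pd.Series([pattern.split(x)[0] for x in s], index=s.index)

# Both approaches produce the same values
assert via_pandas.tolist() == via_loop.tolist()
```

Because both versions do per-element Python work, the accessor cannot be expected to beat a plain loop; it mainly adds convenience and missing-value handling.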

On a Colab instance, your benchmark gives these results:

%%timeit
df['a'].apply(fn)
100 loops, best of 3: 8.79 ms per loop

%%timeit
np.select([df['a'].str.contains(r' \('), df['a'] == 'a'],
          [df['a'].str.split(r' \(').str[0], 'same as column'],
          default=df['a'])
10 loops, best of 3: 51.3 ms per loop

To see where the time is spent, time the two string operations on their own:

%%timeit
df['a'].str.contains(r' \(')
df['a'].str.split(r' \(').str[0]
10 loops, best of 3: 48.2 ms per loop

Finally, compare Python's built-in str.split with pandas' str.split:

%timeit df['a'].str.split(r' \(').str[0]
10 loops, best of 3: 36.3 ms per loop
%timeit [x.split(' (')[0] for x in df['a'].to_list()]
100 loops, best of 3: 6.59 ms per loop
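The comparison above can also be reproduced outside a notebook. The following is a rough sketch using the standard `timeit` module instead of the IPython magics. The helper names `with_apply`, `with_listcomp` and `with_select` are hypothetical, and the escaped pattern `r' \('` is an assumption about how the parenthesis should be matched. Absolute timings depend on the machine and library versions, so none are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c (not a)', 'this is (random)'] * 10000})

def fn(x):
    if ' (' in x:
        return x.split(' (')[0]
    elif x == 'a':
        return 'same as column'
    else:
        return x

def with_apply():
    # Element-wise Python function via Series.apply
    return df['a'].apply(fn).to_numpy()

def with_listcomp():
    # The same function in a plain list comprehension
    return np.array([fn(x) for x in df['a'].tolist()], dtype=object)

def with_select():
    # np.select itself is vectorized, but its inputs are built with .str methods
    return np.select([df['a'].str.contains(r' \('), df['a'] == 'a'],
                     [df['a'].str.split(r' \(').str[0], 'same as column'],
                     default=df['a'])

# All three variants must agree before their timings are compared
expected = with_apply().tolist()
assert with_listcomp().tolist() == expected
assert with_select().tolist() == expected

for func in (with_apply, with_listcomp, with_select):
    # Best of 3 repeats, 3 calls each; absolute numbers depend on the machine
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{func.__name__}: {best * 1e3:.1f} ms per call")
```

Checking that the variants return identical results first keeps the benchmark like-for-like, which is the point the answer makes about the original comparison.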
