Say I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':['a','b','c (not a)', 'this is (random)']*10000})
I want to produce the following output:
array(['same as column', 'b', 'c', 'this is']*10000, dtype=object)
To that end, I defined the function below and applied it with the pandas apply method.
def fn(x):
    if ' (' in x:
        return x.split(' (')[0]
    elif x=='a':
        return 'same as column'
    else:
        return x
df['a'] = df['a'].apply(fn)
Then, others advised me to use vectorization, so I used the code below to produce my desired output.
# the parenthesis is escaped because both patterns are compiled as regular expressions
df = np.select([df['a'].str.contains(r' \('), df['a']=='a'],
               [df['a'].str.split(r' \(').str[0], 'same as column'],
               default=df['a'])
Instead of running faster, this vectorized version ran noticeably slower.
21.4 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
for the apply method
116 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
for the vectorization
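For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch of the two approaches side by side. The helper names `make_df`, `with_apply` and `with_select` are made up for this script and are not part of pandas or NumPy. I match the parenthesis literally in `contains` and escape it in `split` so that neither pattern is parsed as a broken regular expression. The absolute timings will differ by machine and by pandas/NumPy version, so I have not included expected numbers.

```python
import timeit

import numpy as np
import pandas as pd


def make_df():
    # Same data as in the question: 4 distinct strings repeated 10000 times.
    return pd.DataFrame({'a': ['a', 'b', 'c (not a)', 'this is (random)'] * 10000})


def fn(x):
    # Row-wise rule: strip a trailing " (...)" part, or relabel 'a'.
    if ' (' in x:
        return x.split(' (')[0]
    elif x == 'a':
        return 'same as column'
    else:
        return x


def with_apply(df):
    # apply calls fn once per element in a Python-level loop.
    return df['a'].apply(fn).to_numpy()


def with_select(df):
    s = df['a']
    # Both conditions and the split result are computed for every row
    # before np.select picks one value per row.
    return np.select(
        [s.str.contains(' (', regex=False), s == 'a'],
        [s.str.split(r' \(').str[0], 'same as column'],
        default=s,
    )


if __name__ == '__main__':
    df = make_df()
    for name, func in [('apply', with_apply), ('np.select', with_select)]:
        seconds = timeit.timeit(lambda: func(df), number=10) / 10
        print(f'{name}: {seconds * 1000:.1f} ms per loop')
```

Both functions return an object array with the same values, so the timing difference is not caused by the two versions computing different things.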
What's going on here? Is this normal (I thought vectorization was the fastest option available)? Or is there a problem with my code?