Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Numpy vectorization messes up data type (2)

I'm seeing unwanted behaviour from np.vectorize: it changes the datatype of the argument passed to the original function. My original question is about the general case, and I'll use this new question to ask about a more specific case.

(Why a second question? I've created it around a more specific case in order to illustrate the problem, since it's always easier to go from the specific to the more general. And I've posted it separately because I think it's useful to keep the general case, and a general answer to it (should one be found), by themselves and not 'contaminated' by thinking about solving any particular problem.)

So, a concrete example. Where I live, Wednesday is Lottery Day. So, let's start with a pandas dataframe with a date column with all Wednesdays this year:

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})

I want to see which of these possible days I'll actually play on. I don't feel particularly lucky at the beginning and end of each month, and there are some months I feel especially unlucky about. Therefore I use this function to see if a date qualifies:

def qualifies(dt, excluded_months=()):
    # A date qualifies if:
    # - it's on or after the 5th of the month; and
    # - at least 5 days remain until the end of the month (incl. the date itself); and
    # - it's not in one of the months in excluded_months.
    if dt.day < 5:
        return False
    # Distance from dt to the first day of the next month
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

I hope you realise that this example is still somewhat contrived ;) But it's closer to what I'm trying to do. I try to apply this function in two ways:

df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))
df['qualifies2'] = np.vectorize(qualifies, excluded=[1])(df['date'], [3, 8])

As far as I know, both should work, and I'd prefer the latter, as the former is slow and frowned upon. Edit: I've since learned that the latter is frowned upon as well, lol.

However, only the first one succeeds; the second one fails with AttributeError: 'numpy.datetime64' object has no attribute 'day'. So my question is whether there is a way to use np.vectorize on this function qualifies, which takes a datetime/timestamp as an argument.
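A minimal sketch of where the numpy.datetime64 in that error message comes from. It assumes the same df as above; the names from_series and from_array are only illustrative. Indexing the Series yields a pandas Timestamp, whereas the plain NumPy array behind the column holds numpy.datetime64 scalars, which have no .day attribute. Wrapping such a scalar in pd.Timestamp restores the attribute.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})

# Element access on the Series yields a pandas Timestamp, which has .day.
from_series = df['date'].iloc[0]

# Converting the column to a plain NumPy array yields numpy.datetime64
# scalars, which is the type named in the AttributeError.
from_array = np.asarray(df['date'])[0]

print(type(from_series).__name__, hasattr(from_series, 'day'))
print(type(from_array).__name__, hasattr(from_array, 'day'))

# Converting back to a Timestamp restores the .day attribute.
print(pd.Timestamp(from_array).day)
```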

Many thanks!

PS: for the interested, this is df:

In [15]: df
Out[15]: 
         date  qualifies1
0  2020-01-01       False
1  2020-01-08        True
2  2020-01-15        True
3  2020-01-22        True
4  2020-01-29       False
5  2020-02-05        True
6  2020-02-12        True
7  2020-02-19        True
8  2020-02-26       False
9  2020-03-04       False
10 2020-03-11       False
11 2020-03-18       False
12 2020-03-25       False
13 2020-04-01       False
14 2020-04-08        True
15 2020-04-15        True
16 2020-04-22        True
17 2020-04-29       False
18 2020-05-06        True
19 2020-05-13        True
20 2020-05-20        True
21 2020-05-27        True
22 2020-06-03       False
23 2020-06-10        True
24 2020-06-17        True
25 2020-06-24        True
26 2020-07-01       False
27 2020-07-08        True
28 2020-07-15        True
29 2020-07-22        True
30 2020-07-29       False
31 2020-08-05       False
32 2020-08-12       False
33 2020-08-19       False
34 2020-08-26       False
35 2020-09-02       False
36 2020-09-09        True
37 2020-09-16        True
38 2020-09-23        True
39 2020-09-30       False
40 2020-10-07        True
41 2020-10-14        True
42 2020-10-21        True
43 2020-10-28       False
44 2020-11-04       False
45 2020-11-11        True
46 2020-11-18        True
47 2020-11-25        True
48 2020-12-02       False
49 2020-12-09        True
50 2020-12-16        True
51 2020-12-23        True
52 2020-12-30       False

1 Answer


I think @rpanai's answer on the original post is still the best. Here are my tests:

def qualifies(dt, excluded_months=()):
    # Original function: expects dt to be a pandas Timestamp
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

def new_qualifies(dt, excluded_months=()):
    # Same checks, but convert the input to a pandas Timestamp first,
    # so a numpy.datetime64 is accepted as well
    dt = pd.Timestamp(dt)
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=12000)})

apply method:

%%timeit
df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))

385 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


conversion method (the same apply call, but with new_qualifies):

%%timeit
df['qualifies1'] = df['date'].apply(lambda x: new_qualifies(x, [3, 8]))

389 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
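The pd.Timestamp conversion is also what makes the function usable with np.vectorize, which is what the question asks for. Below is a minimal sketch, assuming the 53-row frame from the question rather than the 12000-row benchmark frame. Here new_qualifies simply delegates to qualifies after converting, which should be equivalent to the version above; the names via_apply and via_vectorize are only illustrative.

```python
import numpy as np
import pandas as pd

def qualifies(dt, excluded_months=()):
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

def new_qualifies(dt, excluded_months=()):
    # Normalise whatever np.vectorize passes in to a pandas Timestamp.
    return qualifies(pd.Timestamp(dt), excluded_months)

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})

via_apply = df['date'].apply(lambda x: qualifies(x, [3, 8]))

# excluded=[1] keeps the second positional argument (the month list)
# from being broadcast element by element.
via_vectorize = np.vectorize(new_qualifies, excluded=[1])(df['date'], [3, 8])

# Both approaches should agree on every row.
print((via_vectorize == via_apply.to_numpy()).all())
```

Note that the NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, so this is not expected to be faster than apply.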


vectorized code:

%%timeit
df['qualifies2'] = np.logical_not(
    (df['date'].dt.day < 5).values
    | ((df['date'] + pd.tseries.offsets.MonthBegin(1) - df['date']).dt.days < 5).values
    | df['date'].dt.month.isin([3, 8]).values
)

4.83 ms ± 117 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
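As a sanity check, the vectorized idea can be compared against the row-by-row function. The sketch below is a variant, not the exact expression timed above: it computes the remaining days from dt.days_in_month instead of adding a MonthBegin offset. It assumes the dates carry no time-of-day component (true for the date_range used here); with times attached, the two formulations can differ. The names days_left, mask and expected are only illustrative.

```python
import pandas as pd

def qualifies(dt, excluded_months=()):
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})
dates = df['date']

# Days left in the month, counting the date itself (27 May gives 5).
days_left = dates.dt.days_in_month - dates.dt.day + 1

# All three conditions expressed on whole columns at once.
mask = (dates.dt.day >= 5) & (days_left >= 5) & ~dates.dt.month.isin([3, 8])

# Row-by-row reference result from the original function.
expected = dates.apply(lambda x: qualifies(x, [3, 8]))

print((mask == expected).all())
```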

