Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
258 views
in Technique[技术] by (71.8m points)

sql - Getting counts for overlapping time periods

I have a table data in PostgreSQL with this structure:

created_at.       customer_email               status
2020-12-31        [email protected]                opened
...
2020-12-24        [email protected]                delivered
2020-12-24        [email protected]                opened
...
2020-12-17        [email protected]                opened
2020-12-10        [email protected]                opened
2020-12-03        [email protected]                enqueued
2020-11-27        [email protected]                opened
...
2020-11-20        [email protected]                opened
2020-11-13        [email protected]                opened

There are many rows for each day.

Basically I need 2021-W01 for this week with the count of unique emails with status "opened" within the last 90 days. Likewise for every week before that.

Desired output:

period    active
2021-W01  1539
2020-W53  1480
2020-W52  1630
2020-W51  1820
2020-W50  1910
2020-W49  1890
2020-W48  2000

How can I do that?


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Window functions would come to mind. Alas, those don't allow DISTINCT aggregations.

Instead, get distinct counts from a LATERAL subquery:

WITH weekly_dist AS (
   SELECT DISTINCT date_trunc('week', created_at) AS wk, customer_email
   FROM   tbl
   WHERE  status = 'opened'
   )
SELECT to_char(t.wk, 'YYYY"-W"IW') AS period, ct.active
FROM  (
   SELECT generate_series(date_trunc('week', min(created_at) + interval '1 week')
                        , date_trunc('week', now()::timestamp)
                        , interval '1 week') AS wk   
   FROM   tbl
   ) t
LEFT   JOIN LATERAL (
   SELECT count(DISTINCT customer_email) AS active
   FROM   weekly_dist d
   WHERE  d.wk >= t.wk - interval '91 days'
   AND    d.wk <  t.wk
   ) ct ON true;

db<>fiddle here

I operate with timestamp, not timestamptz, might make a corner case difference.

The CTE weekly_dist reduces the set to distinct "opened" emails. This step is strictly optional, but increases performance significantly if there can be more than a few duplicates per week.

The derived table t generates a timestamp for the begin of each week since the earliest entry in the table up to "now". This way I make sure no week is skipped,even if there are no rows for it. See:

But I do skip the first week since I count active emails before each start of the week.

Then LEFT JOIN LATERAL to a subquery computing the distinct count for the 90-day time-range. To be precise, I deduct 91 days, and exclude the start of the current week. This happens to fall in line with the weekly pre-aggregated data from the CTE. Be wary of that if you shift bounds.

Finally, to_char(t.wk, 'YYYY"-W"IW') is a compact expression to get your desired format for week numbers. Details in the manual here.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...