Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

sql - AWS Athena - Query data from different years in partitions

We have large datasets partitioned in S3 like s3://bucket/year=YYYY/month=MM/day=DD/file.csv.

What is the best way to query data spanning several years in Athena while still taking advantage of the partitioning?

Here's what I tried for data from 2018-03-07 to 2020-03-06:

Query 1 - ran for 2 min 45 s before I cancelled it

SELECT dt, col1, col2
FROM mytable
WHERE year BETWEEN '2018' AND '2020'
AND dt BETWEEN '2018-03-07' AND '2020-03-06'
ORDER BY dt

Query 2 - ran for about 2 min. However, I don't think it would be efficient for a longer period, for example from 2005 to 2020

SELECT dt, col1, col2
FROM mytable
WHERE (year = '2018' AND month >= '03' AND dt >= '2018-03-07')
OR year = '2019' OR (year = '2020' AND month <= '03' AND dt <= '2020-03-06')
ORDER BY dt
question from:https://stackoverflow.com/questions/65893036/aws-athena-query-data-from-different-years-in-partitions

1 Answer

I would suggest repartitioning the table by dt only (yyyy-MM-dd) instead of by year, month and day. This is simple, and partition pruning will work. The trade-off is that queries filtering only on year must be rewritten against dt: for example, where year > '2020' becomes where dt > '2020-12-31', and so on.
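To make the rewrite rule concrete, here is a minimal Python sketch that maps a comparison on the old year partition to the equivalent comparison on dt. The function name year_filter_to_dt is hypothetical, and it assumes dt is stored as a zero-padded yyyy-MM-dd string, so that string comparison matches date order.

```python
def year_filter_to_dt(op: str, year: int) -> str:
    """Translate `year <op> <year>` into an equivalent predicate on dt.

    Assumes dt is a zero-padded 'yyyy-MM-dd' string partition column.
    """
    first = f"{year:04d}-01-01"  # first day of the year
    last = f"{year:04d}-12-31"   # last day of the year
    rewrites = {
        "=": f"dt BETWEEN '{first}' AND '{last}'",
        ">": f"dt > '{last}'",     # strictly after the year ends
        ">=": f"dt >= '{first}'",
        "<": f"dt < '{first}'",    # strictly before the year starts
        "<=": f"dt <= '{last}'",
    }
    if op not in rewrites:
        raise ValueError(f"unsupported operator: {op}")
    return rewrites[op]


print(year_filter_to_dt(">", 2020))
print(year_filter_to_dt("=", 2019))
```

Note that the strict comparisons move to the opposite end of the year: year > 2020 excludes all of 2020, so the dt bound is the last day of 2020, not the first.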

Also, by the way, in Hive partition pruning works fine with queries like this:

where concat(year, '-', month, '-', day) >= '2018-03-07'
      and 
      concat(year, '-', month, '-', day) <= '2020-03-06'

I can't check whether the same works in Presto (the engine behind Athena), but it is worth trying. You can use the || operator instead of concat().
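As a quick sanity check of the predicate logic only, the || form can be run against an in-memory SQLite table from Python. This is a sketch under assumptions: SQLite stands in for Athena, the helper rows_in_range and the generated rows are made up for illustration, and year, month and day are zero-padded strings as in the question. It says nothing about whether Athena prunes partitions for this expression; that still has to be checked there, for example by looking at the amount of data scanned.

```python
import sqlite3
from datetime import date, timedelta


def rows_in_range(start: str, end: str) -> list:
    """Return the dates selected by the concatenated-partition predicate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (year TEXT, month TEXT, day TEXT)")

    # One row per day from 2018-01-01 to 2020-12-31, zero-padded like the S3 keys.
    d = date(2018, 1, 1)
    while d <= date(2020, 12, 31):
        conn.execute(
            "INSERT INTO mytable VALUES (?, ?, ?)",
            (f"{d.year:04d}", f"{d.month:02d}", f"{d.day:02d}"),
        )
        d += timedelta(days=1)

    # Same shape as the Hive predicate, using || instead of concat().
    query = """
        SELECT year || '-' || month || '-' || day AS dt
        FROM mytable
        WHERE year || '-' || month || '-' || day >= ?
          AND year || '-' || month || '-' || day <= ?
        ORDER BY dt
    """
    rows = [r[0] for r in conn.execute(query, (start, end))]
    conn.close()
    return rows


selected = rows_in_range("2018-03-07", "2020-03-06")
print(selected[0], selected[-1], len(selected))
```

The comparison is a plain string comparison, so it only matches date order because every component is zero-padded to a fixed width.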

