Join on minimum date between two dates - Spark SQL

I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.

month_df

| ID | month_end_date |
| -- | -------------- |
| 1  | 2019-07-31     |
| 1  | 2019-06-30     |
| 2  | 2019-10-31     |

daily_df

| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1  | 2019-07-29 | 1          |
| 1  | 2019-07-30 | 1          |
| 1  | 2019-08-01 | 2          |
| 1  | 2019-08-02 | 2          |
| 1  | 2019-08-03 | 2          |
| 1  | 2019-06-29 | 0          |
| 1  | 2019-06-30 | 0          |
| 2  | 2019-10-30 | 5          |
| 2  | 2019-10-31 | NULL       |
| 2  | 2019-11-01 | 6          |
| 2  | 2019-11-02 | 6          |

I want to fuzzy-join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days). I want to keep only the record with the minimum daily_date and a non-null new_status.

This post solves the issue using an OUTER APPLY in SQL Server, but OUTER APPLY is not available in Spark SQL. I'm wondering if there's a similarly computationally efficient method that works in Spark.
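A minimal sketch of one way this might be expressed in Spark SQL, assuming month_df and daily_df are registered as temporary views (the view names and the 5-day buffer are taken from the question): a range join bounded by date_add, then a ROW_NUMBER window to keep the earliest non-null match per monthly record.

```sql
-- Sketch only: range-join each monthly record to daily rows within a 5-day
-- buffer, drop NULL statuses, then keep the earliest daily_date per month.
WITH candidates AS (
  SELECT
    m.ID,
    m.month_end_date,
    d.daily_date,
    d.new_status,
    ROW_NUMBER() OVER (
      PARTITION BY m.ID, m.month_end_date
      ORDER BY d.daily_date
    ) AS rn
  FROM month_df m
  JOIN daily_df d
    ON  d.ID = m.ID
    AND d.daily_date >= m.month_end_date
    AND d.daily_date <  date_add(m.month_end_date, 5)
    AND d.new_status IS NOT NULL
)
SELECT ID, month_end_date, daily_date, new_status
FROM candidates
WHERE rn = 1
```

On the sample data this would pair 2019-07-31 with 2019-08-01 (status 2), 2019-06-30 with itself (status 0), and 2019-10-31 with 2019-11-01 (status 6), since the 2019-10-31 daily row is skipped for its NULL status. Whether this matches the OUTER APPLY approach in efficiency depends on how Spark plans the range join, so it is offered as a starting point rather than a definitive answer.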

question from:https://stackoverflow.com/questions/65907854/join-on-minimum-date-between-two-dates-spark-sql


1 Answer

Waiting for answers
