I have a PySpark dataframe with an "id" column and a date column "parsed_date" (dtype: date, format: yyyy-MM-dd). I have created a function that gives me the count of id for each day in a given date range.
The function returns 2 dataframes: df1 with rows from the range ±t days around the given day, and df2 with rows from the band between ±t and ±2t days out (±1 and ±2 weeks in my original use case), as follows:
df1 should filter rows from range 1: (day - t, day + t)
df2 should filter rows from range 2: (day - 2t, day - t) combined with (day + t, day + 2t); the concrete boundaries for the example call are worked out just below.
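To make the ranges concrete, here is a quick sanity check of the boundaries for the example call part_1(df, '2017-12-18', 2), done with plain Python datetime arithmetic (not part of the Spark job):

from datetime import date, timedelta

day, t = date(2017, 12, 18), 2
print(day - timedelta(days=t), day + timedelta(days=t))      # 2017-12-16 2017-12-20 -> range 1
print(day - timedelta(days=2 * t), day - timedelta(days=t))  # 2017-12-14 2017-12-16 -> left half of range 2
print(day + timedelta(days=t), day + timedelta(days=2 * t))  # 2017-12-20 2017-12-22 -> right half of range 2

Note that the boundary days 2017-12-16 and 2017-12-20 appear in both expected outputs below, since BETWEEN is inclusive on both ends.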
Below is the code I have been using to create df1 with the required date range, but I don't know how to append/concat the second date range for df2:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def part_1(df, day, t):
    """
    Example usage: df_list = part_1(df, '2017-12-18', 2)
    Returns a list of 2 dataframes.
    """
    # Range 1: (day - t, day + t). This part works as required.
    h1_df1 = (df.filter(f"parsed_date between '{day}' - interval {t} days and '{day}' + interval {t} days")
              .withColumn('count_before', F.count('id').over(Window.partitionBy('parsed_date')))
              .orderBy('parsed_date')
              )
    # Range 2: my attempt. This fails: a DataFrame has no .concat() method,
    # so the second BETWEEN condition never reaches the filter.
    h1_df2 = (df.filter(f"parsed_date between '{day}' - interval {t*2} days and '{day}' - interval {t} days")
              .concat(f"parsed_date between '{day}' + interval {t} days and '{day}' + interval {t*2} days")
              .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
              .orderBy('parsed_date')
              )
    return [h1_df1, h1_df2]
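One direction that might work is to combine the two BETWEEN clauses with OR inside a single filter string, so the window count runs over both halves of range 2 at once. A sketch of what I mean, replacing the h1_df2 block inside part_1 (untested against my real data):

    h1_df2 = (df.filter(f"parsed_date between '{day}' - interval {t*2} days and '{day}' - interval {t} days "
                        f"or parsed_date between '{day}' + interval {t} days and '{day}' + interval {t*2} days")
              .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
              .orderBy('parsed_date')
              )

An alternative would be to filter the two halves into separate dataframes and .union() them, but then the window count would have to be added after the union.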
Output of h1_df1 when calling part_1(df, '2017-12-18', 2):
+-------+-----------+------------+
| id|parsed_date|count_before|
+-------+-----------+------------+
|1471783| 2017-12-16| 2|
|1471885| 2017-12-16| 2|
|1472928| 2017-12-17| 2|
|1476917| 2017-12-17| 2|
|1477469| 2017-12-18| 1|
|1478190| 2017-12-19| 4|
|1478570| 2017-12-19| 4|
|1481415| 2017-12-19| 4|
|1472592| 2017-12-19| 4|
|1474023| 2017-12-20| 1|
+-------+-----------+------------+
Expected output of h1_df2 when calling part_1(df, '2017-12-18', 2):
+-------+-----------+------------+
| id|parsed_date| count_after|
+-------+-----------+------------+
|1471783| 2017-12-14| 1|
|1471885| 2017-12-16| 3|
|1472928| 2017-12-16| 3|
|1476917| 2017-12-16| 3|
|1477469| 2017-12-20| 2|
|1478190| 2017-12-20| 2|
|1478570| 2017-12-21| 2|
|1481415| 2017-12-21| 2|
|1472592| 2017-12-22| 2|
|1474023| 2017-12-22| 2|
+-------+-----------+------------+
I would love some help creating h1_df2; I gave it a try above, but it doesn't work. Please help!
question from:
https://stackoverflow.com/questions/65893708/filter-pyspark-dataframe-rows-between-two-different-date-ranges