0 votes
1.0k views
in Technique by (71.8m points)

apache spark - filter pyspark dataframe rows between two different date ranges

I have a PySpark dataframe with an "id" column and a date column "parsed_date" (dtype: date, format: yyyy-MM-dd). I have created a function that gives me the count of id for each day in a given date range.

The function returns 2 dataframes: df1 with rows from the date range ±1 week around the given day, and df2 with rows ±2 weeks from the given day, in the following way:

df1 should filter rows from range 1: (day - t, day + t)
df2 should filter rows from range 2: (day - 2t, day - t) appended with (day + t, day + 2t)
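
For concreteness, with day = '2017-12-18' and t = 2 (the example call used below), the windows work out like this (a quick plain-Python sketch just to illustrate the ranges; it is not part of the Spark code):

from datetime import date, timedelta

day = date(2017, 12, 18)
t = timedelta(days=2)

# range 1 (df1): one contiguous window around the given day
range_1 = (day - t, day + t)            # 2017-12-16 .. 2017-12-20

# range 2 (df2): two windows, one on each side of range 1
range_2_before = (day - 2*t, day - t)   # 2017-12-14 .. 2017-12-16
range_2_after  = (day + t, day + 2*t)   # 2017-12-20 .. 2017-12-22

Because between is inclusive on both ends, the boundary days (2017-12-16 and 2017-12-20 here) appear in both dataframes, which matches the expected output further down.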

This is the code that I have been using to create df1 with the required date range, but I don't know how to append/concat the second date range for df2:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def part_1(df, day, t):
    """
    Example usage: df_list = part_1(df, '2017-12-18', 2)
    Returns a list of 2 dataframes.
    """
    h1_df1 = (df.filter(f"parsed_date between '{day}' - interval {t} days and '{day}' + interval {t} days")
                .withColumn('count_before', F.count('id').over(Window.partitionBy('parsed_date')))
                .orderBy('parsed_date')
             )

    # my attempt at the second range -- this .concat(...) does not work
    h1_df2 = (df.filter(f"parsed_date between '{day}' - interval {t*2} days and '{day}' - interval {t} days")
                .concat(f"parsed_date between '{day}' + interval {t} days and '{day}' + interval {t*2} days")
                .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
                .orderBy('parsed_date')
             )

    return [h1_df1, h1_df2]

Output for h1_df1 on calling part_1(df, '2017-12-18', 2):

+-------+-----------+------------+
|     id|parsed_date|count_before|
+-------+-----------+------------+
|1471783| 2017-12-16|           2|
|1471885| 2017-12-16|           2|
|1472928| 2017-12-17|           2|
|1476917| 2017-12-17|           2|
|1477469| 2017-12-18|           1|
|1478190| 2017-12-19|           4|
|1478570| 2017-12-19|           4|
|1481415| 2017-12-19|           4|
|1472592| 2017-12-19|           4|
|1474023| 2017-12-20|           1|
+-------+-----------+------------+

Expected outcome for h1_df2 on calling part_1(df, '2017-12-18', 2):

+-------+-----------+------------+
|     id|parsed_date| count_after|
+-------+-----------+------------+
|1471783| 2017-12-14|           1|
|1471885| 2017-12-16|           3|
|1472928| 2017-12-16|           3|
|1476917| 2017-12-16|           3|
|1477469| 2017-12-20|           2|
|1478190| 2017-12-20|           2|
|1478570| 2017-12-21|           2|
|1481415| 2017-12-21|           2|
|1472592| 2017-12-22|           2|
|1474023| 2017-12-22|           2|
+-------+-----------+------------+

I would love to get some help creating h1_df2. I gave it a try, but it doesn't work.

Please help!

question from: https://stackoverflow.com/questions/65893708/filter-pyspark-dataframe-rows-between-two-different-date-ranges


1 Answer

0 votes
by (71.8m points)

You can use or to combine the two filter conditions, not concat.

h1_df2 = (df.filter(f"""
    (parsed_date between '{day}' - interval {t*2} days and '{day}' - interval {t} days)
    or
    (parsed_date between '{day}' + interval {t} days and '{day}' + interval {t*2} days)
""")
            .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
            .orderBy('parsed_date')
         )
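
For completeness, here is the full function from the question with the corrected second filter, plus an example call (a sketch under the same assumptions as the question: columns id and parsed_date, and F / Window imported from pyspark.sql):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def part_1(df, day, t):
    """
    Example usage: df_list = part_1(df, '2017-12-18', 2)
    Returns a list of 2 dataframes.
    """
    # range 1: (day - t, day + t)
    h1_df1 = (df.filter(f"parsed_date between '{day}' - interval {t} days and '{day}' + interval {t} days")
                .withColumn('count_before', F.count('id').over(Window.partitionBy('parsed_date')))
                .orderBy('parsed_date')
             )

    # range 2: (day - 2t, day - t) or (day + t, day + 2t), combined in a single filter
    h1_df2 = (df.filter(f"""
                  (parsed_date between '{day}' - interval {t*2} days and '{day}' - interval {t} days)
                  or
                  (parsed_date between '{day}' + interval {t} days and '{day}' + interval {t*2} days)
              """)
                .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
                .orderBy('parsed_date')
             )

    return [h1_df1, h1_df2]

h1_df1, h1_df2 = part_1(df, '2017-12-18', 2)
h1_df2.show()

An equivalent alternative would be to filter the two sub-ranges separately and combine them with unionByName, but a single filter with or keeps it to one pass over the data.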
