I have a dataframe with schema -
|-- record_id: integer (nullable = true)
|-- Data1: string (nullable = true)
|-- Data2: string (nullable = true)
|-- Data3: string (nullable = true)
|-- Time: timestamp (nullable = true)
I wanted to retrieve the last record in the data, grouping by the record_id and with the greatest timestamp.
So,if the data is initially this:
+----------+---------+---------+---------+-----------------------+
|record_id |Data1 |Data2 |Data3 | Time|
+----------+---------+-------------------------------------------+
| 1 | aaa | null | null | 2018-06-04 21:51:53.0 |
| 1 | null | bbbb | cccc | 2018-06-05 21:51:53.0 |
| 1 | aaa | null | dddd | 2018-06-06 21:51:53.0 |
| 1 | qqqq | wwww | eeee | 2018-06-07 21:51:53.0 |
| 2 | aaa | null | null | 2018-06-04 21:51:53.0 |
| 2 | aaaa | bbbb | cccc | 2018-06-05 21:51:53.0 |
| 3 | aaa | null | dddd | 2018-06-06 21:51:53.0 |
| 3 | aaaa | bbbb | eeee | 2018-06-08 21:51:53.0 |
I want the output to be
+----------+---------+---------+---------+-----------------------+
|record_id |Data1 |Data2 |Data3 | Time|
+----------+---------+-------------------------------------------+
| 1 | qqqq | wwww | eeee | 2018-06-07 21:51:53.0 |
| 2 | aaaa | bbbb | cccc | 2018-06-05 21:51:53.0 |
| 3 | aaaa | bbbb | eeee | 2018-06-08 21:51:53.0 |
I tried to join 2 queries on the same stream, similar to the answer here.
My code (where df1 is the original dataframe) :
df1=df1.withWatermark("Timetemp", "2 seconds")
df1.createOrReplaceTempView("tbl")
df1.printSchema()
query="select t.record_id as record_id, max(t.Timetemp) as Timetemp from tbl t group by t.record_id"
df2=spark.sql(query)
df2=df2.withWatermark("Timetemp", "2 seconds")
qws=df1.alias('a').join(df2.alias('b'),((col('a.record_id')==col('b.record_id')) & (col("a.Timetemp")==col("b.Timetemp"))))
query = qws.writeStream.outputMode('append').format('console').start()
query.awaitTermination()
I keep getting this error,though:
Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
When there is clearly a watermark. What can be done ?
I cannot use windowing since non time based windowing is not supported in streaming.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…