I have the following parquet file:
+------------+------------+--------+
| gf_cutoff  | country_id | gf_mlt |
+------------+------------+--------+
| 2020-12-14 | DZ         | 5      |
| 2020-08-06 | DZ         | 4      |
| 2020-07-03 | DZ         | 4      |
| 2020-12-14 | LT         | 1      |
| 2020-08-06 | LT         | 1      |
| 2020-07-03 | LT         | 1      |
+------------+------------+--------+
As you can see, it is partitioned by country_id and ordered by gf_cutoff DESC. What I want to do is compare gf_mlt to check whether the value has changed. To do that, I want to compare the most recent gf_cutoff with the second most recent one.
An example of this case would be comparing:
2020-12-14 DZ 5
with
2020-08-06 DZ 4
If the value on the most recent date differs from the value on the second row, I want to write the most recent value (5 for DZ) into a new column, and into another column write True if the value has changed or False if it has not.
After doing this comparison, delete the rows with the older dates.
For DZ the value has changed, and for LT it hasn't, because it is 1 the whole time.
So the output would be like this:
+------------+------------+--------+------------+-----------+
| gf_cutoff  | country_id | gf_mlt | Has_change | old_value |
+------------+------------+--------+------------+-----------+
| 2020-12-14 | DZ         | 5      | True       | 4         |
| 2020-12-14 | LT         | 1      | False      | 1         |
+------------+------------+--------+------------+-----------+
If you need more explanation, just tell me.
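One way to approach this (a minimal sketch, not a definitive answer) is with Spark window functions: `lead` fetches the value from the second most recent row within each `country_id`, and `row_number` lets you keep only the most recent row. The object name `GfCompare` and the inline sample data are my own; the column names match the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object GfCompare {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("gf-compare")
      .getOrCreate()
    import spark.implicits._

    // Sample data matching the question; in practice this would be
    // spark.read.parquet("...") on the real file.
    val df = Seq(
      ("2020-12-14", "DZ", 5),
      ("2020-08-06", "DZ", 4),
      ("2020-07-03", "DZ", 4),
      ("2020-12-14", "LT", 1),
      ("2020-08-06", "LT", 1),
      ("2020-07-03", "LT", 1)
    ).toDF("gf_cutoff", "country_id", "gf_mlt")

    // One window per country, newest cutoff first.
    // yyyy-MM-dd strings sort correctly lexicographically; cast to date otherwise.
    val w = Window.partitionBy("country_id").orderBy(col("gf_cutoff").desc)

    val result = df
      // gf_mlt from the second most recent row in the same country
      .withColumn("old_value", lead("gf_mlt", 1).over(w))
      .withColumn("rn", row_number().over(w))
      // keep only the most recent row per country, dropping the older ones
      .filter(col("rn") === 1)
      // note: Has_change is null when a country has only one row (old_value is null)
      .withColumn("Has_change", col("gf_mlt") =!= col("old_value"))
      .drop("rn")

    result.show()
    spark.stop()
  }
}
```

With the sample data above, `result.show()` produces one row per country: DZ with `Has_change = true` and `old_value = 4`, LT with `Has_change = false` and `old_value = 1`.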
question from:
https://stackoverflow.com/questions/65842685/compare-two-values-with-scala-spark