I have a very large dataset, so I need to use a PySpark DataFrame. Please see a simplified version of the data below:
product_type  series_no  product_amount  date (YYYY/MM/DD)
514           111        20              2020/01/01
514           111        30              2020/01/02
514           111        40              2020/01/03
514           111        50              2020/01/04
514           112        60              2020/01/01
514           112        70              2020/01/02
514           112        80              2020/01/03
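For reproducibility, here is a minimal snippet that builds this sample as a PySpark DataFrame (the dates are kept as plain strings for simplicity):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groups-example").getOrCreate()

# Simplified sample data; dates are plain strings here.
df_all = spark.createDataFrame(
    [
        (514, 111, 20, "2020/01/01"),
        (514, 111, 30, "2020/01/02"),
        (514, 111, 40, "2020/01/03"),
        (514, 111, 50, "2020/01/04"),
        (514, 112, 60, "2020/01/01"),
        (514, 112, 70, "2020/01/02"),
        (514, 112, 80, "2020/01/03"),
    ],
    ["product_type", "series_no", "product_amount", "date"],
)
```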
I am trying to groupBy this data on (product_type, series_no) to get the groups themselves, without any aggregation. For this simplified data there are two groups:
group1:
514 111 20 2020/01/01
514 111 30 2020/01/02
514 111 40 2020/01/03
514 111 50 2020/01/04
group2:
514 112 60 2020/01/01
514 112 70 2020/01/02
514 112 80 2020/01/03
Is there any way to get those groups with a PySpark DataFrame? The data is very large, and converting it all to pandas at once throws a memory error. I am trying to get the groups as stated in the pseudocode below:
Assume the data is stored in the PySpark DataFrame df_all (as built above).
for group in df_all.groups:
    # convert this group to a pandas DataFrame and process it
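The most direct translation of that pseudocode I can think of is to collect the distinct key pairs and then filter the DataFrame once per group, converting only one group at a time to pandas. A minimal sketch, assuming each individual group fits in driver memory:

```python
# Naive realization of the pseudocode: one Spark job per group.
# Assumes each single group is small enough for driver memory.
keys = (
    df_all.select("product_type", "series_no")
          .distinct()
          .collect()
)

for row in keys:
    group_df = df_all.filter(
        (df_all.product_type == row["product_type"])
        & (df_all.series_no == row["series_no"])
    )
    pdf = group_df.toPandas()  # pandas DataFrame for this group only
    # ... process pdf here ...
```

This works, but it launches a separate filter job per group, which seems inefficient when there are many groups.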
Please let me know if there is an efficient way to do this with a PySpark DataFrame.
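I have also come across groupBy(...).applyInPandas from Spark 3.x, which appears to hand each group to a Python function as a pandas DataFrame on the executors rather than the driver. Would a sketch like the following be the right direction? (process_group is just a placeholder name here):

```python
import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds one (product_type, series_no) group as pandas.
    # Placeholder logic: return the group unchanged.
    return pdf

result = (
    df_all.groupBy("product_type", "series_no")
          .applyInPandas(process_group, schema=df_all.schema)
)
```

This would avoid collecting the whole dataset to the driver, but I am not sure it is the idiomatic way to iterate over groups.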
question from: https://stackoverflow.com/questions/65852508/how-to-groupby-without-aggregation-in-pyspark-dataframe