how to groupby without aggregation in pyspark dataframe

I have a very large dataset and I need to use a PySpark DataFrame. Here is a simplified version of the data:

product_type    series_no    product_amount    date
    514            111             20          2020/01/01 (YYYY/MM/DD)
    514            111             30          2020/01/02
    514            111             40          2020/01/03
    514            111             50          2020/01/04
    514            112             60          2020/01/01
    514            112             70          2020/01/02
    514            112             80          2020/01/03

I am trying to group this data by (product_type, series_no) to get groups of rows without any aggregation. For this simplified version of the data, there are two groups:

    group1:
    514            111             20          2020/01/01
    514            111             30          2020/01/02
    514            111             40          2020/01/03
    514            111             50          2020/01/04
    group2:
    514            112             60          2020/01/01
    514            112             70          2020/01/02
    514            112             80          2020/01/03

Is there any way to get those groups with a PySpark DataFrame? The data is very large, and it throws a memory error if I convert it all to a pandas DataFrame. I am trying to get the groups described in the pseudocode below:

Assume the data is stored in a PySpark DataFrame called df_all.

for group in df_all.groups:
    # convert this group to a pandas DataFrame

Please let me know if there is an efficient way to do this with a PySpark DataFrame.

question from: https://stackoverflow.com/questions/65852508/how-to-groupby-without-aggregation-in-pyspark-dataframe


1 Answer


You can get your groups like this: first collect the distinct values of the product_type and series_no columns, then loop through those values and filter the original DataFrame:

from pyspark.sql.functions import col

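# collect the distinct (product_type, series_no) pairs to the driver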
groups = list(map(
    lambda row: (row.product_type, row.series_no),
    df.select("product_type", "series_no").distinct().collect()
))

for group in groups:
    # replace here with your logic
    print(f"Group: product_type={group[0]} and series_no={group[1]}")
    df.filter((col("product_type") == group[0]) & (col("series_no") == group[1])).show()


#Group: product_type=514 and series_no=112
#+------------+---------+--------------+----------+
#|product_type|series_no|product_amount|      date|
#+------------+---------+--------------+----------+
#|         514|      112|            60|2020/01/01|
#|         514|      112|            70|2020/01/02|
#|         514|      112|            80|2020/01/03|
#+------------+---------+--------------+----------+

#Group: product_type=514 and series_no=111
#+------------+---------+--------------+----------+
#|product_type|series_no|product_amount|      date|
#+------------+---------+--------------+----------+
#|         514|      111|            20|2020/01/01|
#|         514|      111|            30|2020/01/02|
#|         514|      111|            40|2020/01/03|
#|         514|      111|            50|2020/01/04|
#+------------+---------+--------------+----------+
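Since the pseudocode in the question ultimately converts each group to pandas, another option (assuming Spark 3.x with pyarrow installed) is a grouped-map transform: Spark hands every (product_type, series_no) group to a Python function as a pandas DataFrame, in parallel across executors, instead of collecting keys and filtering one group at a time on the driver. This is a minimal sketch; process_group is a hypothetical placeholder for the real per-group logic:

import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds one (product_type, series_no) group as a pandas DataFrame.
    # Put the real per-group logic here; this placeholder returns it unchanged.
    return pdf

# Each group is processed independently and the results are combined
# back into a single Spark DataFrame with the declared schema.
result = (
    df.groupBy("product_type", "series_no")
      .applyInPandas(process_group, schema=df.schema)
)
result.show()

Note that each individual group still has to fit in the memory of a single executor, and the function must return a pandas DataFrame whose columns match the declared schema.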

