how to groupby without aggregation in pyspark dataframe

I have a very large dataset and I need to use a PySpark DataFrame. Here is a simplified version of the data:

product_type    series_no    product_amount    date
    514            111             20          2020/01/01 (YYYY/MM/DD)
    514            111             30          2020/01/02
    514            111             40          2020/01/03
    514            111             50          2020/01/04
    514            112             60          2020/01/01
    514            112             70          2020/01/02
    514            112             80          2020/01/03

I am trying to group this data by (product_type, series_no) to get groups of rows without any aggregation. For this simplified version of the data, there are two groups:

    group1:
    514            111             20          2020/01/01
    514            111             30          2020/01/02
    514            111             40          2020/01/03
    514            111             50          2020/01/04
    group2:
    514            112             60          2020/01/01
    514            112             70          2020/01/02
    514            112             80          2020/01/03

Is there any way to get those groups with a PySpark DataFrame? The data is very large, and it throws a memory error if I convert it all to a pandas DataFrame. I am trying to get the groups described in the pseudocode below:

Assume the data is stored in a PySpark DataFrame called df_all.

for group in df_all.groups:
    # convert this group to a pandas DataFrame

Please let me know if there is an efficient way to do this with a PySpark DataFrame.

question from: https://stackoverflow.com/questions/65852508/how-to-groupby-without-aggregation-in-pyspark-dataframe


1 Answer


You can get your groups like this: first collect the distinct values of the product_type and series_no columns, then loop through those values and filter the original DataFrame:

from pyspark.sql.functions import col

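# collect the distinct (product_type, series_no) pairs to the driver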
groups = list(map(
    lambda row: (row.product_type, row.series_no),
    df.select("product_type", "series_no").distinct().collect()
))

for group in groups:
    # replace here with your logic
    print(f"Group: product_type={group[0]} and series_no={group[1]}")
    df.filter((col("product_type") == group[0]) & (col("series_no") == group[1])).show()


#Group: product_type=514 and series_no=112
#+------------+---------+--------------+----------+
#|product_type|series_no|product_amount|      date|
#+------------+---------+--------------+----------+
#|         514|      112|            60|2020/01/01|
#|         514|      112|            70|2020/01/02|
#|         514|      112|            80|2020/01/03|
#+------------+---------+--------------+----------+

#Group: product_type=514 and series_no=111
#+------------+---------+--------------+----------+
#|product_type|series_no|product_amount|      date|
#+------------+---------+--------------+----------+
#|         514|      111|            20|2020/01/01|
#|         514|      111|            30|2020/01/02|
#|         514|      111|            40|2020/01/03|
#|         514|      111|            50|2020/01/04|
#+------------+---------+--------------+----------+
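Since the pseudocode in the question ultimately converts each group to pandas, another option (assuming Spark 3.x with pyarrow installed) is a grouped-map transform: Spark hands every (product_type, series_no) group to a Python function as a pandas DataFrame, in parallel across executors, instead of collecting keys and filtering one group at a time on the driver. This is a minimal sketch; process_group is a hypothetical placeholder for the real per-group logic:

import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds one (product_type, series_no) group as a pandas DataFrame.
    # Put the real per-group logic here; this placeholder returns it unchanged.
    return pdf

# Each group is processed independently and the results are combined
# back into a single Spark DataFrame with the declared schema.
result = (
    df.groupBy("product_type", "series_no")
      .applyInPandas(process_group, schema=df.schema)
)
result.show()

Note that each individual group still has to fit in the memory of a single executor, and the function must return a pandas DataFrame whose columns match the declared schema.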

