apache spark - How to find out the number of unique elements for a column in a group in PySpark?

I have a PySpark DataFrame:

df1 = spark.createDataFrame([
    ("u1", 1),
    ("u1", 2),
    ("u2", 1),
    ("u2", 1),
    ("u2", 1),
    ("u3", 3),
    ],
    ['user_id', 'var1'])

print(df1.printSchema())  # printSchema() prints the schema and returns None, hence the "None" below
df1.show(truncate=False)

Output:

root
 |-- user_id: string (nullable = true)
 |-- var1: long (nullable = true)

None
+-------+----+
|user_id|var1|
+-------+----+
|u1     |1   |
|u1     |2   |
|u2     |1   |
|u2     |1   |
|u2     |1   |
|u3     |3   |
+-------+----+

Now I want to group by user_id and show the number of unique var1 values for each user in a new column. The desired output would look like this:

+-------+---------------+
|user_id|num_unique_var1|
+-------+---------------+
|u1     |2              |
|u2     |1              |
|u3     |1              |
+-------+---------------+

I could use collect_set and write a UDF to take the length of the resulting set, but I think there must be a better way to do it. How do I achieve this in one line of code?
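For reference, the collect_set-plus-UDF approach I have in mind would look roughly like this (just a sketch; the helper name and return type are my own choices):

from pyspark.sql import functions as F, types as T

# Collect the distinct var1 values per user, then measure the length
# of the resulting array with a small UDF.
set_size = F.udf(lambda s: len(s), T.IntegerType())

df1.groupBy('user_id') \
    .agg(set_size(F.collect_set('var1')).alias('num_unique_var1')) \
    .show()

This should produce the desired table, but the UDF feels unnecessary.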

1 Answer

from pyspark.sql import functions as F

df1.groupBy('user_id').agg(F.countDistinct('var1').alias('num')).show()

countDistinct is exactly what I needed.

Output:

+-------+---+
|user_id|num|
+-------+---+
|     u3|  1|
|     u2|  1|
|     u1|  2|
+-------+---+
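If you want to avoid a UDF but still go through collect_set, F.size can take the length of the collected array directly; and on very large data, approx_count_distinct trades a little accuracy for speed. A rough sketch (the column aliases are arbitrary):

from pyspark.sql import functions as F

# Exact count without a UDF: size of the set of distinct var1 values per user.
df1.groupBy('user_id').agg(
    F.size(F.collect_set('var1')).alias('num_unique_var1')
).show()

# Approximate count (HyperLogLog-based), useful when the data is very large.
df1.groupBy('user_id').agg(
    F.approx_count_distinct('var1').alias('approx_num_unique_var1')
).show()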
