apache spark - How to find out the number of unique elements for a column in a group in PySpark?

I have a PySpark DataFrame:

df1 = spark.createDataFrame([
    ("u1", 1),
    ("u1", 2),
    ("u2", 1),
    ("u2", 1),
    ("u2", 1),
    ("u3", 3),
    ],
    ['user_id', 'var1'])

print(df1.printSchema())  # printSchema() prints the schema and returns None, hence the "None" below
df1.show(truncate=False)

Output:

root
 |-- user_id: string (nullable = true)
 |-- var1: long (nullable = true)

None
+-------+----+
|user_id|var1|
+-------+----+
|u1     |1   |
|u1     |2   |
|u2     |1   |
|u2     |1   |
|u2     |1   |
|u3     |3   |
+-------+----+

Now I want to group by user_id and show the number of unique var1 values for each user in a new column. The desired output would look like this:

+-------+---------------+
|user_id|num_unique_var1|
+-------+---------------+
|u1     |2              |
|u2     |1              |
|u3     |1              |
+-------+---------------+

I could use collect_set and write a UDF to take the length of the resulting set, but I think there must be a better way to do it. How do I achieve this in one line of code?
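For reference, the collect_set-plus-UDF approach I have in mind would look roughly like this (just a sketch; the helper name and return type are my own choices):

from pyspark.sql import functions as F, types as T

# Collect the distinct var1 values per user, then measure the length
# of the resulting array with a small UDF.
set_size = F.udf(lambda s: len(s), T.IntegerType())

df1.groupBy('user_id') \
    .agg(set_size(F.collect_set('var1')).alias('num_unique_var1')) \
    .show()

This should produce the desired table, but the UDF feels unnecessary.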

1 Answer

from pyspark.sql import functions as F

df1.groupBy('user_id').agg(F.countDistinct('var1').alias('num')).show()

countDistinct is exactly what I needed.

Output:

+-------+---+
|user_id|num|
+-------+---+
|     u3|  1|
|     u2|  1|
|     u1|  2|
+-------+---+
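If you want to avoid a UDF but still go through collect_set, F.size can take the length of the collected array directly; and on very large data, approx_count_distinct trades a little accuracy for speed. A rough sketch (the column aliases are arbitrary):

from pyspark.sql import functions as F

# Exact count without a UDF: size of the set of distinct var1 values per user.
df1.groupBy('user_id').agg(
    F.size(F.collect_set('var1')).alias('num_unique_var1')
).show()

# Approximate count (HyperLogLog-based), useful when the data is very large.
df1.groupBy('user_id').agg(
    F.approx_count_distinct('var1').alias('approx_num_unique_var1')
).show()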
