Use the countDistinct function:
from pyspark.sql.functions import countDistinct

# Sample data: (year, id) pairs containing duplicates
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x, ["year", "id"])

# Count the distinct ids within each year
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()
Output:
+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
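If you want a friendlier column name, or an approximate count on very large data, you can alias the aggregate or switch to approx_count_distinct. Both functions are in pyspark.sql.functions; the alias names below (distinct_ids, approx_distinct_ids) are just illustrative choices, and this is a sketch reusing the y DataFrame from above:

from pyspark.sql.functions import countDistinct, approx_count_distinct

# Exact distinct count, with a readable output column name
y.groupBy("year").agg(countDistinct("id").alias("distinct_ids")).show()

# Approximate distinct count (HyperLogLog-based), cheaper on large data;
# rsd is the maximum allowed relative standard deviation of the estimate
y.groupBy("year").agg(approx_count_distinct("id", rsd=0.05).alias("approx_distinct_ids")).show()

The approximate variant trades a small, bounded error for much lower memory use, which matters when the number of distinct values per group is very large.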