Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
991 views
in Technique [Technology] by (71.8m points)

pyspark - Spark count records into specified ranges

I am trying to split the counts of one column into several range columns using PySpark. I know how to do this in SQL, but I am not clear on how to do it in PySpark. I would be glad if anyone could enlighten me.

I want to count the values of the matches column into 3 bins, each reported as its own column:

  • matches = 0,
  • matches >= 1 and <= 3,
  • matches >= 1 and <= 5

Sample DataFrame:

+-----+-------+
|names|matches|
+-----+-------+
|  Sam|      1|
|  Tom|      3|
|  Max|      5|
|  Kai|      7|
+-----+-------+

Expected DataFrame Outcome:

+-----------+-----------+-------+
| 0 matches | lessthan3 | upto5 |
+-----------+-----------+-------+
|         0 |         1 |     3 |
+-----------+-----------+-------+
Question from: https://stackoverflow.com/questions/65867294/spark-count-records-into-specified-ranges


1 Answer

0 votes
by (71.8m points)

Use conditional summation, as you would in SQL:

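from pyspark.sql import functions as F

# One conditional sum per bin: a row adds 1 when it meets the condition, else 0.
# The select() returns only the aggregated columns, so "names" needs no drop().
df1 = df.select(
    F.sum(F.when(F.col("matches") == 0, 1).otherwise(0)).alias("0 matches"),
    # between() is inclusive on both ends, so this counts matches of 1 or 2.
    F.sum(F.when(F.col("matches").between(1, 2), 1).otherwise(0)).alias("lessthan3"),
    F.sum(F.when(F.col("matches") >= 3, 1).otherwise(0)).alias("morethan3"),
)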

df1.show()

#+---------+---------+---------+
#|0 matches|lessthan3|morethan3|
#+---------+---------+---------+
#|        0|        1|        3|
#+---------+---------+---------+
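
The counts above can be checked by hand on the four sample rows. The sketch below is plain Python rather than Spark: `conditional_count` is a hypothetical helper that stands in for `F.sum(F.when(cond, 1).otherwise(0))`. The `upto5` entry assumes the question's third bin means 1 to 5 inclusive; that reading is an assumption, not something the answer states.

```python
# Sample rows from the question: (name, matches)
rows = [("Sam", 1), ("Tom", 3), ("Max", 5), ("Kai", 7)]


def conditional_count(rows, predicate):
    """Plain-Python stand-in for F.sum(F.when(cond, 1).otherwise(0))."""
    return sum(1 if predicate(matches) else 0 for _, matches in rows)


counts = {
    "0 matches": conditional_count(rows, lambda m: m == 0),
    # Column.between(1, 2) is inclusive on both ends
    "lessthan3": conditional_count(rows, lambda m: 1 <= m <= 2),
    "morethan3": conditional_count(rows, lambda m: m >= 3),
    # Assumed reading of the question's third bin; it overlaps "lessthan3"
    "upto5": conditional_count(rows, lambda m: 1 <= m <= 5),
}

print(counts)
```

Because every bin is its own conditional sum, the ranges are free to overlap: Sam is counted in both `lessthan3` and `upto5`. In PySpark the overlapping bin would presumably be one more `F.sum(F.when(F.col("matches").between(1, 5), 1).otherwise(0))` entry in the same `select`.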

Another way of doing this is to group by the ranges and count:

# Label each row with its range, then count the rows per label.
df1 = df.withColumn(
    "range",
    F.when(F.col("matches") == 0, "0 matches")
        .when(F.col("matches").between(1, 2), "lessthan3")
        .when(F.col("matches") >= 3, "morethan3")
).groupBy("range").count()

df1.show()

#+---------+-----+
#|    range|count|
#+---------+-----+
#|lessthan3|    1|
#|morethan3|    3|
#+---------+-----+
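
Note that the grouped output has no "0 matches" row: a range that no row falls into simply does not appear. The following plain-Python sketch (not Spark; `to_range` and `all_ranges` are hypothetical names used only for illustration) mirrors the chained conditions with `collections.Counter` and shows one way to fill in the empty ranges afterwards.

```python
from collections import Counter

# Sample rows from the question: (name, matches)
rows = [("Sam", 1), ("Tom", 3), ("Max", 5), ("Kai", 7)]


def to_range(matches):
    """Mirror the chained when() calls: the first matching condition wins."""
    if matches == 0:
        return "0 matches"
    if 1 <= matches <= 2:
        return "lessthan3"
    if matches >= 3:
        return "morethan3"
    return None  # chained when() without otherwise() yields null here


# Equivalent of groupBy("range").count(): only labels that occur are present
grouped = Counter(to_range(matches) for _, matches in rows)
print(dict(grouped))

# Fill in the ranges that received no rows so every bin is reported
all_ranges = ["0 matches", "lessthan3", "morethan3"]
filled = {name: grouped.get(name, 0) for name in all_ranges}
print(filled)
```

Each row receives exactly one label here, so this approach fits disjoint ranges only; overlapping bins need the conditional-sum form.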
