I am coming from R and the tidyverse to PySpark due to its superior Spark handling, and I am struggling to map certain concepts from one context to the other.
In particular, suppose that I had a dataset like the following:
x | y
--+--
a | 5
a | 8
a | 7
b | 1
and I wanted to add a column containing the number of rows for each x
value, like so:
x | y | n
--+---+---
a | 5 | 3
a | 8 | 3
a | 7 | 3
b | 1 | 1
In dplyr, I would just say:
library(tidyverse)

df <- read_csv("...")

df %>%
  group_by(x) %>%
  mutate(n = n()) %>%
  ungroup()
and that would be that. I can do something almost as simple in PySpark if all I want is to summarize by the number of rows per group:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

(spark.read.csv("...")
    .groupBy(col("x"))
    .count()
    .show())
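On the toy data above, I expect that to collapse down to one row per group, something like:

x | count
--+------
a | 3
b | 1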
And I thought I understood that withColumn was equivalent to dplyr's mutate. However, when I do the following, PySpark tells me that withColumn is not defined for grouped data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.getOrCreate()

(spark.read.csv("...")
    .groupBy(col("x"))
    .withColumn("n", count("x"))  # fails: withColumn is not available on grouped data
    .show())
In the short run, I can simply create a second dataframe containing the counts and join it back to the original dataframe, roughly as sketched below. However, it seems like this could become inefficient for large tables. What is the canonical way to accomplish this?
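Something like the following, where x, y, and n are just the toy column names from above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("...")

# Build a second dataframe of per-group counts, then join it back on x.
counts = (df
    .groupBy(col("x"))
    .count()
    .withColumnRenamed("count", "n"))

df.join(counts, on="x", how="left").show()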