I'm trying to divide people into age ranges with the following code:
from pyspark import SparkFiles
from pyspark.sql import functions as fn
## Import data
url_users = "https://raw.githubusercontent.com/leanhdung1994/BigData/main/users.csv"
spark.sparkContext.addFile(url_users)
users_from_file = spark.read.csv("file://" + SparkFiles.get("users.csv"), header = True, sep = ",", inferSchema = True)
## Generate column age
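from datetime import date
from pyspark.sql.types import IntegerType
reference_date = date(2017, 12, 31)
def cal_age(born):
    # Subtract 1 when the birthday has not yet occurred by the reference date
    return reference_date.year - born.year - ((reference_date.month, reference_date.day) < (born.month, born.day))
cal_age_udf = fn.udf(cal_age, IntegerType())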
users_from_file = users_from_file.withColumn('age', cal_age_udf(fn.to_date(fn.col('birth_date'))))
## Generate column range
users_from_file1 = users_from_file.withColumn('range', fn.when(fn.col("age") <= 25, 1)fn.when(fn.col("age") <= 35, 2).fn.otherwise(3))
users_from_file1.show()
Then it returns an error:
SyntaxError: invalid syntax
File "<command-2296735704765764>", line 3
users_from_file1 = users_from_file.withColumn('range', fn.when(fn.col("age") <= 25, 1)fn.when(fn.col("age") <= 35, 2).fn.otherwise(3))
^
SyntaxError: invalid syntax
Could you please elaborate on this nested when? This when syntax is from this answer, but it does not work.
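For reference, here is a minimal sketch of how the two forms differ at the syntax level. It assumes the intended expression is a chain of Column methods: `fn.when(...)` returns a Column, and further branches are attached with that Column's `.when(...)` and `.otherwise(...)` methods rather than with `fn.when` or `fn.otherwise`. The helper `parses` and the variable names are made up for illustration. The snippet only checks whether Python can parse each expression, so it runs without a Spark session.

```python
import ast

# The expression from the question: the second `fn.when` is not joined to the
# first call by a dot, and `.fn.otherwise` is not a Column method.
broken = 'fn.when(fn.col("age") <= 25, 1)fn.when(fn.col("age") <= 35, 2).fn.otherwise(3)'

# Assumed intended form: chain the Column methods .when() and .otherwise().
chained = 'fn.when(fn.col("age") <= 25, 1).when(fn.col("age") <= 35, 2).otherwise(3)'

def parses(expr):
    """Return True if `expr` is a syntactically valid Python expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False

broken_ok = parses(broken)
chained_ok = parses(chained)
print("broken parses:", broken_ok)
print("chained parses:", chained_ok)
```

If the chained form is what was intended, it would be passed as the second argument of `withColumn('range', ...)` in place of the original expression.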
question from:
https://stackoverflow.com/questions/65898638/why-this-nested-when-does-not-work-in-pyspark