apache spark - Pyspark: Pass multiple columns in UDF

Question

Welcome To Ask or Share your Answers For Others

apache spark - Pyspark: Pass multiple columns in UDF

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:06:57+0000

If all columns you want to pass to UDF have the same data type you can use array as input parameter, for example:

>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf, array
>>> sum_cols = udf(lambda arr: sum(arr), IntegerType())
>>> spark.createDataFrame([(101, 1, 16)], ['ID', 'A', 'B']) 
...     .withColumn('Result', sum_cols(array('A', 'B'))).show()
+---+---+---+------+
| ID|  A|  B|Result|
+---+---+---+------+
|101|  1| 16|    17|
+---+---+---+------+

>>> spark.createDataFrame([(101, 1, 16, 8)], ['ID', 'A', 'B', 'C'])
...     .withColumn('Result', sum_cols(array('A', 'B', 'C'))).show()
+---+---+---+---+------+
| ID|  A|  B|  C|Result|
+---+---+---+---+------+
|101|  1| 16|  8|    25|
+---+---+---+---+------+

Categories

apache spark - Pyspark: Pass multiple columns in UDF

apache spark - Pyspark: Pass multiple columns in UDF

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags