python - What data type does VectorAssembler require for an input?

asked Oct 7, 2021 in Technique by 深蓝 (71.8m points)

The core problem is this:

from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame([([1, 2, 3], 0, 3)], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
vecAssembler.transform(df).show()

This fails with the error IllegalArgumentException: Data type array<bigint> of column a is not supported.

I know this is a bit of a toy problem, but I'm trying to integrate this into a longer pipeline with these steps:

1. StringIndexer
2. OneHotEncoding
3. Custom UnaryTransformer to multiply all the 1's by 10
   - What datatype should be returned here?
4. Then VectorAssembler to combine the vectors into a single vector for modeling.

If I can determine the proper input datatype for the VectorAssembler I should be able to string everything together properly. I think the input type is a Vector, but I can't figure out how to build one.

question from:https://stackoverflow.com/questions/65929680/what-data-type-does-vectorassembler-require-for-an-input

1 Answer

深蓝 · Answer 1 · 2021-10-06T19:06:11+0000

According to the docs,

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.

So you need to convert your array column to a vector column first (method from this question).

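from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# UDF that turns a Python list into a DenseVector; VectorUDT is the column type of ML vectors
list_to_vector_udf = udf(lambda l: Vectors.dense(l), VectorUDT())
# Replace the array column 'a' with its vector equivalent
df_with_vectors = df.withColumn('a', list_to_vector_udf('a'))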

Then you can use vector assembler:

vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])

vecAssembler.transform(df_with_vectors).show(truncate=False)
+-------------+---+---+---------------------+
|a            |b  |c  |features             |
+-------------+---+---+---------------------+
|[1.0,2.0,3.0]|0  |3  |[1.0,2.0,3.0,0.0,3.0]|
+-------------+---+---+---------------------+
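
To make the rule from the docs concrete, here is a rough plain-Python model of what the assembler does with one row. It is only a sketch, not Spark's implementation: `Vec` and `assemble` are hypothetical names made up for illustration, `Vec` stands in for an ML vector, and a plain Python list stands in for an `array<bigint>` column.

```python
from numbers import Real


class Vec:
    """Hypothetical stand-in for a Spark ML vector (not pyspark's class)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]


def assemble(row, input_cols):
    """Flatten the chosen columns of one row into a single list of floats."""
    out = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, Vec):
            # Vector type: every element is appended in order.
            out.extend(value.values)
        elif isinstance(value, (bool, Real)):
            # Boolean and numeric types: one slot each, stored as a float.
            out.append(float(value))
        else:
            # Anything else (here: a plain list, modelling array<bigint>) is rejected.
            raise TypeError(
                f"Data type {type(value).__name__} of column {name} is not supported."
            )
    return out


row = {"a": [1, 2, 3], "b": 0, "c": 3}

# The raw list column is rejected, like the array column in the question.
try:
    assemble(row, ["a", "b", "c"])
    rejected = False
except TypeError:
    rejected = True

# After converting the list to a vector, the same row assembles fine.
row_with_vector = {**row, "a": Vec(row["a"])}
features = assemble(row_with_vector, ["a", "b", "c"])
print(rejected, features)  # True [1.0, 2.0, 3.0, 0.0, 3.0]
```

The flattened result matches the features column in the output above. As a side note, Spark 3.1 and later also ship pyspark.ml.functions.array_to_vector, which does the array-to-vector conversion without a Python UDF.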
