Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
332 views
in Technique[技术] by (71.8m points)

python - What data type does VectorAssembler require for an input?

The Core problem is this here

from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame([([1, 2, 3], 0, 3)], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
vecAssembler.transform(df).show()

with error IllegalArgumentException: Data type array<bigint> of column a is not supported.

I know this is a bit of a toy problem, but I'm trying to integrate this into a longer pipeline with steps

  • StringIndexer
  • OneHotEncoding
  • Custom UnaryTransformer to multiply all the 1's by 10
    • What datatype should be returned here?
  • Then VectorAssembler to combine the vectors into a single vector for modeling.

If I can determine the proper input datatype for the VectorAssembler I should be able to string everything together properly. I think the input type is a Vector, but I can't figure out how to build one.

question from:https://stackoverflow.com/questions/65929680/what-data-type-does-vectorassembler-require-for-an-input

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

According to the docs,

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.

So you need to convert your array column to a vector column first (method from this question).

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
list_to_vector_udf = udf(lambda l: Vectors.dense(l), VectorUDT())
df_with_vectors = df.withColumn('a', list_to_vector_udf('a'))

Then you can use vector assembler:

vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])

vecAssembler.transform(df_with_vectors).show(truncate=False)
+-------------+---+---+---------------------+
|a            |b  |c  |features             |
+-------------+---+---+---------------------+
|[1.0,2.0,3.0]|0  |3  |[1.0,2.0,3.0,0.0,3.0]|
+-------------+---+---+---------------------+

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...