Is there an easy way to transform multiple categorical columns that share labels into integer columns, so that a shared label maps to the same integer in every column?
Here is what I tried:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
df = spark.createDataFrame(
[(0, "a", "b"), (1, "b", "b"), (2, "c", "b"),
(3, "a", "b"), (4, "a", "a"), (5, "c", "a")],
["id", "c1", "c2"])
columns = df.columns
columns.remove('id')
indexers = [StringIndexer(inputCol=col, outputCol="{}_index".format(col))
            for col in columns]
pipeline = Pipeline(stages=indexers)
indexed = pipeline.fit(df).transform(df)
indexed.show()
+---+---+---+--------+--------+
| id| c1| c2|c1_index|c2_index|
+---+---+---+--------+--------+
| 0| a| b| 0.0| 0.0|
| 1| b| b| 2.0| 0.0|
| 2| c| b| 1.0| 0.0|
| 3| a| b| 0.0| 0.0|
| 4| a| a| 0.0| 1.0|
| 5| c| a| 1.0| 1.0|
+---+---+---+--------+--------+
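(For reference: each StringIndexer is fitted on its own column, so with the default frequencyDesc ordering the most frequent label in that column gets index 0.0, which is why the same label gets different indices in c1_index and c2_index. A plain-Python sketch of that ordering logic, as an illustration rather than Spark code:)

```python
from collections import Counter

def indexer_mapping(values):
    # Mimic StringIndexer's default frequencyDesc ordering: labels sorted by
    # descending frequency, so the most frequent label gets index 0.0.
    # (Ties are broken alphabetically here; Spark's tie-breaking may differ
    # by version.)
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

c1 = ["a", "b", "c", "a", "a", "c"]  # values from the example DataFrame
c2 = ["b", "b", "b", "b", "a", "a"]

print(indexer_mapping(c1))  # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(indexer_mapping(c2))  # {'b': 0.0, 'a': 1.0}
```

(This is why a gets 0.0 in c1_index, where it appears three times, but 1.0 in c2_index, where b is more frequent.)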
The result I would like to get is:
+---+---+---+--------+--------+
| id| c1| c2|c1_index|c2_index|
+---+---+---+--------+--------+
| 0| a| b| 0.0| 2.0|
| 1| b| b| 2.0| 2.0|
| 2| c| b| 1.0| 2.0|
| 3| a| b| 0.0| 2.0|
| 4| a| a| 0.0| 0.0|
| 5| c| a| 1.0| 0.0|
+---+---+---+--------+--------+
I imagine I could extract all the unique values across the columns, build a dictionary, and use it to substitute values in all the categorical columns, but I wonder whether there is an easier way to do it.
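(The dictionary idea can be sketched in plain Python. Note an assumption in this sketch: the shared labels are ordered by their frequency in c1, which happens to reproduce the desired table above; ordering by combined frequency across both columns would give a different, but equally consistent, numbering. In Spark the resulting dict could then be applied with a join against a small mapping DataFrame or a chain of F.when conditions.)

```python
from collections import Counter

rows = [(0, "a", "b"), (1, "b", "b"), (2, "c", "b"),
        (3, "a", "b"), (4, "a", "a"), (5, "c", "a")]

# Build one shared label -> index dictionary. Labels are ordered by their
# frequency in c1 (descending, ties broken alphabetically). Labels that
# appear only in other columns would need extra handling; here c2's labels
# are a subset of c1's.
c1_counts = Counter(r[1] for r in rows)
ordered = sorted(c1_counts, key=lambda label: (-c1_counts[label], label))
mapping = {label: float(i) for i, label in enumerate(ordered)}

# Apply the same dictionary to both categorical columns.
indexed = [(i, c1, c2, mapping[c1], mapping[c2]) for i, c1, c2 in rows]
for row in indexed:
    print(row)
```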
My system runs pyspark 2.2.0.
Edit:
I've tried to use the solution proposed by @chlebek. I adapted it for pyspark 2.2.0 and this is the result:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
df = spark.createDataFrame(
[(0, "a", "b"), (1, "b", "b"), (2, "c", "b"),
(3, "a", "b"), (4, "a", "a"), (5, "c", "a")],
["id", "c1", "c2"])
columns = df.columns
columns.remove('id')
indexer = StringIndexer(inputCol='c1', outputCol='c1_i')
model = indexer.fit(df)
indexed = model.transform(df)
indexed.show()
model2 = model._java_obj.setInputCol('c2').setOutputCol('c2_i')
indexed2 = model2.transform(indexed)
indexed2.show()
Execution raises the following exception (I've omitted part of the output):
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-1f8dd5cc9b11> in <module>()
18
19 model2 = model._java_obj.setInputCol('c2').setOutputCol('c2_i')
---> 20 indexed2 = model2.transform(indexed)
21
22 indexed2.show()
[...]
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I guess that when I use model._java_obj I mess something up, but I don't know what exactly. The types of model and model2 are different, and AFAIK they should be the same:
print(type(model))
<class 'pyspark.ml.feature.StringIndexerModel'>
print(type(model2))
<class 'py4j.java_gateway.JavaObject'>
Edit 2:
I'll add the execution of the solution recommended by @chlebek, without adapting it for pyspark 2.2.0:
df = spark.createDataFrame(
[(0, "a", "b"), (1, "b", "b"), (2, "c", "b"),
(3, "a", "b"), (4, "a", "a"), (5, "c", "a")],
["id", "c1", "c2"])
columns = df.columns
columns.remove('id')
indexer = StringIndexer(inputCol='c1', outputCol='c1_i')
model = indexer.fit(df)
indexed = model.transform(df)
model2 = model.setInputCol('c2').setOutputCol('c2_i')
indexed2 = model2.transform(indexed)
indexed2.show()
It gives the following output:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-2bbc90b5fdd3> in <module>()
13 indexed.show()
14
---> 15 model2 = model.setInputCol('c2').setOutputCol('c2_i')
16 indexed2 = model2.transform(indexed)
17
AttributeError: 'StringIndexerModel' object has no attribute 'setInputCol'
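(This error is consistent with pyspark 2.2.0: StringIndexerModel does not expose setInputCol there; more recent PySpark releases do, which is presumably what @chlebek's answer relies on. A possible workaround in 2.2 is to reuse model.labels, which lists the fitted labels in index order, and rebuild the mapping by hand. A plain-Python sketch, hard-coding the labels that the c1 fit above produces:)

```python
# model.labels lists labels in index order; for the c1 fit in the example
# it is ['a', 'c', 'b'] (descending frequency). Reusing it on c2 yields the
# shared mapping without needing setInputCol.
labels = ["a", "c", "b"]  # i.e. model.labels from the c1 fit
mapping = {label: float(i) for i, label in enumerate(labels)}

c2 = ["b", "b", "b", "b", "a", "a"]
c2_index = [mapping[v] for v in c2]
print(c2_index)  # [2.0, 2.0, 2.0, 2.0, 0.0, 0.0]
```

(This matches the c2_index column in the desired output; in Spark the dict could then be applied via a join against a mapping DataFrame or a udf.)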
question from:
https://stackoverflow.com/questions/65911146/how-to-transform-multiple-categorical-columns-to-integers-maintaining-shared-val