Is there an easy way to transform multiple categorical columns that share labels into integer columns, so that a shared label maps to the same integer in every column?
Here is what I tried:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
df = spark.createDataFrame(
[(0, "a", "b"), (1, "b", "b"), (2, "c", "b"),
(3, "a", "b"), (4, "a", "a"), (5, "c", "a")],
["id", "c1", "c2"])
columns = df.columns
columns.remove('id')
indexers = [StringIndexer(inputCol=col, outputCol="{}_index".format(col))
            for col in columns]
pipeline = Pipeline(stages=indexers)
indexed = pipeline.fit(df).transform(df)
indexed.show()
+---+---+---+--------+--------+
| id| c1| c2|c1_index|c2_index|
+---+---+---+--------+--------+
| 0| a| b| 0.0| 0.0|
| 1| b| b| 2.0| 0.0|
| 2| c| b| 1.0| 0.0|
| 3| a| b| 0.0| 0.0|
| 4| a| a| 0.0| 1.0|
| 5| c| a| 1.0| 1.0|
+---+---+---+--------+--------+
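(For reference: each StringIndexer is fitted on its own column, so with the default frequencyDesc ordering the most frequent label in that column gets index 0.0, which is why the same label gets different indices in c1_index and c2_index. A plain-Python sketch of that ordering logic, as an illustration rather than Spark code:)

```python
from collections import Counter

def indexer_mapping(values):
    # Mimic StringIndexer's default frequencyDesc ordering: labels sorted by
    # descending frequency, so the most frequent label gets index 0.0.
    # (Ties are broken alphabetically here; Spark's tie-breaking may differ
    # by version.)
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

c1 = ["a", "b", "c", "a", "a", "c"]  # values from the example DataFrame
c2 = ["b", "b", "b", "b", "a", "a"]

print(indexer_mapping(c1))  # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(indexer_mapping(c2))  # {'b': 0.0, 'a': 1.0}
```

(This is why a gets 0.0 in c1_index, where it appears three times, but 1.0 in c2_index, where b is more frequent.)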
The result I would like to get is:
+---+---+---+--------+--------+
| id| c1| c2|c1_index|c2_index|
+---+---+---+--------+--------+
| 0| a| b| 0.0| 2.0|
| 1| b| b| 2.0| 2.0|
| 2| c| b| 1.0| 2.0|
| 3| a| b| 0.0| 2.0|
| 4| a| a| 0.0| 0.0|
| 5| c| a| 1.0| 0.0|
+---+---+---+--------+--------+
I imagine I could extract all the unique values across the columns, build a dictionary, and use it to substitute values in all the categorical columns, but I wonder whether there is an easier way to do it.
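(The dictionary idea can be sketched in plain Python. Note an assumption in this sketch: the shared labels are ordered by their frequency in c1, which happens to reproduce the desired table above; ordering by combined frequency across both columns would give a different, but equally consistent, numbering. In Spark the resulting dict could then be applied with a join against a small mapping DataFrame or a chain of F.when conditions.)

```python
from collections import Counter

rows = [(0, "a", "b"), (1, "b", "b"), (2, "c", "b"),
        (3, "a", "b"), (4, "a", "a"), (5, "c", "a")]

# Build one shared label -> index dictionary. Labels are ordered by their
# frequency in c1 (descending, ties broken alphabetically). Labels that
# appear only in other columns would need extra handling; here c2's labels
# are a subset of c1's.
c1_counts = Counter(r[1] for r in rows)
ordered = sorted(c1_counts, key=lambda label: (-c1_counts[label], label))
mapping = {label: float(i) for i, label in enumerate(ordered)}

# Apply the same dictionary to both categorical columns.
indexed = [(i, c1, c2, mapping[c1], mapping[c2]) for i, c1, c2 in rows]
for row in indexed:
    print(row)
```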
My system runs pyspark 2.2.0.
Edit:
I've tried to use the solution proposed by @chlebek. I adapted it for pyspark 2.2.0 and this is the result:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
df = spark.createDataFrame(
[(0, "a", "b"), (1, "b", "b"), (2, "c", "b"),
(3, "a", "b"), (4, "a", "a"), (5, "c", "a")],
["id", "c1", "c2"])
columns = df.columns
columns.remove('id')
indexer = StringIndexer(inputCol='c1', outputCol='c1_i')
model = indexer.fit(df)
indexed = model.transform(df)
indexed.show()
model2 = model._java_obj.setInputCol('c2').setOutputCol('c2_i')
indexed2 = model2.transform(indexed)
indexed2.show()
Execution raises the following exception (I've omitted part of the output):
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-1f8dd5cc9b11> in <module>()
18
19 model2 = model._java_obj.setInputCol('c2').setOutputCol('c2_i')
---> 20 indexed2 = model2.transform(indexed)
21
22 indexed2.show()
[...]
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I guess that when I use model._java_obj I mess something up, but I don't know what exactly. The types of model and model2 are different, and AFAIK they should be the same:
print(type(model))
<class 'pyspark.ml.feature.StringIndexerModel'>
print(type(model2))
<class 'py4j.java_gateway.JavaObject'>
Edit 2:
I'll add the execution of the solution recommended by @chlebek, without adapting it for pyspark 2.2.0:
df = spark.createDataFrame(
[(0, "a", "b"), (1, "b", "b"), (2, "c", "b"),
(3, "a", "b"), (4, "a", "a"), (5, "c", "a")],
["id", "c1", "c2"])
columns = df.columns
columns.remove('id')
indexer = StringIndexer(inputCol='c1', outputCol='c1_i')
model = indexer.fit(df)
indexed = model.transform(df)
model2 = model.setInputCol('c2').setOutputCol('c2_i')
indexed2 = model2.transform(indexed)
indexed2.show()
It gives the following output:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-2bbc90b5fdd3> in <module>()
13 indexed.show()
14
---> 15 model2 = model.setInputCol('c2').setOutputCol('c2_i')
16 indexed2 = model2.transform(indexed)
17
AttributeError: 'StringIndexerModel' object has no attribute 'setInputCol'
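(This error is consistent with pyspark 2.2.0: StringIndexerModel does not expose setInputCol there; more recent PySpark releases do, which is presumably what @chlebek's answer relies on. A possible workaround in 2.2 is to reuse model.labels, which lists the fitted labels in index order, and rebuild the mapping by hand. A plain-Python sketch, hard-coding the labels that the c1 fit above produces:)

```python
# model.labels lists labels in index order; for the c1 fit in the example
# it is ['a', 'c', 'b'] (descending frequency). Reusing it on c2 yields the
# shared mapping without needing setInputCol.
labels = ["a", "c", "b"]  # i.e. model.labels from the c1 fit
mapping = {label: float(i) for i, label in enumerate(labels)}

c2 = ["b", "b", "b", "b", "a", "a"]
c2_index = [mapping[v] for v in c2]
print(c2_index)  # [2.0, 2.0, 2.0, 2.0, 0.0, 0.0]
```

(This matches the c2_index column in the desired output; in Spark the dict could then be applied via a join against a mapping DataFrame or a udf.)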
question from:
https://stackoverflow.com/questions/65911146/how-to-transform-multiple-categorical-columns-to-integers-maintaining-shared-val