python - How to convert list of dictionaries into Pyspark DataFrame

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I want to convert my list of dictionaries into a DataFrame. This is the list:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

This is my code:

from pyspark.sql.types import StringType

df = spark.createDataFrame(mylist, StringType())

df.show(2,False)

+-------------------------------------------+
|                                      value|
+-------------------------------------------+
|{type_activity_id=1,type_activity_name=xxx}|
|{type_activity_id=2,type_activity_name=yyy}|
|{type_activity_id=3,type_activity_name=zzz}|
+-------------------------------------------+

I assume that I should provide some mapping and types for each column, but I don't know how to do it.

Update:

I also tried this:

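from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

schema = ArrayType(
    StructType([StructField("type_activity_id", IntegerType()),
                StructField("type_activity_name", StringType())
                ]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))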

But then I get null values:

+-----+
|value|
+-----+
| null|
| null|
+-----+
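
A likely reason for the nulls, sketched in plain Python: the value column shown earlier holds text in a key=value style, which is not valid JSON, so a JSON parser has nothing it can parse. The snippet below uses the standard json module as a stand-in for from_json (an assumption made only so it runs without Spark), and is_valid_json is a hypothetical helper name.

```python
import json

record = {"type_activity_id": 1, "type_activity_name": "xxx"}

# Text in the key=value style seen in the "value" column: no quotes, "=" instead of ":".
map_style = "{type_activity_id=1, type_activity_name=xxx}"


def is_valid_json(text):
    """Hypothetical helper: report whether text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(map_style))           # False
print(is_valid_json(json.dumps(record)))  # True
```

If the strings really had to be parsed as JSON, serializing each dictionary with json.dumps first would at least give the parser valid input.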

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:02:55+0000

In the past, you could simply pass a list of dictionaries to spark.createDataFrame(). This still runs, but inferring the schema from dicts is now deprecated and emits a warning:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
#  warnings.warn("inferring schema from dict is deprecated,"

As this warning message says, you should use pyspark.sql.Row instead.

from pyspark.sql import Row
spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1               |xxx               |
#|2               |yyy               |
#|3               |zzz               |
#+----------------+------------------+

Here I used ** (keyword argument unpacking) to pass the dictionaries to the Row constructor.
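
To see what the ** unpacking does on its own, here is a minimal sketch that runs without Spark. ActivityRow is a hypothetical namedtuple used as a stand-in for pyspark.sql.Row; it is not part of PySpark.

```python
from collections import namedtuple

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# Hypothetical stand-in for pyspark.sql.Row so the sketch needs no Spark session.
ActivityRow = namedtuple("ActivityRow", ["type_activity_id", "type_activity_name"])

# ActivityRow(**d) expands each key/value pair into a keyword argument,
# i.e. ActivityRow(type_activity_id=1, type_activity_name="xxx") for the first dict.
rows = [ActivityRow(**d) for d in mylist]

print(rows[0])  # ActivityRow(type_activity_id=1, type_activity_name='xxx')
```

Unpacking with ** only works when every dictionary key is a string, since each key becomes a keyword argument name.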
