Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
600 views
in Technique[技术] by (71.8m points)

python - How to convert list of dictionaries into Pyspark DataFrame

I want to convert my list of dictionaries into DataFrame. This is the list:

mylist = 
[
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

This is my code:

from pyspark.sql.types import StringType

df = spark.createDataFrame(mylist, StringType())

df.show(2,False)

+-----------------------------------------+
|                                    value|
+-----------------------------------------+
|{type_activity_id=1,type_activity_id=xxx}|
|{type_activity_id=2,type_activity_id=yyy}|
|{type_activity_id=3,type_activity_id=zzz}|
+-----------------------------------------+

I assume that I should provide some mapping and types for each column, but I don't know how to do it.

Update:

I also tried this:

schema = ArrayType(
    StructType([StructField("type_activity_id", IntegerType()),
                StructField("type_activity_name", StringType())
                ]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))

But then I get null values:

+-----+
|value|
+-----+
| null|
| null|
+-----+
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

In the past, you were able to simply pass a dictionary to spark.createDataFrame(), but this is now deprecated:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
#  warnings.warn("inferring schema from dict is deprecated,"

As this warning message says, you should use pyspark.sql.Row instead.

from pyspark.sql import Row
spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1               |xxx               |
#|2               |yyy               |
#|3               |zzz               |
#+----------------+------------------+

Here I used ** (keyword argument unpacking) to pass the dictionaries to the Row constructor.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...