Apache Spark - Convert RDD to DataFrame using PySpark

Question

asked Feb 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a file in Spark with the following data.

I have read this file as an RDD using a=sc.textFile("/FileStore/tables/realestate.txt")

Now I need to convert this RDD into a DataFrame. I am using the command below:

d=spark.createDataFrame(a).toDF("Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status")

But I am getting this error: TypeError: Can not infer schema for type: <class 'str'>


1 Answer

深蓝 · Answer 1 · 2021-02-16T20:54:36+0000

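sc.textFile returns an RDD of plain strings, one per line, so Spark has no fields to infer a schema from. Split each line into fields first: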

d = spark.createDataFrame(a.map(lambda x: x.split('|'))).toDF("Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status")

Or equivalently, call toDF on the RDD directly:

d = a.map(lambda x: x.split('|')).toDF(["Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status"])
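Note that both variants above leave every column as a string, because split only produces strings. If typed columns are wanted on the RDD route, one option is to cast the fields while parsing. The following is a minimal sketch in plain Python, not part of the original answer: the sample lines and the column types (integer IDs and counts, float prices) are assumptions, since the question does not show the file contents, and parse_line is a hypothetical helper name.

```python
# Hypothetical sample lines; the real file contents are not shown in the question.
SAMPLE_LINES = [
    "1001|Springfield|250000|3|2|1500|166.67|Active",
    "1002|Shelbyville|180000|2|1|950|189.47|Sold",
]


def parse_line(line):
    """Split one pipe-delimited line and cast the numeric fields.

    The column types used here are assumptions made for illustration.
    """
    pid, location, price, beds, baths, size, price_sqft, status = line.split("|")
    return (
        int(pid),           # Property ID
        location,           # Location
        float(price),       # Price
        int(beds),          # Bedrooms
        int(baths),         # Bathrooms
        int(size),          # Size
        float(price_sqft),  # Price SQ Ft
        status,             # Status
    )


# Outside Spark, the helper can be checked on a plain list of lines.
rows = [parse_line(line) for line in SAMPLE_LINES]
print(rows[0])
```

With the RDD from the question, the same helper could be passed to a.map(parse_line) before calling toDF with the column names. If the file starts with a header line, that line would have to be filtered out first, because int() would fail on the column names.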

That said, I'd recommend using the Spark CSV reader for this, since it also handles the header row and can infer the column types:

df = spark.read.csv('/FileStore/tables/realestate.txt', header=True, inferSchema=True, sep='|')