Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


rdd - Difference between sc.textFile and spark.read.text in Spark

I am trying to read a simple text file into a Spark RDD, and I see that there are two ways of doing so:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
textRDD1 = sc.textFile("hobbit.txt")
textRDD2 = spark.read.text('hobbit.txt').rdd

When I look at the data, I see that the two RDDs are structured differently:

textRDD1.take(5)

['The king beneath the mountain',
 'The king of carven stone',
 'The lord of silver fountain',
 'Shall come unto his own',
 'His throne shall be upholden']

textRDD2.take(5)

[Row(value='The king beneath the mountain'),
 Row(value='The king of carven stone'),
 Row(value='The lord of silver fountain'),
 Row(value='Shall come unto his own'),
 Row(value='His throne shall be upholden')]

Because of this, all subsequent processing of textRDD2 has to be changed to account for the 'value' field of each Row.

My questions are:

  • (a) What are the implications of using each of these two ways of reading a text file?
  • (b) Under what circumstances should each method be used?

1 Answer


To answer (a),

sc.textFile(...) returns an RDD[String]:

textFile(String path, int minPartitions)

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

spark.read.text(...) returns a Dataset[Row], i.e. a DataFrame:

text(String path)

Loads text files and returns a DataFrame whose schema starts with a string column named "value", and followed by partitioned columns if there are any.

For (b), it depends on your use case. Since you want an RDD here, use sc.textFile, which gives you plain strings directly. If you plan to work with the DataFrame or Spark SQL API instead, use spark.read.text. You can always convert a DataFrame to an RDD and vice versa.

