Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Where does Spark store RDDs and Spark DataFrames within an ongoing Spark Application

I am running Spark on Kubernetes with the standalone Spark cluster manager and two Spark workers. I use Jupyter to set up Spark applications. The deploy mode is set to "client", so the driver process runs in the pod where Jupyter runs. We read a CSV file from an Amazon S3 proxy with requests.get(), transform it into an RDD, and then into a Spark DataFrame. For reading the CSV file from S3 we are not using the spark.read method but requests.get(). The whole process from reading to Spark DataFrame happens in a function which returns the DataFrame.

# S3PROXY == URL of the S3 proxy
import requests

def loadFromS3intoSparkDataframe(s3PathNameCsv):
    # Fetch the whole CSV into driver memory, split into lines,
    # then distribute the lines across 24 partitions.
    s3_rdd = spark2.sparkContext.parallelize(
        requests.get(S3PROXY + "/object", params="key={0}".format(s3PathNameCsv))
            .content.decode("UTF-8")
            .split('\n'),
        24
    ).map(lambda x: x.split(','))
    header = s3_rdd.first()
    return s3_rdd.filter(lambda row: row != header).toDF(header)

The RAM consumption for keeping this Spark DataFrame in memory is 5 GB, although the source CSV file is only 1 GB in size. The 5 GB of RAM remains allocated in the driver process. Some co-workers of mine say there should be an option to permanently move the in-memory storage to the Spark worker nodes, i.e. to the Spark executors. As far as I understand, this is only possible as a copy via persist() or cache().

So my question is: is my understanding correct that, by default, RDDs and DataFrames are stored in the driver process memory? If so, is it possible to move these variables to the executors for the whole lifetime of the Spark application? And is a blow-up from 1 GB to 5 GB during this transformation uncommon?

question from:https://stackoverflow.com/questions/65617532/where-does-spark-store-rdd%c2%b4s-and-spark-dataframes-within-ongoing-spark-applicati


1 Answer

Waiting for answers
