hadoop - Spark RDD - is partition(s) always in RAM?

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:45:30+0000

If I create 10 RDDs in my pySpark shell from HDFS, does that mean the data of all 10 RDDs will reside in Spark memory?

Yes, but only once an action has been performed on them, because RDDs are lazily evaluated: creating an RDD loads no data by itself. After an action, the data of all 10 RDDs is spread across the RAM of the Spark worker machines, although not every machine has to hold a partition of each RDD.

If I do not delete an RDD, will it stay in memory forever?

No. Spark automatically unpersists an RDD or DataFrame that is no longer used: it monitors cache usage on each node and drops old partitions in least-recently-used (LRU) order. To find out whether an RDD or DataFrame is cached, open the Storage tab of the Spark UI and check the memory details. You can also call df.unpersist() or sqlContext.uncacheTable("sparktable") (spark.catalog.uncacheTable("sparktable") on Spark 2.0 and later) to remove a DataFrame or table from memory explicitly.

If my dataset size exceeds the available RAM, where will the data be stored?

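If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the behavior of the default MEMORY_ONLY storage level; a level such as MEMORY_AND_DISK writes the partitions that do not fit to disk instead of recomputing them.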

If we say the RDD is already in RAM, meaning it is in memory, why is persist() needed? (asked in a comment)

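To answer your question: an RDD that is in memory only because an action just computed it is not guaranteed to stay there. When a later action is triggered and cannot find enough memory, Spark can remove RDDs that were not cached or persisted.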

In general, we persist RDDs that need a lot of computation and/or shuffling (by default Spark already persists some intermediate shuffle data to avoid costly network I/O). Any action on a persisted RDD then performs only that action instead of recomputing the RDD from the start of its lineage graph. See the RDD persistence levels in the Spark programming guide.
