With cache(), you use only the default storage level:
- MEMORY_ONLY for RDD
- MEMORY_AND_DISK for Dataset
With persist(), you can specify which storage level you want for both RDD and Dataset.
From the official docs:
- You can mark an RDD to be persisted using the persist() or cache() methods on it.
- Each persisted RDD can be stored using a different storage level.
- The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
Use persist() if you want to assign a storage level other than MEMORY_ONLY to an RDD, or other than MEMORY_AND_DISK to a Dataset.
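As a rough illustration of the difference, here is a minimal Scala sketch. It assumes a local Spark 2.1+ session; the object name, app name, master setting, and sample data are made up for the example. It only checks which storage level each call assigns, using RDD.getStorageLevel and Dataset.storageLevel.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("cache-vs-persist")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sc = spark.sparkContext

    // RDD + cache(): always the default level, MEMORY_ONLY.
    val cachedRdd = sc.parallelize(1 to 100).cache()
    println(cachedRdd.getStorageLevel == StorageLevel.MEMORY_ONLY)

    // RDD + persist(level): any level you choose.
    val persistedRdd = sc.parallelize(1 to 100)
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(persistedRdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK_SER)

    // Dataset + cache(): always the default level, MEMORY_AND_DISK.
    val cachedDs = Seq(1, 2, 3).toDS().cache()
    println(cachedDs.storageLevel == StorageLevel.MEMORY_AND_DISK)

    // Dataset + persist(level): any level you choose.
    // Different data from cachedDs, so the two are cached as separate plans.
    val persistedDs = Seq("a", "b", "c").toDS()
      .persist(StorageLevel.MEMORY_ONLY)
    println(persistedDs.storageLevel == StorageLevel.MEMORY_ONLY)

    spark.stop()
  }
}
```

Both calls are lazy: the data is only stored the first time an action is run on the RDD or Dataset, but the storage level is recorded as soon as cache() or persist() is called, which is why it can be inspected here without running an action.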
A useful section of the official documentation: "Which Storage Level to Choose?"