Everything to understand about lineage is in the definition of RDD
.
So let's review that :
RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure
So there is mainly 2 things to understand:
Unfortunately, these topics are quite long to discuss in a single answer. I recommend you take some time reading them along with this following article about Data Lineage.
And now to answer your question and doubts:
If an executor fails computing your data, after 15 minutes, it will go back to your last checkpoint, whether it's from the source or cache in memory and/or on disk.
Thus, it will not save you those 15 minutes that you have mentioned!
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…