How and why can a cache decrease performance?
Let's use a simple example to demonstrate this:
// Some data
val df = spark.range(100)
df.join(df, Seq("id")).filter('id < 20).explain(true)
Here, Catalyst optimizes the join by pushing a filter down onto each DataFrame before the join, to reduce the amount of data that gets shuffled.
== Optimized Logical Plan ==
Project [id#0L]
+- Join Inner, (id#0L = id#69L)
   :- Filter (id#0L < 20)
   :  +- Range (0, 100, step=1, splits=Some(4))
   +- Filter (id#69L < 20)
      +- Range (0, 100, step=1, splits=Some(4))
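To see what this pushdown means concretely, the plan above is equivalent to writing the filters by hand on both sides before the join (a small sketch reusing the same toy DataFrame, purely for illustration; Catalyst performs this rewrite for you):
// Hand-written version of the pushdown: filter each side, then join
df.filter('id < 20).join(df.filter('id < 20), Seq("id")).explain(true)
The optimized plan of this variant should be essentially identical to the one Catalyst produced automatically.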
If we cache right after the join, the query won't be as well optimized, as we can see here:
df.join(df, Seq("id")).cache.filter('id < 20).explain(true)
== Optimized Logical Plan ==
Filter (id#0L < 20)
+- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      +- *Project [id#0L]
         +- *BroadcastHashJoin [id#0L], [id#74L], Inner, BuildRight
            :- *Range (0, 100, step=1, splits=4)
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
               +- *Range (0, 100, step=1, splits=4)
The filter is only applied at the very end...
Why? Because cache materializes the DataFrame (by default in memory, spilling to disk if needed, as the StorageLevel in the plan shows). Every subsequent query reuses this cached DataFrame, so Catalyst only optimizes the part of the query that comes AFTER the cache. We can check that with the same example!
df.join(df, Seq("id")).cache.join(df, Seq("id")).filter('id < 20).explain(true)
== Optimized Logical Plan ==
Project [id#0L]
+- Join Inner, (id#0L = id#92L)
   :- Filter (id#0L < 20)
   :  +- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   :        +- *Project [id#0L]
   :           +- *BroadcastHashJoin [id#0L], [id#74L], Inner, BuildRight
   :              :- *Range (0, 100, step=1, splits=4)
   :              +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :                 +- *Range (0, 100, step=1, splits=4)
   +- Filter (id#92L < 20)
      +- Range (0, 100, step=1, splits=Some(4))
The filter is pushed below the second join, but not below the first one, because the first join's result is cached.
How to avoid this?
By knowing what you are doing! You can simply compare Catalyst plans and see which optimizations Spark is missing.
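For instance, you can inspect plans programmatically, and, when the filter is known before you decide to cache, apply it before the cache so the pushdown happens while Catalyst can still rewrite the whole plan. A sketch reusing the toy DataFrame from above (the val name small is just illustrative):
// Inspect the optimized logical plan programmatically instead of reading explain() output
println(df.join(df, Seq("id")).filter('id < 20).queryExecution.optimizedPlan)

// Workaround sketch: filter first, then cache the already-reduced result
val small = df.join(df, Seq("id")).filter('id < 20).cache
small.explain(true)
Comparing this explain(true) output with the cached-then-filtered version above makes the missing optimization obvious.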