To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max caps the total number of cores (in effect, threads) your application may use across the cluster. If you leave spark.task.cpus = 1 (the default), then up to spark.cores.max Spark tasks will run concurrently.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each task spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". For example, with spark.cores.max=10 and spark.task.cpus=2, Spark will run at most 10/2 = 5 tasks concurrently. Given that each task needs (say) 2 threads internally, the total number of executing threads will never exceed 10. This means you never go above your initial contract (defined by spark.cores.max).
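The arithmetic above can be sketched in a few lines of plain Python. This is only an illustration of the reasoning, not Spark code: the function names max_concurrent_tasks and max_busy_threads are made up for this example, and the integer division is a cluster-wide upper bound that assumes cores are not stranded by how they happen to be split across executors.

```python
def max_concurrent_tasks(cores_max: int, task_cpus: int = 1) -> int:
    """Upper bound on Spark tasks running at once (hypothetical helper).

    cores_max stands in for spark.cores.max, task_cpus for spark.task.cpus.
    """
    if cores_max < 0 or task_cpus < 1:
        raise ValueError("cores_max must be >= 0 and task_cpus must be >= 1")
    # Each running task reserves task_cpus cores, so leftover cores stay idle.
    return cores_max // task_cpus


def max_busy_threads(cores_max: int, task_cpus: int, threads_per_task: int) -> int:
    """Worst-case number of threads doing work at the same time.

    threads_per_task is what each task really uses internally; Spark does
    not know or enforce this number, it only counts reserved cores.
    """
    return max_concurrent_tasks(cores_max, task_cpus) * threads_per_task


# Tasks use 2 threads each and spark.task.cpus says so: within the contract.
honest = max_busy_threads(cores_max=10, task_cpus=2, threads_per_task=2)

# Same tasks, but spark.task.cpus left at 1: twice the threads you asked for.
oversubscribed = max_busy_threads(cores_max=10, task_cpus=1, threads_per_task=2)
```

With the settings from the answer, the first case stays at 10 busy threads, while leaving spark.task.cpus at 1 for the same two-threaded tasks would let up to 20 threads compete for 10 cores.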