We're using Dask with Dask-Yarn to distribute compute work across an EMR cluster. We've noticed that when we experience node failures, those failures sometimes take out the container running the Scheduler, and our jobs fail. I was going to move the Scheduler to run locally in the same process as the main Python application to improve robustness, but then I realized that the ApplicationMaster also runs in a YARN container, and the work will also shut down if that container is suddenly killed.
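For reference, this is the change I was planning: Dask-Yarn's `YarnCluster` accepts `deploy_mode="local"`, which runs the scheduler inside the client process instead of in its own YARN container. A minimal sketch (the `environment.tar.gz` path and worker sizes are placeholders for our actual setup):

```python
from dask.distributed import Client
from dask_yarn import YarnCluster

# deploy_mode="local" runs the scheduler in this driver process rather
# than in a YARN container, so losing a cluster node can no longer kill
# the scheduler. The ApplicationMaster still runs in a container, though,
# which is the remaining single point of failure described above.
cluster = YarnCluster(
    environment="environment.tar.gz",  # placeholder: our packaged conda env
    worker_vcores=2,
    worker_memory="4GiB",
    deploy_mode="local",
)
cluster.scale(4)  # placeholder: request 4 worker containers

client = Client(cluster)
```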
Am I missing something about the robustness of the Dask cluster here? Is there a way to have the ApplicationMaster restart if it's suddenly terminated, without ending the Dask cluster? Failing that, is it possible to run the Scheduler and ApplicationMaster containers on the same node, so that they'd only fail together, if that one node was killed?
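On the restart question, YARN itself can relaunch a failed ApplicationMaster up to an application-level retry limit (capped cluster-wide by `yarn.resourcemanager.am.max-attempts`), and with Dask-Yarn that limit appears to be settable through a custom skein specification's `max_attempts` field. What I don't know is whether a relaunched attempt rejoins a locally running scheduler cleanly or just starts fresh containers. A hedged sketch of where that knob lives, with placeholder values:

```python
import skein
from dask_yarn import YarnCluster

# max_attempts lets the ResourceManager relaunch the application
# (including the ApplicationMaster) after a failure, subject to the
# cluster-wide yarn.resourcemanager.am.max-attempts cap. With no
# dask.scheduler service defined, the scheduler runs locally in this
# process, matching the plan above. All values are placeholders.
spec = skein.ApplicationSpec.from_yaml("""
name: dask
queue: default
max_attempts: 3

services:
  dask.worker:
    instances: 4
    resources:
      vcores: 2
      memory: 4 GiB
    files:
      environment: environment.tar.gz
    script: |
      source environment/bin/activate
      dask-yarn services worker
""")

cluster = YarnCluster.from_specification(spec)
```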
question from:
https://stackoverflow.com/questions/65943851/make-dask-yarn-more-robust-to-node-failures