Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
234 views
in Technique[技术] by (71.8m points)

python - Make Dask-Yarn More Robust to Node Failures

We're using Dask to distribute compute work across an EMR cluster. We're using Dask-Yarn. We've noticed that when we experience node failures sometimes those failures will take out the container running the Scheduler and our jobs fail. I was going to move the scheduler to run locally in the same process as the main python application to increase the robustness, but then I realized that the ApplicationMaster also runs in a YARN container and will also shut down the work if it's suddenly killed.

Am I missing something about the robustness of the Dask cluster here? Is there a way to have the ApplicationMaster restart if it's suddenly terminated, without causing the Dask cluster to end? Failing that, is it possible to have the Scheduler and the ApplicationMaster containers run on the same node, so that they'd both fail in the event that node was killed?

question from:https://stackoverflow.com/questions/65943851/make-dask-yarn-more-robust-to-node-failures

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)
Waitting for answers

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...