hadoop - Spark on yarn concept understanding


asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)


I am trying to understand how Spark runs on YARN in cluster and client mode. I have the following questions.

Is it necessary for Spark to be installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes in the cluster execute tasks and should be able to decode the code (Spark APIs) in the Spark application that the driver sends to the cluster.
The documentation says "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node need Hadoop installed when it is only sending the job to the cluster?


1 Answer

深蓝 · Answer 1 · 2021-10-17T00:08:51+0000

Adding to other answers.

Is it necessary that spark is installed on all the nodes in the yarn cluster?

No, not if the Spark job is scheduled on YARN (in either client or cluster mode). Spark needs to be installed on the cluster nodes only for standalone mode.
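As a rough illustration of why the YARN nodes do not need their own Spark installation, here is a minimal sketch that builds a `spark-submit` command line for YARN. The `--master`, `--deploy-mode`, `--class` and `--conf` options and the `spark.yarn.archive` property are real Spark settings; the helper name `build_yarn_submit`, the jar name, the class name and the HDFS path are made up for the example. If `spark.yarn.archive` (or `spark.yarn.jars`) is not set, Spark uploads the jars from the submitting machine's installation instead.

```python
import shlex


def build_yarn_submit(app_jar, main_class, deploy_mode="cluster", spark_archive=None):
    """Return the argv list for a spark-submit call against YARN.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")

    # With --master yarn, the ResourceManager is located via the Hadoop
    # configuration, so no host:port appears on the command line.
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]

    if spark_archive is not None:
        # Optional: point at an archive of Spark jars already stored on HDFS,
        # so the jars are not uploaded from the local installation each time.
        cmd += ["--conf", f"spark.yarn.archive={spark_archive}"]

    cmd += ["--class", main_class, app_jar]
    return cmd


if __name__ == "__main__":
    # All names and paths below are placeholders.
    cmd = build_yarn_submit(
        "my-app.jar",
        "com.example.MyApp",
        spark_archive="hdfs:///apps/spark/spark-libs.zip",
    )
    print(shlex.join(cmd))
```

Either way, the Spark runtime reaches the YARN containers through YARN's own file distribution, which is why only the submitting machine needs Spark.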

These diagrams visualize the Spark application deployment modes.

Spark Standalone Cluster

[Figure: Spark standalone mode]

In cluster mode the driver runs on one of the Spark worker nodes, whereas in client mode it runs on the machine that launched the job.

YARN cluster mode

YARN client mode

This table offers a concise list of differences between these modes:

[Table image: differences among Standalone, YARN Cluster and YARN Client modes] (pics source)

It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to cluster?

A full Hadoop installation is not mandatory, but some of the configuration files are. Client machines set up this way can be called gateway nodes. The configuration is needed for two main reasons:

The configuration contained in the HADOOP_CONF_DIR directory is distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode the ResourceManager's address is picked up from the Hadoop configuration (typically yarn-site.xml, which overrides the defaults in yarn-default.xml). That is why the --master parameter is simply yarn.
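To make the second point concrete, here is a small sketch of how a client could look up the ResourceManager address from the directory that HADOOP_CONF_DIR or YARN_CONF_DIR points to. It is not what Spark does internally (Spark uses the Hadoop client libraries for this); it only shows that the information lives in the configuration files. The function name `resource_manager_address` is made up, and the sketch assumes the address is set through the `yarn.resourcemanager.address` property. Clusters that only set `yarn.resourcemanager.hostname`, or that use ResourceManager HA, would need more handling.

```python
import os
import xml.etree.ElementTree as ET


def resource_manager_address(conf_dir=None):
    """Return yarn.resourcemanager.address from yarn-site.xml, or None.

    Hypothetical helper for illustration only.
    """
    # Fall back to the same environment variables the Spark docs mention.
    conf_dir = (
        conf_dir
        or os.environ.get("HADOOP_CONF_DIR")
        or os.environ.get("YARN_CONF_DIR")
    )
    if not conf_dir:
        return None

    path = os.path.join(conf_dir, "yarn-site.xml")
    if not os.path.isfile(path):
        return None

    # Hadoop config files are a flat list of <property><name/><value/></property>.
    root = ET.parse(path).getroot()
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name == "yarn.resourcemanager.address":
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


if __name__ == "__main__":
    print(resource_manager_address())
```

If this prints None on a gateway node, the client-side configuration is probably missing or the environment variables are not set, which is the situation the documentation's warning is about.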

Update: (2017-01-04)

Spark 2.0+ no longer requires a fat assembly jar for production deployment. source
