So far I have run Spark only on Linux machines and VMs (bridged networking), but now I am interested in utilizing more computers as slaves. It would be handy to distribute a Spark slave Docker container onto computers and have them automatically connect to a hard-coded Spark master IP. This sort of works already, but I am having trouble configuring the right SPARK_LOCAL_IP (or the --host parameter for start-slave.sh) on the slave containers.
I think I correctly configured the SPARK_PUBLIC_DNS env variable to match the host machine's network-accessible IP (from the 10.0.x.x address space); at least it is shown on the Spark master web UI and is accessible by all machines.
I have also set SPARK_WORKER_OPTS and the Docker port forwards as instructed at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html, but in my case the Spark master is running on another machine and not inside Docker. I am launching Spark jobs from another machine within the network, possibly also running a slave itself.
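For reference, this is roughly how a slave container is launched; the image name, paths, ports and 10.0.x.x addresses below are placeholders rather than my exact values:

    # Run the worker in the foreground inside the container; 10.0.0.10 stands for
    # the master's IP and 10.0.0.42 for the slave host's network-accessible IP.
    docker run -d --name spark-slave \
      -e SPARK_PUBLIC_DNS=10.0.0.42 \
      -p 8081:8081 -p 8888:8888 \
      spark-slave \
      /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
        --port 8888 --webui-port 8081 \
        spark://10.0.0.10:7077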
Things that I've tried (rough commands for each attempt are sketched after the list):
- Not configuring SPARK_LOCAL_IP at all: the slave binds to the container's IP (something like 172.17.0.45) and cannot be connected to from the master or driver; computation still works most of the time, but not always
- Binding to 0.0.0.0: the slaves talk to the master and establish some connection, but it dies, another slave shows up and goes away, and they keep looping like this
- Binding to the host's IP: the start fails, as that IP is not visible within the container, although it would be reachable by the others since port forwarding is configured
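In terms of commands, the three attempts look roughly like this (the addresses are again placeholders for my actual ones):

    # 1) SPARK_LOCAL_IP unset: the worker binds to the container IP, e.g. 172.17.0.45
    start-slave.sh spark://10.0.0.10:7077

    # 2) Bind to all interfaces: workers keep registering and then dropping off
    SPARK_LOCAL_IP=0.0.0.0 start-slave.sh spark://10.0.0.10:7077

    # 3) Bind to the host's IP: startup fails, the address does not exist inside the container
    SPARK_LOCAL_IP=10.0.0.42 start-slave.sh spark://10.0.0.10:7077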
I wonder why the configured SPARK_PUBLIC_DNS isn't being used when connecting to the slaves. I thought SPARK_LOCAL_IP would only affect local binding and would not be revealed to external computers.
At https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html they instruct to "set SPARK_LOCAL_IP to a cluster-addressable hostname for the driver, master, and worker processes"; is this the only option? I would like to avoid the extra DNS configuration and just use IPs to configure the traffic between computers. Or is there an easy way to achieve this?
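As I understand it, that recommendation would amount to something like the following on each worker, where spark-worker-1 is a hypothetical name that every machine in the cluster would have to resolve:

    # Make the name resolvable on every machine (or set it up in real DNS)
    echo "10.0.0.42 spark-worker-1" >> /etc/hosts
    # Then, in spark-env.sh on that worker:
    export SPARK_LOCAL_IP=spark-worker-1

which is exactly the extra naming setup I was hoping to skip by using plain IPs.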
Edit:
To summarize the current set-up:
- Master is running on Linux (a VirtualBox VM on a Windows host, bridged networking)
- The driver submits jobs from another Windows machine; this works great
- The Docker image for starting up slaves is distributed as a "saved" .tar.gz file, loaded (curl xyz | gunzip | docker load) and started on other machines within the network, and has this problem with private/public IP configuration (rough commands are sketched below)
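For completeness, the distribution step looks roughly like this; the URL and image name are placeholders for wherever the tarball is actually hosted:

    # On the machine that built the image:
    docker save spark-slave | gzip > spark-slave.tar.gz
    # On each slave machine:
    curl http://some-host/spark-slave.tar.gz | gunzip | docker load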