I've set up a distributed Hadoop environment within VirtualBox: 4 virtual Ubuntu 11.10 installations, one acting as the master node, the other three as slaves. I followed this tutorial to get the single-node version up and running and then converted to the fully-distributed version. It was working just fine when I was running 11.04; however, when I upgraded to 11.10, it broke. Now all my slaves' logs show the following error message, repeated ad nauseam:
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 0 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 1 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 2 time(s).
And so on. I've found other instances of this error message on the Internet (and StackOverflow), but none of the solutions have worked: I tried changing the core-site.xml and mapred-site.xml entries to use the IP address rather than the hostname; I quadruple-checked /etc/hosts on all slaves and the master; and the master can SSH password-less into all slaves. I even tried reverting each slave back to a single-node setup, and each one works fine in that case (on that note, the master always works fine as both a Datanode and the Namenode).
The only symptom I've found that seems to give a lead is that, from any of the slaves, when I attempt telnet 192.168.1.10 54310, I get Connection refused, suggesting there is some rule blocking access (which must have gone into effect when I upgraded to 11.10).
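For anyone who wants to repeat the telnet check from a script, here is a minimal sketch in Python. The function name probe_port is hypothetical, not part of Hadoop or any tool mentioned here. It separates a refused connection from a timeout, which can help narrow things down: a refusal generally means the host answered but nothing accepted the connection on that address and port (or something actively rejected it), while silently dropped packets tend to show up as a timeout.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome.

    Hypothetical helper for illustration only.
    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but nothing accepted the connection there.
        return "refused"
    except socket.timeout:
        # No reply within the timeout (for example, packets dropped).
        return "timeout"
    except OSError as exc:
        # Anything else: unreachable network, name resolution failure, etc.
        return f"error: {exc}"


# Example call from a slave (address and port taken from the log above):
# probe_port("192.168.1.10", 54310)
```

Running the same probe on the master itself against 127.0.0.1, 127.0.1.1, and 192.168.1.10 might show whether the port is reachable on only some of those addresses.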
My /etc/hosts.allow has not changed, however. I tried the rule ALL: 192.168.1., but it did not change the behavior.
Oh yes, and netstat on the master clearly shows that TCP ports 54310 and 54311 are listening.
Anyone have any suggestions to get the slave Datanodes to recognize the Namenode?
EDIT #1: In doing some poking around with nmap (see comments on this post), I'm thinking the issue is in my /etc/hosts files. This is what is listed for the master VM:
127.0.0.1 localhost
127.0.1.1 master
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 slave3
For each slave VM:
127.0.0.1 localhost
127.0.1.1 slaveX
192.168.1.10 master
192.168.1.1X slaveX
Unfortunately, I'm not sure what I changed, but the NameNode now always dies with an exception about trying to bind to a port "that's already in use" (127.0.1.1:54310). I'm clearly doing something wrong with the hostnames and IP addresses, but I'm really not sure what it is. Thoughts?
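To see why the bind address might come out as 127.0.1.1, here is a small Python sketch that reads hosts-file text and reports the first address listed for a name. It assumes the resolver returns the first matching entry, which is common but depends on the system's configuration; the helper name first_address_for is hypothetical and the sample text is the master's file as listed above.

```python
def first_address_for(hosts_text, name):
    """Return the first IP in hosts-file text whose entry lists `name`.

    Hypothetical helper; assumes first-match-wins lookup order.
    Returns None if the name does not appear.
    """
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if name in names:
            return ip
    return None


# The master's /etc/hosts as listed in this post.
MASTER_HOSTS = """\
127.0.0.1 localhost
127.0.1.1 master
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 slave3
"""

print(first_address_for(MASTER_HOSTS, "master"))  # prints 127.0.1.1
```

Under that assumption, "master" maps to the loopback-range address 127.0.1.1 before the 192.168.1.10 entry is ever reached, which would be consistent with the 127.0.1.1:54310 in the bind error.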