What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons because it has to create all of the "worker" processes that perform the actual work of the cluster. That involves starting new R sessions with the Rscript command, each of which executes the .slaveRSOCK function. That function creates a socket connection back to the master and then executes the slaveLoop function, where it eventually executes the tasks sent to it by the master. If anything goes wrong starting any of the worker processes (and trust me: a lot can go wrong), the master will hang while executing socketConnection, waiting for the worker to connect to it even though that worker may have died or never even been created successfully.
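Since every worker begins life as an Rscript process, one quick sanity check is to confirm that Rscript can start at all outside of makeCluster. The following is only a minimal sketch: it assumes Rscript lives in the bin directory of the running R installation, and the marker string 'worker-ok' is an arbitrary choice for illustration.

```r
# Locate the Rscript executable that belongs to the running R installation
# (assumption: a standard layout where it sits in R.home("bin")).
rscript <- file.path(R.home("bin"), "Rscript")

# Start a trivial non-interactive session, much as makeCluster would,
# and capture whatever it prints, including any startup errors.
out <- system2(rscript,
               c("-e", shQuote("cat('worker-ok')")),
               stdout = TRUE, stderr = TRUE)
print(out)
```

If the marker does not appear, or errors from your startup files show up in the output, the worker processes are likely failing for the same reason.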
For many failure scenarios, using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
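For the outfile approach, a minimal sketch might look like the following. The log location (a temporary file) and the message printed by the worker are placeholders chosen for illustration.

```r
library(parallel)

# Redirect the worker's stdout and stderr to a log file
# (hypothetical location: a temporary file).
log_file <- tempfile(fileext = ".txt")
cl <- makePSOCKcluster(1, outfile = log_file)

# Ask the worker to print something, then shut the cluster down.
clusterEvalQ(cl, cat("hello from worker\n"))
stopCluster(cl)

# Inspect whatever the worker wrote, including any startup errors.
log_lines <- readLines(log_file, warn = FALSE)
print(log_lines)
```

Passing outfile = "" instead leaves the worker output unredirected, which can be handy for workers on the local machine. Manual mode, described above, is the fallback when the log shows nothing.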
Here's an example:
> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.
To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:
$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:
> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()
Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang while trying to create the socket connection, in which case the title of your question will be appropriate.
For more information on this kind of problem, see my answer to a similar question.
UPDATE
Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:
- Check to see if anything in your .Rprofile only works in interactive mode
- On Windows, use the Rterm command rather than Rgui so that you're more likely to see error messages and output from using outfile=''.
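
For the first tip, one common way to keep a .Rprofile safe for worker processes is to guard interactive-only code with interactive(). This is only a sketch: the specific options shown are hypothetical stand-ins for whatever your own .Rprofile does.

```r
# Settings that every session needs, including non-interactive workers,
# stay at the top level (hypothetical example setting).
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Anything that needs a console, a prompt, or a display goes behind
# interactive(), so sessions started by Rscript skip it.
if (interactive()) {
  # Hypothetical interactive-only customization.
  options(prompt = "R> ")
  message("Interactive session started")
}
```

Workers started through Rscript are non-interactive, so they skip the guarded block entirely while still picking up the top-level settings.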