I have an application that reads large files from a server and hangs frequently on a particular machine. It has worked successfully under RHEL5.2 for a long time. We have recently upgraded to RHEL6.1 and it now hangs regularly.
I have created a test app that reproduces the problem. It hangs approximately 98 times out of 100.
    #include <errno.h>
    #include <stdint.h>   /* uint32_t */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/param.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    int mFD = 0;

    /* Connect to localhost:60000 and store the socket in mFD. */
    void open_socket()
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_family = AF_INET;
        if (getaddrinfo("localhost", "60000", &hints, &res) != 0)
        {
            fprintf(stderr, "Exit %d\n", __LINE__);
            exit(1);
        }
        mFD = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (mFD == -1)
        {
            fprintf(stderr, "Exit %d\n", __LINE__);
            exit(1);
        }
        if (connect(mFD, res->ai_addr, res->ai_addrlen) < 0)
        {
            fprintf(stderr, "Exit %d\n", __LINE__);
            exit(1);
        }
        freeaddrinfo(res);
    }

    /* Read `size` bytes from mFD into `data`, stopping early if the peer closes. */
    void read_message(int size, void* data)
    {
        int bytesLeft = size;
        int numRd = 0;
        while (bytesLeft != 0)
        {
            fprintf(stderr, "reading %d bytes\n", bytesLeft);
            /* Replacing MSG_WAITALL with 0 works fine */
            int num = recv(mFD, data, bytesLeft, MSG_WAITALL);
            if (num == 0)
            {
                break;
            }
            else if (num < 0 && errno != EINTR)
            {
                fprintf(stderr, "Exit %d\n", __LINE__);
                exit(1);
            }
            else if (num > 0)
            {
                numRd += num;
                /* Arithmetic on void* is a GNU extension; advance via char*. */
                data = (char *)data + num;
                bytesLeft -= num;
                fprintf(stderr, "read %d bytes - remaining = %d\n", num, bytesLeft);
            }
        }
        fprintf(stderr, "read total of %d bytes\n", numRd);
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
        {
            fprintf(stderr, "usage: %s <bytes>\n", argv[0]);
            return 1;
        }
        open_socket();
        uint32_t raw_len = atoi(argv[1]);
        char raw[raw_len];
        read_message(raw_len, raw);
        return 0;
    }

Some notes from my testing:
- If "localhost" maps to the loopback address 127.0.0.1, the app hangs on the call to recv() and never returns.
- If "localhost" maps to the IP address of the machine, so that the packets are routed via the Ethernet interface, the app completes successfully.
- When I experience a hang, the server sends a "TCP Window Full" message, and the client responds with a "TCP ZeroWindow" message (see image and attached tcpdump capture). From this point, it hangs forever, with the server sending keep-alives and the client sending ZeroWindow messages. The client never seems to expand its window, which would allow the transfer to complete.
- During the hang, if I examine the output of "netstat -a", there is data in the server's send queue but the client's receive queue is empty.
- If I remove the MSG_WAITALL flag from the recv() call, the app completes successfully.
- The hanging issue only arises when using the loopback interface on one particular machine. I suspect this may all be related to timing dependencies.
- As I reduce the size of the 'file', the likelihood of the hang occurring is reduced.
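For reference, the variant that completes successfully for me (flags set to 0 instead of MSG_WAITALL) can be sketched as a standalone helper. This is only a minimal sketch of the workaround, not an explanation of the hang; the name read_fully and its return convention are hypothetical and do not appear in the test app.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not part of the original test app).
 * Reads up to `size` bytes from `fd` without MSG_WAITALL, so each recv()
 * returns as soon as some data is available and the loop collects the rest.
 * Returns the number of bytes read (less than `size` only if the peer
 * closed the connection), or -1 on error with errno set. */
ssize_t read_fully(int fd, void *data, size_t size)
{
    char *p = data;
    size_t left = size;

    while (left > 0) {
        ssize_t n = recv(fd, p, left, 0);
        if (n == 0)
            break;              /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    return (ssize_t)(size - left);
}
```

In the test app this would stand in for the body of read_message(), called as read_fully(mFD, raw, raw_len).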
The source for the test app can be found here:
Socket test source
The tcpdump capture from the loopback interface can be found here:
tcpdump capture
I reproduce the issue by issuing the following commands:
> gcc socket_test.c -o socket_test
> perl -e 'for (1..6000000){ print "a" }' | nc -l 60000
> ./socket_test 6000000
This sends 6,000,000 bytes to the test app, which tries to read all of the data using a single call to recv().
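One further thing that could be checked while debugging is how the size of that single read compares with the socket's receive buffer. The sketch below queries the buffer size with getsockopt(SO_RCVBUF). The helper name recv_buffer_size is hypothetical, and whether the buffer size has anything to do with the hang is an assumption I have not confirmed.

```c
#include <sys/socket.h>

/* Hypothetical debugging helper (not part of the original test app).
 * Returns the kernel's receive buffer size for `fd` in bytes as reported
 * by SO_RCVBUF, or -1 if getsockopt() fails. */
int recv_buffer_size(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) < 0)
        return -1;
    return size;
}
```

It could be called right after connect() in open_socket(), for example fprintf(stderr, "rcvbuf = %d\n", recv_buffer_size(mFD)), to log the value next to the requested read size.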
I would love to hear any suggestions on what I might be doing wrong or any further ways to debug the issue.