I am using GCC 4.8 and glibc 2.19 on an x86_64 Linux.
While playing with different input methods for a different question, I compared fscanf
and sscanf
. Specifically, I would either use fscanf
on the standard input directly:
char s[128]; int n;
while (fscanf(stdin, "%127s %d", s, &n) == 2) { }
Or I would first read the entire input into a buffer and then traverse the buffer with sscanf
. (Reading everything into the buffer takes a tiny amount of time.)
char s[128]; int n;
char const * p = my_data;
for (int b; sscanf(p, "%127s %d%n", s, &n, &b) == 2; p += b) { }
To my surprise, the fscanf
version is vastly faster. For example, processing seveal tens of thousands of lines with fscanf
takes this long:
10000 0.003927487 seconds time elapsed
20000 0.006860206 seconds time elapsed
30000 0.007933329 seconds time elapsed
40000 0.012881912 seconds time elapsed
50000 0.013516816 seconds time elapsed
60000 0.015670432 seconds time elapsed
70000 0.017393129 seconds time elapsed
80000 0.019837480 seconds time elapsed
90000 0.023925753 seconds time elapsed
Now the same with sscanf
:
10000 0.035864643 seconds time elapsed
20000 0.127150772 seconds time elapsed
30000 0.319828373 seconds time elapsed
40000 0.611551668 seconds time elapsed
50000 0.919187459 seconds time elapsed
60000 1.327831544 seconds time elapsed
70000 1.809843039 seconds time elapsed
80000 2.354809588 seconds time elapsed
90000 2.970678416 seconds time elapsed
I was using the Google perf tools to measure this. For example, for 50000 lines, the fscanf
code requires about 50M cycles, and the sscanf
code about 3300M cycles. So I broke down the top call sites with perf record
/perf report
. With fscanf
:
35.26% xf libc-2.19.so [.] _IO_vfscanf
23.91% xf [kernel.kallsyms] [k] 0xffffffff8104f45a
8.93% xf libc-2.19.so [.] _int_malloc
And with sscanf
:
98.22% xs libc-2.19.so [.] rawmemchr
0.68% xs libc-2.19.so [.] _IO_vfscanf
0.38% xs [kernel.kallsyms] [k] 0xffffffff8104f45a
So almost all of the time with sscanf
is spent in rawmemchr
! Why is this? How can the fscanf
code avoid this cost?
I tried searching for this, but the best I could come up with is this discussion of locked realloc
calls which I don't think applies here. I was also thinking that fscanf
has better memory locality (using the same buffer over and over), but that can't make such a big difference.
Does anyone have any insights in this strange discrepancy?
See Question&Answers more detail:
os