You are missing some crucial information: What compiler do you use? What do you use to measure the performance of the microbenchmark? What standard library implementation do you use?
My system:
g++ (GCC) 4.9.1 20140903 (prerelease)
GLIBCXX_3.4.20
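You can get both from the command line; the libstdc++ path below is just where it lives on my machine, yours may differ:

g++ --version
strings /usr/lib/libstdc++.so.6 | grep GLIBCXX

The highest GLIBCXX_3.4.x entry in that list is the one to report.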
Anyhow, I ran your examples, but reserved the proper size of the vectors first to get rid of the memory allocation overhead. With that, funnily, I observe the opposite of what you see - see the sketch and the numbers below.
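I don't have your exact source, so the following is only a minimal sketch of the kind of fill-and-sort loop I measured, with the reserve() call added up front; the element count and the use of rand() are my assumptions, not your code:

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

int main()
{
    // Assumed benchmark shape: fill a pre-sized vector, then sort it.
    // Assumed: tuple.cpp does the same with std::tuple<int, int>.
    const std::size_t n = 10000000;   // made-up element count

    std::vector<std::pair<int, int>> v;
    v.reserve(n);                     // the change: no reallocations while filling

    for (std::size_t i = 0; i < n; ++i)
        v.emplace_back(std::rand(), std::rand());

    std::sort(v.begin(), v.end());
}

Compiled and measured like this: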
g++ -std=c++11 -O2 pair.cpp -o pair
perf stat -r 10 -d ./pair
Performance counter stats for './pair' (10 runs):
1647.045151 task-clock:HG (msec) # 0.993 CPUs utilized ( +- 1.94% )
346 context-switches:HG # 0.210 K/sec ( +- 40.13% )
7 cpu-migrations:HG # 0.004 K/sec ( +- 22.01% )
182,978 page-faults:HG # 0.111 M/sec ( +- 0.04% )
3,394,685,602 cycles:HG # 2.061 GHz ( +- 2.24% ) [44.38%]
2,478,474,676 stalled-cycles-frontend:HG # 73.01% frontend cycles idle ( +- 1.24% ) [44.55%]
1,550,747,174 stalled-cycles-backend:HG # 45.68% backend cycles idle ( +- 1.60% ) [44.66%]
2,837,484,461 instructions:HG # 0.84 insns per cycle
# 0.87 stalled cycles per insn ( +- 4.86% ) [55.78%]
526,077,681 branches:HG # 319.407 M/sec ( +- 4.52% ) [55.82%]
829,623 branch-misses:HG # 0.16% of all branches ( +- 4.42% ) [55.74%]
594,396,822 L1-dcache-loads:HG # 360.887 M/sec ( +- 4.74% ) [55.59%]
20,842,113 L1-dcache-load-misses:HG # 3.51% of all L1-dcache hits ( +- 0.68% ) [55.46%]
5,474,166 LLC-loads:HG # 3.324 M/sec ( +- 1.81% ) [44.23%]
<not supported> LLC-load-misses:HG
1.658671368 seconds time elapsed ( +- 1.82% )
versus:
g++ -std=c++11 -O2 tuple.cpp -o tuple
perf stat -r 10 -d ./tuple
Performance counter stats for './tuple' (10 runs):
996.090514 task-clock:HG (msec) # 0.996 CPUs utilized ( +- 2.41% )
102 context-switches:HG # 0.102 K/sec ( +- 64.61% )
4 cpu-migrations:HG # 0.004 K/sec ( +- 32.24% )
181,701 page-faults:HG # 0.182 M/sec ( +- 0.06% )
2,052,505,223 cycles:HG # 2.061 GHz ( +- 2.22% ) [44.45%]
1,212,930,513 stalled-cycles-frontend:HG # 59.10% frontend cycles idle ( +- 2.94% ) [44.56%]
621,104,447 stalled-cycles-backend:HG # 30.26% backend cycles idle ( +- 3.48% ) [44.69%]
2,700,410,991 instructions:HG # 1.32 insns per cycle
# 0.45 stalled cycles per insn ( +- 1.66% ) [55.94%]
486,476,408 branches:HG # 488.386 M/sec ( +- 1.70% ) [55.96%]
959,651 branch-misses:HG # 0.20% of all branches ( +- 4.78% ) [55.82%]
547,000,119 L1-dcache-loads:HG # 549.147 M/sec ( +- 2.19% ) [55.67%]
21,540,926 L1-dcache-load-misses:HG # 3.94% of all L1-dcache hits ( +- 2.73% ) [55.43%]
5,751,650 LLC-loads:HG # 5.774 M/sec ( +- 3.60% ) [44.21%]
<not supported> LLC-load-misses:HG
1.000126894 seconds time elapsed ( +- 2.47% )
As you can see, in my case the reason is the much higher number of stalled cycles, both in the frontend and in the backend.
Now where does this come from? I bet it comes down to some failed inlining, similar to what is explained here: std::vector performance regression when enabling C++11
Indeed, enabling -flto equalizes the results for me:
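That is, the same invocation as above with only the -flto flag added (reproduced here for clarity):

g++ -std=c++11 -O2 -flto pair.cpp -o pair
g++ -std=c++11 -O2 -flto tuple.cpp -o tuple
perf stat -r 10 -d ./pair
perf stat -r 10 -d ./tuple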
Performance counter stats for './pair' (10 runs):
1021.922944 task-clock:HG (msec) # 0.997 CPUs utilized ( +- 1.15% )
63 context-switches:HG # 0.062 K/sec ( +- 77.23% )
5 cpu-migrations:HG # 0.005 K/sec ( +- 34.21% )
195,396 page-faults:HG # 0.191 M/sec ( +- 0.00% )
2,109,877,147 cycles:HG # 2.065 GHz ( +- 0.92% ) [44.33%]
1,098,031,078 stalled-cycles-frontend:HG # 52.04% frontend cycles idle ( +- 0.93% ) [44.46%]
701,553,535 stalled-cycles-backend:HG # 33.25% backend cycles idle ( +- 1.09% ) [44.68%]
3,288,420,630 instructions:HG # 1.56 insns per cycle
# 0.33 stalled cycles per insn ( +- 0.88% ) [55.89%]
672,941,736 branches:HG # 658.505 M/sec ( +- 0.80% ) [56.00%]
660,278 branch-misses:HG # 0.10% of all branches ( +- 2.05% ) [55.93%]
474,314,267 L1-dcache-loads:HG # 464.139 M/sec ( +- 1.32% ) [55.73%]
19,481,787 L1-dcache-load-misses:HG # 4.11% of all L1-dcache hits ( +- 0.80% ) [55.51%]
5,155,678 LLC-loads:HG # 5.045 M/sec ( +- 1.69% ) [44.21%]
<not supported> LLC-load-misses:HG
1.025083895 seconds time elapsed ( +- 1.03% )
and for tuple:
Performance counter stats for './tuple' (10 runs):
1018.980969 task-clock:HG (msec) # 0.999 CPUs utilized ( +- 0.47% )
8 context-switches:HG # 0.008 K/sec ( +- 29.74% )
3 cpu-migrations:HG # 0.003 K/sec ( +- 42.64% )
195,396 page-faults:HG # 0.192 M/sec ( +- 0.00% )
2,103,574,740 cycles:HG # 2.064 GHz ( +- 0.30% ) [44.28%]
1,088,827,212 stalled-cycles-frontend:HG # 51.76% frontend cycles idle ( +- 0.47% ) [44.56%]
697,438,071 stalled-cycles-backend:HG # 33.15% backend cycles idle ( +- 0.41% ) [44.76%]
3,305,631,646 instructions:HG # 1.57 insns per cycle
# 0.33 stalled cycles per insn ( +- 0.21% ) [55.94%]
675,175,757 branches:HG # 662.599 M/sec ( +- 0.16% ) [56.02%]
656,205 branch-misses:HG # 0.10% of all branches ( +- 0.98% ) [55.93%]
475,532,976 L1-dcache-loads:HG # 466.675 M/sec ( +- 0.13% ) [55.69%]
19,430,992 L1-dcache-load-misses:HG # 4.09% of all L1-dcache hits ( +- 0.20% ) [55.49%]
5,161,624 LLC-loads:HG # 5.065 M/sec ( +- 0.47% ) [44.14%]
<not supported> LLC-load-misses:HG
1.020225388 seconds time elapsed ( +- 0.48% )
So remember: -flto is your friend, and failed inlining can have extreme results on heavily templated code. Use perf stat to find out what's happening.
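If you want to dig further than the counters, a sampling profile shows where the cycles actually go; this is the generic perf workflow, not output from my run:

perf record -g ./pair
perf report

If the pair/tuple helpers (constructors, operator<, std::get) show up as separate symbols in the report instead of being folded into the sort, the calls were not inlined.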