Consider the following example, which demonstrates that false sharing exists:
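using type = std::atomic<std::int64_t>;

// Both counters sit next to each other in one 128-byte aligned block.
struct alignas(128) shared_t
{
    type a;
    type b;
} sh;

// Each counter starts its own 128-byte aligned block.
struct not_shared_t
{
    alignas(128) type a;
    alignas(128) type b;
} not_sh;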
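The layout these two structs produce can be checked directly. The following is a minimal sketch, not part of the original program: same_block, block_size and check_layout are hypothetical names introduced here, and the 128-byte granularity is simply taken from the alignas(128) above rather than queried from the hardware.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

using type = std::atomic<std::int64_t>;

struct alignas(128) shared_t { type a; type b; } sh;
struct not_shared_t { alignas(128) type a; alignas(128) type b; } not_sh;

// Assumption: 128 bytes is the granularity of interest, matching alignas(128).
constexpr std::uintptr_t block_size = 128;

// Hypothetical helper: true if both addresses fall in the same
// 128-byte aligned block.
inline bool same_block(const void* p, const void* q)
{
    return reinterpret_cast<std::uintptr_t>(p) / block_size ==
           reinterpret_cast<std::uintptr_t>(q) / block_size;
}

// The size of a struct is rounded up to a multiple of its alignment.
static_assert(alignof(shared_t) == 128, "shared_t is 128-byte aligned");
static_assert(sizeof(shared_t) == 128, "a and b fit in one block");
static_assert(alignof(not_shared_t) == 128, "not_shared_t is 128-byte aligned");
static_assert(sizeof(not_shared_t) == 256, "a and b each get their own block");

// Hypothetical helper: runtime check on the actual objects.
inline void check_layout()
{
    assert(same_block(&sh.a, &sh.b));          // candidates for false sharing
    assert(!same_block(&not_sh.a, &not_sh.b)); // kept apart
}
```

Calling check_layout() from main before starting the threads would confirm that sh.a and sh.b share a block while not_sh.a and not_sh.b do not.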
One thread increments a in steps of 1, another thread increments b. The increments compile to lock xadd with MSVC, even though the result is unused.
For the structure where a and b are separated, the values accumulated in a few seconds are about ten times greater for not_shared_t than for shared_t.
So far this is the expected result: separate cache lines stay hot in L1d cache, the increment bottlenecks on lock xadd throughput, and false sharing is a performance disaster that ping-pongs the cache line between cores. (Editor's note: later MSVC versions use lock inc when optimization is enabled. This may widen the gap between the contended and uncontended cases.)
Now I replace using type = std::atomic<std::int64_t>; with plain std::int64_t. (The non-atomic increment compiles to inc QWORD PTR [rcx]. The atomic load in the loop happens to stop the compiler from just keeping the counter in a register until loop exit.)
The count reached for not_shared_t is still greater than for shared_t, but now by less than a factor of two.
| type is | variables are | a= | b= |
|---------------------------|---------------|-------------|-------------|
| std::atomic<std::int64_t> | shared | 59’052’951| 59’052’951|
| std::atomic<std::int64_t> | not_shared | 417’814’523| 416’544’755|
| std::int64_t | shared | 949’827’195| 917’110’420|
| std::int64_t | not_shared |1’440’054’733|1’439’309’339|
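To put numbers on the gap, the ratios can be computed from the a column of the table above. This is only a sketch of the arithmetic: speedup and check_ratios are hypothetical names, and the counts are copied from this one run, so other runs will give somewhat different ratios.

```cpp
#include <cassert>
#include <cstdint>

// Final counts for a, copied from the table above.
constexpr std::int64_t atomic_shared     = 59052951;
constexpr std::int64_t atomic_not_shared = 417814523;
constexpr std::int64_t plain_shared      = 949827195;
constexpr std::int64_t plain_not_shared  = 1440054733;

// Hypothetical helper: how many times more increments the separated
// layout achieved compared with the shared layout.
constexpr double speedup(std::int64_t separated, std::int64_t shared)
{
    return static_cast<double>(separated) / static_cast<double>(shared);
}

constexpr double atomic_speedup = speedup(atomic_not_shared, atomic_shared);
constexpr double plain_speedup  = speedup(plain_not_shared, plain_shared);

// Hypothetical helper: sanity-check the ratios implied by the table.
inline void check_ratios()
{
    assert(atomic_speedup > 7.0 && atomic_speedup < 7.1); // roughly 7x
    assert(plain_speedup > 1.5 && plain_speedup < 1.6);   // roughly 1.5x
}
```

For this particular run, separating the variables gains roughly a factor of 7 with atomics but only roughly a factor of 1.5 with plain integers.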
Why is the non-atomic case so much closer in performance?
Here is the rest of the program to complete the minimal reproducible example. (It is also on Godbolt with MSVC, ready to compile and run.)
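// Headers for the whole program; they belong above the declarations shown earlier.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

std::atomic<bool> start, stop;  // globals are zero-initialized, so both start as false

void thd(type* var)
{
    while (!start) ;          // spin until main releases all threads together
    while (!stop) (*var)++;   // the atomic load of stop keeps the counter in memory
}

int main()
{
    std::thread threads[] = {
        std::thread( thd, &sh.a ),     std::thread( thd, &sh.b ),
        std::thread( thd, &not_sh.a ), std::thread( thd, &not_sh.b ),
    };

    start.store(true);
    std::this_thread::sleep_for(std::chrono::seconds(2));
    stop.store(true);
    for (auto& t : threads) t.join();

    std::cout
        << "    shared: " << sh.a << ' ' << sh.b << '\n'
        << "not shared: " << not_sh.a << ' ' << not_sh.b << '\n';
}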