In a really strong memory model, emitting fence instructions would be unnecessary. All memory accesses would execute in order and all stores would be globally visible.
Memory fences are needed because current common architectures do not provide a strong memory model: x86/x64, for example, can reorder a load ahead of an earlier store to a different location. (A more thorough source is the "Intel® 64 and IA-32 Architectures Software Developer’s Manual", section 8.2.2, "Memory Ordering in P6 and More Recent Processor Families".) As one example among many, Dekker's algorithm will fail on x86/x64 without fences.
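The Dekker failure boils down to a store followed by a load from a different location. Here is a minimal C# sketch of that pattern, assuming .NET 4.0 or later. The class and method names (StoreBufferLitmus, CountBothZero) are made up for illustration and do not come from any library.

```csharp
using System;
using System.Threading;

// Hypothetical litmus test for the store-then-load pattern at the heart of
// Dekker's algorithm. All names here are illustrative only.
public static class StoreBufferLitmus
{
    static int x, y, r1, r2;

    // Returns how many of `iterations` runs ended with both threads reading 0,
    // an outcome that sequential consistency forbids.
    public static int CountBothZero(int iterations, bool useFence)
    {
        int bothZero = 0;
        for (int i = 0; i < iterations; i++)
        {
            x = 0; y = 0; r1 = -1; r2 = -1;
            using (var start = new Barrier(2)) // lines the two threads up
            {
                var t1 = new Thread(() =>
                {
                    start.SignalAndWait();
                    x = 1;                                // store
                    if (useFence) Thread.MemoryBarrier(); // full fence
                    r1 = y;                               // load
                });
                var t2 = new Thread(() =>
                {
                    start.SignalAndWait();
                    y = 1;                                // store
                    if (useFence) Thread.MemoryBarrier(); // full fence
                    r2 = x;                               // load
                });
                t1.Start(); t2.Start();
                t1.Join(); t2.Join();
            }
            if (r1 == 0 && r2 == 0) bothZero++;
        }
        return bothZero;
    }
}
```

With useFence set to true the count should always be 0. With false, a non-zero count is possible on x86/x64, but whether it shows up in a short run depends on the hardware and on timing.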
Even if the JIT produces machine code in which instructions with memory loads and stores are carefully placed, its efforts are useless if the CPU then reorders these loads and stores - which it can, as long as the illusion of sequential consistency is maintained for the current context/thread.
Risking oversimplification: it may help to visualize the loads and stores resulting from the instruction stream as a thundering herd of wild animals.
As they cross a narrow bridge (your CPU), you can never be sure about the order of the animals, since some of them will be slower, some faster, some overtake, some fall behind.
If at the start - when you emit the machine code - you partition them into groups by putting infinitely long fences between them, you can at least be sure that group A comes before group B.
Fences ensure the ordering of reads and writes. Wording is not exact, but:
- a store fence "waits" for all outstanding store (write) operations to finish, but does not affect loads.
- a load fence "waits" for all outstanding load (read) operations to finish, but does not affect stores.
- a full fence "waits" for all store and load operations to finish. It has the effect that reads and writes before the fence will get executed before the writes and loads that are on the "other side of the fence" (come later than the fence).
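As a rough illustration of the "group A comes before group B" idea, here is a sketch that publishes a value from one thread to another using only full fences (Thread.MemoryBarrier), since that is the fence discussed here. FencedPublish, PublishAndConsume, payload and ready are hypothetical names chosen for the example.

```csharp
using System;
using System.Threading;

// Hypothetical example: hand a value to another thread, using full fences
// to keep the payload write ahead of the flag write (and the flag read
// ahead of the payload read).
public static class FencedPublish
{
    static int payload;
    static int ready;

    public static int PublishAndConsume(int value)
    {
        payload = 0;
        ready = 0;
        int observed = -1;

        var consumer = new Thread(() =>
        {
            // The fence inside the loop forces 'ready' to be re-read
            // on every iteration instead of being cached.
            while (ready == 0) Thread.MemoryBarrier();
            // Order the read of 'ready' before the read of 'payload'.
            Thread.MemoryBarrier();
            observed = payload;
        });
        consumer.Start();

        payload = value;        // group A: the data
        Thread.MemoryBarrier(); // nothing above may move below this point
        ready = 1;              // group B: the "data is there" flag

        consumer.Join();
        return observed;
    }
}
```

A full fence is stronger than this pattern strictly needs on the writer side (a store fence would do in the terms used above), but it keeps the example to a single primitive.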
What the JIT emits for a full fence depends on the (CPU) architecture and the memory ordering guarantees it provides.
Since the JIT knows exactly what architecture it runs on, it can issue the proper instruction(s).
On my x64 machine, with .NET 4.0 RC, it happens to be a lock or instruction:
int a = 0;
00000000 sub rsp,28h
Thread.MemoryBarrier();
00000004 lock or dword ptr [rsp],0
Console.WriteLine(a);
00000009 mov ecx,1
0000000e call FFFFFFFFEFB45AB0
00000013 nop
00000014 add rsp,28h
00000018 ret
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2:
"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)."
..."Locked operations are atomic with respect to all other memory operations and all
externally visible events. Only instruction fetch and page table accesses can pass
locked instructions. Locked instructions can be used to synchronize data written by
one processor and read by another processor."
Memory-ordering instructions address this specific need. MFENCE could have been used as a full barrier in the above case (at least in theory: for one thing, locked operations might be faster; for another, it might result in different behavior). MFENCE and its friends can be found in Chapter 8.2.5, "Strengthening or Weakening the Memory-Ordering Model".
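From C#, the usual way to reach locked operations is the Interlocked class. The sketch below assumes, without having inspected the generated code, that the JIT turns Interlocked.Increment into a lock-prefixed instruction on x86/x64. LockedCounter and CountWithInterlocked are names invented for the example.

```csharp
using System;
using System.Threading;

// Hypothetical example: several threads bump one counter through
// Interlocked.Increment, so no increment is lost.
public static class LockedCounter
{
    static int counter;

    public static int CountWithInterlocked(int threads, int incrementsPerThread)
    {
        counter = 0;
        var workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < incrementsPerThread; i++)
                    Interlocked.Increment(ref counter); // atomic read-modify-write
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();
        return counter;
    }
}
```

Replacing the Interlocked call with a plain counter++ would typically lose updates, which is the "data written by one processor and read by another" problem the manual refers to.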
There are some more ways to serialize stores and loads, though they are either impractical or slower than the above methods:
In chapter 8.3 you can find fully serializing instructions like CPUID. These serialize instruction flow as well: "Nothing can pass a serializing instruction and
a serializing instruction cannot pass any other instruction (read, write, instruction
fetch, or I/O)".
If you set up memory as strong uncacheable (UC), it will give you a strong memory model: no speculative or out-of-order accesses will be allowed and all accesses will appear on the bus, so there is no need to emit an instruction. :) Of course, this will be a tad slower than usual.
...
So it depends on the architecture. If there were a computer with strong ordering guarantees, the JIT would probably emit nothing.
IA64 and other architectures have their own memory models - and thus guarantees of memory ordering (or lack of them) - and their own instructions/ways to deal with memory store/load ordering.