Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?
The store buffer is used to track stores, in order, both before they retire and after they retire but before they commit to the L1 cache². The store buffer is conceptually a purely local structure that doesn't really care about cache misses. It deals in "units" of individual stores of various sizes. Chips like Intel Skylake have store buffers of 50+ entries.
The line fill buffers primarily deal with loads and stores that miss in the L1 cache. Essentially, they are the path from the L1 cache to the rest of the memory subsystem, and they deal in cache-line-sized units. We don't expect the LFB to get involved if the load or store hits in the L1 cache¹. Intel chips like Skylake have many fewer LFB entries, probably 10 to 12.
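To make the difference in granularity concrete, here is a minimal sketch in Python. The names (`StoreBufferEntry`, `line_base`, `lines_touched`) and the 64-byte line size are assumptions chosen for illustration; this is not a description of the actual hardware structures.

```python
from dataclasses import dataclass

LINE_SIZE = 64  # assumed cache line size in bytes


@dataclass
class StoreBufferEntry:
    """One store buffer entry: a single store of some size (hypothetical layout)."""
    address: int  # byte address of the store
    size: int     # store width in bytes (1, 2, 4, 8, ...)
    data: bytes


def line_base(address: int) -> int:
    """Base address of the cache line that contains `address`."""
    return address & ~(LINE_SIZE - 1)


def lines_touched(stores):
    """Set of distinct cache lines that a sequence of stores writes to."""
    lines = set()
    for s in stores:
        first = line_base(s.address)
        last = line_base(s.address + s.size - 1)  # a store may straddle two lines
        lines.update(range(first, last + 1, LINE_SIZE))
    return lines
```

The point of the sketch is only that the store buffer needs one entry per store, while a miss is tracked per line: several stores to the same line occupy several store buffer entries but need at most one line-sized request.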
Is the ordering of events correct in my description?
Pretty close. Here's how I'd change your list:
1. A store instruction gets decoded and split into store-data and store-address uops, which are renamed and scheduled, and which have a store buffer entry allocated for them.
2. The store uops execute in any order or simultaneously (the two sub-items can execute in either order depending mostly on which has its dependencies satisfied first).
   1. The store data uop writes the store data into the store buffer.
   2. The store address uop does the V-P translation and writes the address(es) into the store buffer.
3. At some point when all older instructions have retired, the store instruction retires. This means that the instruction is no longer speculative and the results can be made visible. At this point, the store remains in the store buffer and is called a senior store.
4. The store now waits until it is at the head of the store buffer (it is the oldest uncommitted store), at which point it will commit (become globally observable) into the L1, if the associated cache line is present in the L1 in MESIF Modified or Exclusive state (i.e., this core owns the line).
5. If the line is not present in the required state (either missing entirely, i.e., a cache miss, or present but in a non-exclusive state), permission to modify the line, and sometimes the line data, must be obtained from the memory subsystem: this allocates an LFB for the entire line, if one is not already allocated. This is a so-called request for ownership (RFO), which means that the memory hierarchy should return the line in an exclusive state suitable for modification, as opposed to a shared state suitable only for reading (this invalidates copies of the line present in any other private caches).

   An RFO to convert Shared to Exclusive still has to wait for a response to make sure all other caches have invalidated their copies. The response to such an invalidate doesn't need to include a copy of the data because this cache already has one. It can still be called an RFO; the important part is gaining ownership before modifying a line.
6. In the miss scenario the LFB eventually comes back with the full contents of the line, which is committed to the L1, and the pending store can now commit³.
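The commit part of the list above (a senior store at the head of the store buffer either committing or triggering an RFO) can be sketched as a toy model. Everything here is a hypothetical simplification: the `ToyCore` class, its method names, the default of 10 LFBs, and the assumption that a store never crosses a line boundary are made up for illustration and say nothing about how real hardware is organized.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


def line_base(address):
    """Base address of the cache line that contains `address`."""
    return address & ~(LINE_SIZE - 1)


class ToyCore:
    """Toy model of senior stores committing to L1 (illustrative only)."""

    def __init__(self, lfb_capacity=10):
        self.l1_state = {}      # line base -> 'M', 'E' or 'S'; absent means not in L1
        self.memory = {}        # byte address -> value, stands in for L1 data
        self.store_buffer = []  # senior stores, oldest first: (address, data bytes)
        self.lfbs = set()       # line bases with an RFO in flight
        self.lfb_capacity = lfb_capacity

    def try_commit_head(self):
        """Try to commit the oldest senior store and report what happened."""
        if not self.store_buffer:
            return "empty"
        address, data = self.store_buffer[0]
        line = line_base(address)  # assumes the store does not cross a line
        if self.l1_state.get(line) in ("M", "E"):
            # This core owns the line: the store becomes globally observable.
            for i, byte in enumerate(data):
                self.memory[address + i] = byte
            self.l1_state[line] = "M"
            self.store_buffer.pop(0)
            return "committed"
        if line in self.lfbs:
            return "waiting"     # RFO already in flight for this line
        if len(self.lfbs) >= self.lfb_capacity:
            return "stalled"     # no free LFB to start the RFO
        self.lfbs.add(line)      # line missing or only Shared: request ownership
        return "rfo_issued"

    def rfo_response(self, line):
        """The memory hierarchy grants ownership of `line`."""
        self.lfbs.discard(line)
        self.l1_state[line] = "E"
```

In this model a store to a line in Shared state takes the same path as a full miss, which matches the description above: either way ownership has to be obtained before the store can commit.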
This is a rough approximation of the process. Some details may differ on some or all chips, including details which are not well understood.
As one example, in the above order, the store miss lines are not fetched until the store reaches the head of the store queue. In reality, the store subsystem may implement a type of RFO prefetch where the store queue is examined for upcoming stores and if the lines aren't present in L1, a request is started early (the actual visible commit to L1 still has to happen in order, on x86, or at least "as if" in order).
So the request and LFB use may occur as early as when step 3 completes (if RFO prefetch applies only after a store retires), or perhaps even as early as when 2.2 completes, if junior stores are subject to prefetch.
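A rough sketch of what such an RFO prefetch could look like as a policy, in Python. The function name `rfo_prefetch_candidates`, its parameters, and the oldest-first policy are all assumptions for illustration; how (or whether) a given chip does this is not documented.

```python
def rfo_prefetch_candidates(store_queue, owned_lines, outstanding, free_lfbs,
                            line_size=64):
    """Pick lines for which an early RFO could be started.

    store_queue: addresses of upcoming stores, oldest first
    owned_lines: line bases already held in L1 in M or E state
    outstanding: line bases that already have a request in flight
    free_lfbs:   number of fill buffers available for new requests
    """
    picked = []
    for address in store_queue:
        line = address & ~(line_size - 1)
        if line in owned_lines or line in outstanding or line in picked:
            continue  # nothing to request for this store
        if len(picked) == free_lfbs:
            break     # out of fill buffers
        picked.append(line)
    return picked
```

Note that this only decides which requests to start early. The commits themselves would still have to happen in order (or "as if" in order), as described above.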
As another example, step 6 describes the line coming back from the memory hierarchy and being committed to the L1, then the store commits. It is possible that the pending store is actually merged instead with the returning data and then that is written to L1. It is also possible that the store can leave the store buffer even in the miss case and simply wait in the LFB, freeing up some store buffer entries.
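The merge possibility can be illustrated with a few lines of Python. This is only a sketch of the idea of overlaying pending store bytes onto the returning line before a single write to L1; `merge_store_into_line` is a hypothetical helper and the 64-byte line size is an assumption.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


def merge_store_into_line(line_data: bytes, offset: int, store_data: bytes) -> bytes:
    """Overlay `store_data` at `offset` within a returning cache line."""
    if len(line_data) != LINE_SIZE:
        raise ValueError("line_data must be exactly one cache line")
    if offset < 0 or offset + len(store_data) > LINE_SIZE:
        raise ValueError("store must fit inside the line")
    end = offset + len(store_data)
    # Bytes written by the store win; all other bytes come from the fetched line.
    return line_data[:offset] + store_data + line_data[end:]
```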
¹ In the case of stores that hit in the L1 cache, there is a suggestion that the LFBs are actually involved: that each store actually enters a combining buffer (which may just be an LFB) prior to being committed to the cache, such that a series of stores targeting the same cache line get combined and only need to access the L1 once. This isn't proven, and in any case it is not really part of the main use of LFBs (as is obvious from the fact that we can't even really tell whether it is happening or not).
² The buffers that hold stores before and after retirement might be two entirely different structures, with different sizes and behaviors, but here we'll refer to them as one structure.
³ The described scenario involves the store that misses waiting at the head of the store buffer until the associated line returns. An alternative scenario is that the store data is written into the LFB used for the request, and the store buffer entry can be freed. This potentially allows some subsequent stores to be processed while the miss is in progress, subject to the strict x86 ordering requirements. This could increase store memory-level parallelism (MLP).