Related to Understanding `_mm_prefetch`.
I understood that `_mm_prefetch()` causes the requested value to be fetched into the processor's cache, and that my code keeps executing while the prefetch is in flight.
However, my VS2017 profiler reports that 5.7% of the time is spent on the line that accesses my cache and 8.63% on the `_mm_prefetch` line. Is the profiler mistaken? If I have to wait for the data to be fetched anyway, what do I need the prefetch for? I could just as well wait in the next function call, when I actually need the data.
On the other hand, the overall timing shows a significant benefit from that prefetch call.
So the question is: is the data being fetched asynchronously?
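A minimal sketch of the usage pattern described above, assuming an x86 target where `<xmmintrin.h>` is available: the prefetch for a later key is issued before the work on the current key, so that any overlap has something to overlap with. The helper name `sum_lookups` and the `distance` parameter are hypothetical and are not part of the code being profiled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Hypothetical helper: sums table[key] for every key. When `distance` is
// non-zero, it also issues a prefetch hint for the key that will be read
// `distance` iterations later. Keys are assumed to be valid indices.
uint64_t sum_lookups(const std::vector<uint8_t>& table,
                     const std::vector<uint32_t>& keys,
                     std::size_t distance) {
    uint64_t sum = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        assert(keys[i] < table.size());
        if (distance != 0 && i + distance < keys.size()) {
            // Hint only: this call returns no data and does not change the result.
            const uint8_t* future = table.data() + keys[i + distance];
            _mm_prefetch(reinterpret_cast<const char*>(future), _MM_HINT_T0);
        }
        sum += table[keys[i]];  // demand load for the current key
    }
    return sum;
}
```

With `distance` set to 0 the function performs plain demand loads, so timing both variants on the same keys is one way to check whether the prefetch overlaps with other work. The returned sum is the same either way, because a prefetch is only a hint about a future access.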
Additional information.
I have multiple caches for various key widths, up to 32-bit keys (the ones I am currently profiling). The cache access and the prefetch are extracted into separate `__declspec(noinline)` functions to isolate them from the surrounding code.
#include <cstdint>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

uint8_t* cache[33];  // one cache per key width, up to 32 bits

__declspec(noinline)
uint8_t get_cached(uint8_t* address) {
    return *address;
}

__declspec(noinline)
void prefetch(uint8_t* pcache) {
    _mm_prefetch((const char*)pcache, _MM_HINT_T0);
}

int foo(const uint64_t seq64) {
    uint64_t key = seq64 & 0xFFFFFFFF;
    uint8_t* pcache = cache[32];
    int x = get_cached(pcache + key);  // demand load for the current key
    key = (key * 2) & 0xFFFFFFFF;
    pcache += key;
    prefetch(pcache);                  // hint for the line a later call will read
    // code that uses x
}
The profiler shows 7.22% for the `int x = get_cached(pcache + key);` line and 8.97% for the `prefetch(pcache);` line, while the surrounding code shows 0.40-0.45% per line.
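Since per-line percentages for functions this small may be hard to interpret, an end-to-end timing of the whole loop is one possible cross-check of the overall benefit mentioned earlier. The following is only a sketch: `make_keys` and `elapsed_ms` are hypothetical helpers, and the xorshift32 generator merely stands in for whatever produces the real key sequence.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: xorshift32 pseudo-random keys, each masked with `mask`,
// so that successive lookups tend to land on unrelated cache lines.
std::vector<uint32_t> make_keys(std::size_t count, uint32_t mask, uint32_t seed) {
    assert(seed != 0);  // xorshift32 stays at zero forever if seeded with zero
    std::vector<uint32_t> keys(count);
    uint32_t state = seed;
    for (auto& key : keys) {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        key = state & mask;
    }
    return keys;
}

// Hypothetical helper: wall-clock duration of one call to `work`, in milliseconds.
template <typename Work>
double elapsed_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

One way to use it is to run the lookup loop over the same keys twice, once with the `prefetch` call and once without, accumulating the results into a `volatile` variable so the compiler cannot discard the loop, and then compare the two durations.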
Question from: https://stackoverflow.com/questions/65852218/is-mm-prefetch-asynchronous-profiling-shows-a-lot-of-cycles-on-it