POSIX Clocks
I wrote a benchmark for POSIX clock sources. Each entry lists the call, its resolution in parentheses, and the cost of one call:
- time (s) => 3 cycles
- ftime (ms) => 54 cycles
- gettimeofday (us) => 42 cycles
- clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
- clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
- clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
- clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
- clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
- clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
- clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
- clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
- rdtsc (cycles) => 24 cycles
These numbers are from an Intel Core i7-4771 CPU @ 3.50GHz running Linux 4.0. Each clock method was run thousands of times, timed with the TSC register, and the minimum cost was recorded.
You'll want to test on the machines you intend to run on, though, as how these clocks are implemented varies with hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting; the helper for that is in the same repo (tsc.h).
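The measurement approach described above (time each call with the cycle counter, repeat many times, keep the minimum) can be sketched roughly as follows. This is not the benchmark from the repo; `read_cycles` and `min_clock_cost` are hypothetical names, the x86 path assumes the `__rdtsc` compiler intrinsic from `x86intrin.h`, and the non-x86 fallback exists only so the sketch compiles elsewhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Read a cycle counter. On x86 this is the TSC. Elsewhere we fall back to
 * CLOCK_MONOTONIC nanoseconds (an assumption of this sketch, so the unit is
 * then nanoseconds rather than cycles). */
static inline uint64_t read_cycles(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Minimum observed cost of one clock_gettime(id) call over `runs` attempts.
 * The result includes the overhead of reading the counter itself. */
static uint64_t min_clock_cost(clockid_t id, int runs)
{
    uint64_t best = UINT64_MAX;
    struct timespec ts;

    for (int i = 0; i < runs; i++) {
        uint64_t start = read_cycles();
        clock_gettime(id, &ts);
        uint64_t end = read_cycles();
        if (end - start < best)
            best = end - start;
    }
    return best;
}
```

Calling `min_clock_cost(CLOCK_MONOTONIC, 10000)` returns the cheapest run seen. Taking the minimum rather than the mean filters out runs disturbed by interrupts or cache misses. The sketch does not serialize the instruction stream around the counter reads, so treat the results as approximate.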
TSC
Accessing the TSC (the processor's time-stamp counter) is the most accurate way to time things and, for that precision, the cheapest. Generally, this is what the kernel itself uses. It's also quite straightforward on modern Intel chips, as the TSC is synchronized across cores and unaffected by frequency scaling, so it provides a simple, global time source. You can see an example of using it here, with a walkthrough of the assembly code here.
The main issue with this (other than portability) is that there doesn't seem to be a good way to convert cycles to nanoseconds. As far as I can find, the Intel docs state that the TSC runs at a fixed frequency, but that this frequency may differ from the processor's stated frequency. Intel doesn't appear to provide a reliable way to determine the TSC frequency. The Linux kernel appears to solve this by measuring how many TSC cycles elapse over an interval timed by another hardware timer (see here).
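The same calibration idea can be approximated from user space: count TSC ticks across an interval measured by a clock whose units are known, then divide. The sketch below is not what the kernel does; it assumes `CLOCK_MONOTONIC` as the reference clock, uses hypothetical helper names, and on non-x86 builds substitutes the reference clock for the TSC purely so that it compiles.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Reference clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* TSC on x86; the reference clock elsewhere (compile-only fallback). */
static uint64_t read_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return now_ns();
#endif
}

/* Estimate TSC ticks per nanosecond by busy-waiting for at least
 * interval_ns (must be > 0) and comparing the two counters. */
static double estimate_ticks_per_ns(uint64_t interval_ns)
{
    uint64_t t0 = now_ns();
    uint64_t c0 = read_tsc();
    uint64_t t1;

    do {
        t1 = now_ns();
    } while (t1 - t0 < interval_ns);

    uint64_t c1 = read_tsc();
    return (double)(c1 - c0) / (double)(t1 - t0);
}

/* Convert a tick count to nanoseconds using a previously estimated ratio. */
static uint64_t ticks_to_ns(uint64_t ticks, double ticks_per_ns)
{
    return (uint64_t)((double)ticks / ticks_per_ns);
}
```

A longer interval gives a steadier estimate, since the fixed cost of the two clock reads matters less. On a 3.5 GHz part you would expect a ratio somewhere near 3.5, but as noted above the TSC frequency may differ from the advertised one, so measure rather than assume.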
Memcached
Memcached goes to the trouble of using the cache method, keeping a cached copy of the current time rather than querying a clock on every request. It may simply be to make performance more predictable across platforms, or to scale better with multiple cores. It may also not be a worthwhile optimization.
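For illustration, a cached clock can be as small as one atomic variable that a periodic timer refreshes and that everything else reads. This is a minimal sketch and not Memcached's actual code; `cached_clock_tick` and `cached_time` are hypothetical names, and it assumes one-second resolution is good enough for the readers.

```c
#include <stdatomic.h>
#include <time.h>

/* Last value of time(NULL) seen by the refresher. */
static atomic_long cached_now;

/* Refresh the cached time. Intended to be called from a periodic timer
 * (for example once per second from an event loop). */
static void cached_clock_tick(void)
{
    atomic_store_explicit(&cached_now, (long)time(NULL),
                          memory_order_relaxed);
}

/* Cheap read used on hot paths: a single atomic load, no clock call. */
static inline time_t cached_time(void)
{
    return (time_t)atomic_load_explicit(&cached_now, memory_order_relaxed);
}
```

The trade-off is staleness: a reader may see a value up to one refresh period old. Given that `time` and the coarse clocks cost only a handful of cycles in the table above, the benefit on Linux is likely small, which fits the doubt expressed here about whether the optimization is worthwhile.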