I thought you couldn't perform floating point operations in the Linux kernel
You can't safely: failure to use kernel_fpu_begin() / kernel_fpu_end() doesn't mean FPU instructions will fault (not on x86 at least). Instead it will silently corrupt user-space's FPU state. This is bad; don't do that.
The compiler doesn't know what kernel_fpu_begin() means, so it can't check / warn about code that compiles to FPU instructions outside of FPU-begin regions.
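For reference, a minimal sketch of the bracketing itself. kernel_fpu_begin() / kernel_fpu_end() are the real x86 kernel API (declared in asm/fpu/api.h); scale_sample and the user-space stand-ins under #else are hypothetical, added only so the fragment also compiles outside the kernel. Note that x86 kernel builds normally pass flags such as -mno-sse, so in practice FP code like this tends to live in a file built with its own compiler flags.

```c
#ifdef __KERNEL__
#include <asm/fpu/api.h>   /* x86: declares kernel_fpu_begin() / kernel_fpu_end() */
#else
/* Hypothetical user-space stand-ins so the sketch compiles outside the kernel. */
static int fpu_region_depth;
static void kernel_fpu_begin(void) { fpu_region_depth++; }
static void kernel_fpu_end(void)   { fpu_region_depth--; }
#endif

/* Hypothetical helper: scale a sample by 1.5 using FP math. */
static int scale_sample(int sample)
{
    int result;

    kernel_fpu_begin();            /* user FPU state is saved if needed; no sleeping from here */
    result = (int)(sample * 1.5);  /* FP work is meant to stay inside this region */
    kernel_fpu_end();              /* kernel is done with the FPU */

    return result;
}
```

Nothing here stops the compiler from emitting FP instructions elsewhere in the same file, which is exactly the problem described above: the pairing is a convention, not something the toolchain enforces.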
There may be a debug mode where the kernel does disable SSE, x87, and MMX instructions outside of kernel_fpu_begin / end regions, but that would be slower and isn't done by default.
It is possible, though: setting CR0::TS = 1 makes x87 instructions fault, so lazy FPU context switching is possible, and there are other bits for SSE and AVX.
There are many ways for buggy kernel code to cause serious problems. This is just one of many. In C, you pretty much always know when you're using floating point (unless a typo results in a 1. constant or something in a context that actually compiles).
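To make the typo case concrete, here is a small user-space sketch. The function names and the IS_DOUBLE macro are made up for illustration; the point is only that both versions compile cleanly and return the same int, while the second one does its arithmetic in double.

```c
/* Evaluates to 1 if the expression has type double, 0 otherwise (C11 _Generic). */
#define IS_DOUBLE(expr) _Generic((expr), double: 1, default: 0)

/* Pure integer arithmetic: no FP instructions needed. */
static int half_int(int x)
{
    return x * 1 / 2;
}

/* One stray '.' makes 1. a double constant, so x is converted to double,
 * the arithmetic happens in floating point, and the result is truncated
 * back to int on return. It still compiles without a diagnostic by default. */
static int half_typo(int x)
{
    return x * 1. / 2;
}
```

For example, half_int(7) and half_typo(7) both return 3, so testing alone wouldn't reveal the difference; IS_DOUBLE(7 * 1.) is 1 while IS_DOUBLE(7 * 1) is 0.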
Why is the FP architectural state different from integer?
Linux has to save/restore the integer state any time it enters/exits the kernel. All code needs to use integer registers (except for a giant straight-line block of FPU computation that ends with a jmp instead of a ret (ret modifies rsp).)
But kernel code avoids FPU generally, so Linux leaves the FPU state unsaved on entry from a system call, only saving before an actual context switch to a different user-space process or on kernel_fpu_begin. Otherwise, it's common to return to the same user-space process on the same core, so FPU state doesn't need to be restored because the kernel didn't touch it. (And this is where corruption would happen if a kernel task actually did modify the FPU state. I think this goes both ways: user-space could also corrupt your FPU state).
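A toy user-space model of that corruption, in case it helps. This is not how the kernel implements anything: struct toy_cpu and the toy_* functions are invented for illustration, with one double standing in for the whole FPU register file.

```c
#include <stdbool.h>

/* Toy model (not kernel code): one "hardware" FPU register shared by everyone. */
struct toy_cpu {
    double fpu_reg;         /* the live register; user space left its value here */
    double user_save_area;  /* where the user's value is parked while the kernel uses the FPU */
};

/* Stand-in for kernel_fpu_begin(): park the user's value before touching the register. */
static void toy_fpu_begin(struct toy_cpu *cpu)
{
    cpu->user_save_area = cpu->fpu_reg;
}

/* Stand-in for kernel_fpu_end(): put the user's value back. */
static void toy_fpu_end(struct toy_cpu *cpu)
{
    cpu->fpu_reg = cpu->user_save_area;
}

/* A "system call" that does FP work. Returns what user space finds in its
 * register after the call returns. */
static double toy_syscall(struct toy_cpu *cpu, bool bracketed)
{
    if (bracketed)
        toy_fpu_begin(cpu);

    cpu->fpu_reg = 42.0;    /* kernel FP math overwrites the live register */

    if (bracketed)
        toy_fpu_end(cpu);

    return cpu->fpu_reg;
}
```

With the register initially holding 1.5, a bracketed call returns 1.5 (user state intact) and an unbracketed call returns 42.0: no fault, just a silently wrong value in user space.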
The integer state is fairly small, only 16x 64-bit registers + RFLAGS and segment regs. FPU state is more than twice as large even without AVX: 8x 80-bit x87 registers, and 16x XMM or YMM, or 32x ZMM registers (+ MXCSR, and x87 status + control words). Also the MPX bnd0-3 registers are lumped in with "FPU". At this point "FPU state" just means all non-integer registers. On my Skylake, dmesg says: x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
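The arithmetic behind "more than twice as large" can be spelled out as follows. This counts raw register widths only; it ignores RFLAGS, segment registers, MXCSR, the x87 control/status words, and any padding or headers a real XSAVE image adds, so the totals won't match the dmesg context size. The enum names are made up for this sketch.

```c
/* Raw register-file sizes in bytes (register count x register width). */
enum {
    GPR_BYTES = 16 * 8,    /* 16 x 64-bit integer registers  = 128 */
    X87_BYTES = 8 * 10,    /* 8 x 80-bit x87 registers       = 80  */
    XMM_BYTES = 16 * 16,   /* 16 x 128-bit XMM registers     = 256 */
    YMM_BYTES = 16 * 32,   /* 16 x 256-bit YMM registers     = 512 */
    ZMM_BYTES = 32 * 64    /* 32 x 512-bit ZMM registers     = 2048 */
};

/* Even without AVX, x87 + XMM (336 bytes) is more than 2x the integer registers (128 bytes). */
_Static_assert(X87_BYTES + XMM_BYTES > 2 * GPR_BYTES,
               "x87 + SSE state should exceed twice the integer register state");
```

With AVX the x87 + YMM total is 592 bytes, and with AVX-512 the x87 + ZMM total is 2128 bytes, before counting the AVX-512 mask registers.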
See Understanding FPU usage in linux kernel; modern Linux doesn't do lazy FPU save/restore by default on context switches (only for kernel/user transitions). (But that article explains what Lazy is.)
Most processes use SSE for copying/zeroing small blocks of memory in compiler-generated code, and most library string/memcpy/memset implementations use SSE/SSE2. Also, hardware supported optimized save/restore is a thing now (xsaveopt / xrstor), so "eager" FPU save/restore may actually do less work if some/all FP registers haven't actually been used. e.g. save just the low 128b of YMM registers if they were zeroed with vzeroupper so the CPU knows they're clean. (And mark that fact with just one bit in the save format.)
With "eager" context switching, FPU instructions stay enabled all the time, so bad kernel code can corrupt them at any time.