performance - Why does my code run slower when I remove bounds checks?

Question

Welcome To Ask or Share your Answers For Others

performance - Why does my code run slower when I remove bounds checks?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

performance - Why does my code run slower when I remove bounds checks?

I'm writing a linear algebra library in Rust.

I have a function to get a reference to a matrix cell at a given row and column. This function starts with a pair of assertions that the row and column are within bounds:

#[inline(always)]
pub fn get(&self, row: usize, col: usize) -> &T {
    assert!(col < self.num_cols.as_nat());
    assert!(row < self.num_rows.as_nat());
    unsafe {
        self.get_unchecked(row, col)
    }
}

In tight loops, I thought it might be faster to skip the bounds check, so I provide a get_unchecked method:

#[inline(always)]
pub unsafe fn get_unchecked(&self, row: usize, col: usize) -> &T {
    self.data.get_unchecked(self.row_col_index(row, col))
}

The bizarre thing is, when I use these methods to implement matrix multiplication (via row and column iterators), my benchmarks show that it actually goes about 33% faster when I check the bounds. Why is this happening?

I've tried this on two different computers, one running Linux and the other OSX, and both show the effect.

The full code is on github. The relevant file is lib.rs. Functions of interest are:

get at line 68
get_unchecked at line 81
next at line 551
mul at line 796
matrix_mul (benchmark) at line 1038

Note that I'm using type-level numbers to parameterise my matrices (with the option for dynamic sizes too via dummy tagged types), so the benchmark is multiplying two 100x100 matrices.

UPDATE:

I've significantly simplified the code, removing stuff not directly used in the benchmark and removing generic parameters. I also wrote an implementation of multiplication without using iterators, and that version doesn't cause the same effect. See here for this version of the code. Cloning the minimal-performance branch and running cargo bench will benchmark the two different implementations of multiplication (note that the assertions are commented out to start with in that branch).

Also of note is that if I change the get* functions to return copies of the data instead of references (f64 instead of &f64), the effect disappears (but the code is slightly slower all round).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:25:06+0000

It's not a complete answer because I haven't tested my claims, but this might explain it. Either ways, the only way to know for sure is to generate the LLVM IR and the assembler output. If you need a manual for LLVM IR, you can find it here: http://llvm.org/docs/LangRef.html .

Anyways, enough about that. Let's say you have this code:

#[inline(always)]
pub unsafe fn get_unchecked(&self, row: usize, col: usize) -> &T {
    self.data.get_unchecked(self.row_col_index(row, col))
}

The compiler here changes this into an indirect load, which will probably be optimized in a tight loop. It's interesting to note that each load has a possibility to go wrong: if your data isn't available, it'll trigger an out-of-bounds.

In the case with the bounds check combined with the tight loop, LLVM does a little trick. Because the load is in a tight loop (a matrix multiplication) and because the result of the bounds check depends on the bounds of the loop, it will remove the bounds check from the loop and put it around the loop. In other words, the loop itself will remain exactly the same, but with an extra bounds check.

In other words, the code is exactly the same, with some minor differences.

So what changed? Two things:

If we have the additional bounds check, there's no possibility anymore for an out-of-bounds load. This might trigger an optimization that wasn't possible before. Still, considering how these checks are usually implemented, this wouldn't be my guess.
Another thing to consider is that the word 'unsafe' might trigger some behavior, like an additional condition, pin data or disable the GC, etc. I'm not sure about this exact behavior in Rust; the only way to find out these details is to look at the LLVM IR.

Categories

performance - Why does my code run slower when I remove bounds checks?

performance - Why does my code run slower when I remove bounds checks?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags