I'm not sure how much, if any, you can get the compiler to do for you without manually vectorizing with intrinsics or a C++ vector-class wrapper (like Agner Fog's VCL, if your project's license is compatible with the GPL). There are some non-GPLed wrappers, too.
Cache-blocking a matrix multiply is a fine art (and will be important here), and it would be really nice if you could use Eigen's existing templates but with a different class that uses bitwise AND on integers instead of multiply on floats. I'm not sure if this is possible.
I did some searching, and most of the literature about binary matrices is about producing a boolean result (including SO questions like this). A vector inner-product is done with AND as the multiply, but with XOR or OR as the addition, not popcount. Maybe there's a search-term I'm missing that describes "normal" matrices that just happen to be (0,1) matrices, but where the product won't be.
Since every millisecond matters, you're probably going to have to manually vectorize this.
It's not that vector-integer stuff is slow in general, it's just vector-integer multiply that's slow, especially compared to vector-float FMA on recent x86 hardware (especially Intel, which has FP FMA throughput of 2x 256b vectors per clock on Haswell and later).
Since you don't need an actual multiply with boolean elements, just an AND (3 vectors per clock throughput), that's not a problem for you. The efficiency-gain from doing many more elements per vector should more than make up for any extra cost per-vector.
Of course, this assumes an integer matmul implementation using all the same cache-blocking and other optimizations as an equivalent FP matmul, and that's where the problem lies if you don't want to (or don't know how to) write it yourself, and can't find a library that will do it for you.
I'm just answering the question of how efficient it could be, with an optimal implementation. The answer to the title question is a very definite yes, it's a massive waste of time to use actual multiplication, especially with 32-bit elements.
Storage format options:
one element (0/1) per byte:
- 4x the density of `float` (cache footprint / memory bandwidth / elements per vector)
- easy to transpose with byte shuffles
- vertical ADD is easy, in case that matters (e.g. for vectorizing over an outer loop, working on multiple rows or multiple columns at once). Can be good (avoiding horizontal sums at the end) if you have your data interleaved in a way that makes this work without extra shuffling. See the accumulation sketch right after this list.
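A minimal sketch of what that vertical accumulation looks like with AVX2 intrinsics (the function and pointer names are just placeholders, not from your code; each byte lane keeps its own running count, good for up to 255 iterations before a byte could overflow):

```c++
#include <immintrin.h>
#include <stdint.h>

// Vertical accumulate of AND results over 0/1 bytes: each byte lane of `acc`
// counts matches for a different output element, so no horizontal sum is
// needed inside this loop body.  Safe for up to 255 iterations per accumulator.
static inline __m256i and_accumulate_bytes(__m256i acc,
                                           const uint8_t *rowA,
                                           const uint8_t *rowB)
{
    __m256i a = _mm256_loadu_si256((const __m256i *)rowA);
    __m256i b = _mm256_loadu_si256((const __m256i *)rowB);
    return _mm256_add_epi8(acc, _mm256_and_si256(a, b));   // vpand + vpaddb
}
```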
4 elements per byte, packed into the low nibble:
- 4x the density of separate bytes
- very efficient to popcount with AVX2 `vpshufb`. With inputs hot in L1D cache, you could load/AND/accumulate-a-popcount with a throughput of 128 AND-result elements per clock cycle (per core), in theory. 4 fused-domain uops per clock saturates the SKL/HSW front-end issue bandwidth of 4 per clock, and doesn't bottleneck on the 3 vector ALU ports, because one of the uops is a pure load. (The other load micro-fuses with the `vpand`.) When bottlenecked on L2 bandwidth (~one 32B load per cycle), it runs at 64 elements per clock. See below.
- slower to create from integer or packed-bitmap (but not bad if you put bits into vectors in an interleaved order for efficient pack/unpack to in-order bytes, rather than forcing bits to be in order). See the packing sketch after this list.
- hard to transpose (maybe worse than fully packed)
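As a rough idea of the creation cost, here's a sketch of packing four vectors of 0/1 bytes into this format with shifts/ORs (the order is interleaved: element i of each of the four input vectors lands in the same output byte; the helper name is a placeholder):

```c++
#include <immintrin.h>

// Pack four vectors of 0/1 bytes into one vector with 4 elements per byte
// (all in the low nibble).  Only bit 0 is set in each source byte, so shifts
// of 1..3 within 32-bit lanes can't carry into a neighbouring byte.
static inline __m256i pack4_low_nibble(__m256i v0, __m256i v1,
                                       __m256i v2, __m256i v3)
{
    __m256i b01 = _mm256_or_si256(v0, _mm256_slli_epi32(v1, 1));
    __m256i b23 = _mm256_or_si256(_mm256_slli_epi32(v2, 2),
                                  _mm256_slli_epi32(v3, 3));
    return _mm256_or_si256(b01, b23);
}
```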
packed bits:
- 8x the density of separate bytes, 256 elements per AVX2 vector.
- can be created from vectors with `pmovmskb` for a non-interleaved storage order. (Not very useful for creation on the fly, though, since that puts the result in an integer reg, not a vector. An interleaved bit-order is probably best, esp. for unpacking during a transpose.) See the sketch after this list.
- fairly efficient to popcount with AVX2: mask / shift+mask / 2x `vpshufb`. (9 fused-domain uops (8 vector-ALU uops) to AND + accumulate popcount for 256 elements (from 2 row/column vectors), vs. 8 uops (6 vector-ALU uops) for the 4-per-byte strategy (from 4 row/column vectors).) ALU port bottlenecks limit this to 96 elements per clock from L1D or L2. So this has about 1.5x the inner-product throughput of the pack4 strategy when it bottlenecks on L2 bandwidth, or 3/4 the throughput for data hot in L1D, in theory, counting just the inner loop. This is just the inner-product part, not accounting for different pack/unpack costs.
- hard to transpose (but maybe not horrible with `pmovmskb` to extract 1 bit from each byte and make them contiguous).
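For the `pmovmskb` creation route, something like this (assuming the source is 0/1 bytes; the result lands in an integer register, which is part of why this isn't great for on-the-fly creation):

```c++
#include <immintrin.h>
#include <stdint.h>

// Pack 32 bytes of 0/1 elements down to 32 packed bits.  vpmovmskb takes the
// top bit of each byte, so move the 0/1 up to bit 7 first (a shift by 7 within
// 32-bit lanes can't carry bit 0 across a byte boundary).
static inline uint32_t pack_bits_movemask(__m256i v01)
{
    return (uint32_t)_mm256_movemask_epi8(_mm256_slli_epi32(v01, 7));
}
```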
6 elements per byte, `0xxx0xxx` (probably no advantages for this problem on HSW/SKL, but interesting to consider):
- 6x the density of separate bytes
- fairly easy to create from 0/1 bytes in an interleaved way, by shifting/ORing, same as the 4 bits per byte format.
- optimized for efficient popcount with AVX2 `vpshufb`. No need to mask before 2x `vpshufb`, just 1 right-shift. (`vpshufb` zeros the output byte if the high bit of the index byte is set, otherwise it uses the low nibble as an index; that's why the fully-packed format needs the masking.) Right-shifting this format by 4 (`vpsrld ymm0, 4`) still leaves a zero in the high bit of every byte. Load + AND -> accumulate popcount is 7 fused-domain uops per vector (`vmovdqa` / `vpand ymm,[mem]` / `vpsrld ymm,4` / 2x `vpshufb` / 2x `vpaddb`), only 6 of which need ALU ports. So HSW/SKL throughput is in theory 1 vector (of 192 elements) per 2 clocks, or 96 elements per clock. This requires an average load throughput of one 256b vector per clock, so it's right up against the L2 bandwidth bottleneck.
In theory it's the same as fully packed, but in practice it might be slightly faster or slower depending on which one schedules better (fewer AND/ADD uops stealing port 5 from shuffles, for example). Fully packed is probably more likely to come close to theoretical speed, because more of its uops can run on multiple ports, so out-of-order scheduling imperfections are less likely. (There's an intrinsics sketch of this no-mask popcount after the list.)
- The `pmovmskb` transpose trick doesn't work cleanly.
- Could be useful if we just needed `popcount(A[])` instead of `popcount(A[] & B[])`. Or for a different microarchitecture where ALU vs. load throughput was different.
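Here's roughly what that no-mask popcount step looks like in intrinsics (my sketch, untested; the nibble LUT only ever gets indices 0-7 here, and accumulator-overflow handling / unrolling is left out):

```c++
#include <immintrin.h>

// Popcount-accumulate one vector of AND results in the 0xxx0xxx format.
// No masking: the high bit of every byte is zero both before and after the
// 4-bit right shift, so vpshufb can use the bytes as LUT indices directly.
static inline __m256i popcount_0xxx0xxx(__m256i acc, __m256i and_result)
{
    const __m256i lut = _mm256_setr_epi8(      // popcount of each nibble value
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    __m256i hi = _mm256_srli_epi32(and_result, 4);                     // vpsrld ymm,4
    acc = _mm256_add_epi8(acc, _mm256_shuffle_epi8(lut, and_result));  // low nibble
    acc = _mm256_add_epi8(acc, _mm256_shuffle_epi8(lut, hi));          // high nibble
    return acc;
}
```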
Another variation on this: 7 elements per byte could be popcounted with a single AVX512VBMI (Cannonlake?) `vpermi2b` (`_mm512_permutex2var_epi8`), where each index byte selects one of 128 bytes from the concatenation of two other registers. A shuffle that wide will probably be slow, but it will hopefully have better throughput than an AVX512 `vpshufb` separate-nibble thing.
To count packed-8 with AVX512VBMI (but without AVX512VPOPCNTDQ), you could maybe use `vpermi2b` to count the low 7 bits, then shift+mask the top bit and just add it (popcount of a single bit = that bit).
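If I have the semantics right, a speculative sketch of that 7-bit-LUT idea (untested; AVX512VBMI only, and `__builtin_popcount` in the one-time LUT setup is a GCC/clang-ism):

```c++
#include <immintrin.h>
#include <stdint.h>

// Speculative: per-byte popcount of values 0..127 (high bit clear) with one
// vpermi2b, indexing a 128-entry LUT split across two ZMM registers.
static inline __m512i popcount7_vpermi2b(__m512i v, __m512i lut_lo, __m512i lut_hi)
{
    // Each index byte's low 7 bits select one of the 128 LUT bytes.
    return _mm512_permutex2var_epi8(lut_lo, v, lut_hi);
}

// One-time setup: popcounts of 0..63 in lut_lo, of 64..127 in lut_hi.
static inline void make_popcount7_lut(__m512i *lut_lo, __m512i *lut_hi)
{
    alignas(64) uint8_t lut[128];
    for (int i = 0; i < 128; ++i)
        lut[i] = (uint8_t)__builtin_popcount((unsigned)i);
    *lut_lo = _mm512_load_si512(lut);
    *lut_hi = _mm512_load_si512(lut + 64);
}
```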
`uint8_t` elements are easier to shuffle efficiently (since there are byte shuffles like `vpshufb`), so may be worth considering if you have to transpose on the fly. Or only pack down to bits on the fly while transposing?
32-bit integers are also an option, but not a good option. Fewer elements per vector means fewer shuffle instructions in a transpose, but not by a factor of 4. The number of shuffles in a transpose may scale with something like log2(elements per vector).
This is also a big deal for cache footprint / memory bandwidth. The factor of 8 size difference can mean that doing a whole row or column only takes part of L1, instead of overflowing L1. So it can make cache-blocking easier / less important.
10k * 20k / 8 = 23.84 MiB per matrix, using packed-bit elements. That's much larger than L2 cache (256 kiB on Haswell, 1 MiB on Skylake-AVX512), but will fit in L3 on many-core Xeon CPUs. But L3 is competitively shared by all cores (including other VMs in a cloud environment), and is much slower than L2. (Many-core Xeons like the ones you'll be running on in HPC / cloud systems have lower per-core memory bandwidth than quad-core desktops, because of higher latency to L3 cache with no increase in concurrency (see the "latency-bound platforms" section of this answer). It takes more cores to drive the same amount of memory bandwidth on a Xeon, even though the total throughput is higher. But if you can have each core mostly working out of its private L2, you gain a LOT.)
Adding up the AND results: You've arranged your loops so you need to reduce a single run of booleans to a count of the non-zeros. This is a good thing.
With 8-bit integer 0/1 elements, you can do up to 255 `vpaddb` before an element could overflow. It has good throughput: 2 per clock on Haswell, 3 per clock on Skylake. With multiple accumulators, that covers a lot of vectors of AND results. Use `vpsadbw` against an all-zero vector to horizontally add the bytes in a vector into 64-bit integers. Then combine your accumulators with `vpaddq`, then horizontally sum it.
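In intrinsics, that reduction looks something like this (accumulator and helper names are placeholders):

```c++
#include <immintrin.h>
#include <stdint.h>

// Fold a byte accumulator into a 64-bit accumulator before the bytes overflow.
static inline __m256i fold_bytes_to_qwords(__m256i qword_acc, __m256i byte_acc)
{
    // vpsadbw against zero sums each group of 8 bytes into a 64-bit lane.
    __m256i sums = _mm256_sad_epu8(byte_acc, _mm256_setzero_si256());
    return _mm256_add_epi64(qword_acc, sums);                // vpaddq
}

// Final horizontal sum of the four qwords, once at the very end.
static inline uint64_t hsum_qwords(__m256i qword_acc)
{
    __m128i lo = _mm256_castsi256_si128(qword_acc);
    __m128i hi = _mm256_extracti128_si256(qword_acc, 1);
    __m128i s  = _mm_add_epi64(lo, hi);
    return (uint64_t)_mm_cvtsi128_si64(s) + (uint64_t)_mm_extract_epi64(s, 1);
}
```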
With packed bits, you just want to popcount the vectors of AND results. With AVX2 and your data already in vectors, you definitely want to use a `VPSHUFB`-based bit-slicing popcount. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)
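The usual intrinsics shape of that nibble-LUT popcount looks like this (my sketch of the same idea, not code copied from that page; it returns per-byte counts that you'd then accumulate with `vpaddb`):

```c++
#include <immintrin.h>

// Per-byte popcount of a vector of fully-packed bits (e.g. an AND result):
// mask the low nibbles, shift+mask the high nibbles, look both up with vpshufb.
static inline __m256i popcount_bytes_avx2(__m256i v)
{
    const __m256i lut = _mm256_setr_epi8(      // popcount of each nibble value
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
    return _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                           _mm256_shuffle_epi8(lut, hi));
}
```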
You could consider packing your data 4 bits per byte, in the low nibble. That would mean one `vpshufb` could count the bits in each byte of an AND result, without needing any shifting / masking. Inside the inner loop, you'd have 2 loads, `vpand`, `vpshufb`, `vpaddb`. With proper unrolling, that should keep up with L1D load bandwidth of 2x 32B per clock, and saturate all three vector execution ports (on Haswell or Skylake). Break out of that every 128 or 255 vectors or something to accumulate the bytes of your accumulator(s) with `vpsadbw` / `vpaddq`. (But with cache-blocking, you probably want to break out tha