c++ - Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?

For multiplying large binary matrices (10Kx20K), what I usually do is convert the matrices to float and perform float matrix multiplication, since integer matrix multiplication is pretty slow (have a look here).

This time though, I need to perform hundreds of thousands of these multiplications, and even a millisecond of average performance improvement matters to me.


I want an int or float matrix as a result, because the product may have elements that aren't 0 or 1. The input matrix elements are all 0 or 1, so they can be stored as single bits.

In the inner-product between a row vector and a column vector (to produce one element of the output matrix), multiplication simplifies to bitwise AND. Addition is still addition, but we can add bits with a population-count function instead of looping over them individually.

Some other boolean/binary-matrix functions OR the bits instead of counting them, producing a bit-matrix result, but that's not what I need.
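For reference, here is a minimal scalar sketch of that AND + popcount inner product, assuming the rows/columns are already packed into 64-bit words (the function name and layout are illustrative, not part of the sample below):

#include <cstdint>
#include <cstddef>

// Hypothetical sketch: inner product of two bit-packed (0/1) vectors stored as
// 64-bit words. Multiplying 0/1 elements becomes AND; the sum becomes a popcount.
// __builtin_popcountll is the GCC/Clang builtin (std::popcount in C++20).
inline int dot_bits(const uint64_t* a, const uint64_t* b, std::size_t nwords) {
    int sum = 0;
    for (std::size_t w = 0; w < nwords; ++w)
        sum += __builtin_popcountll(a[w] & b[w]);
    return sum;
}

This is essentially what std::bitset's operator& plus count() do word-by-word in the sample below.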


Here is sample code showing that formulating the problem as std::bitset, AND, and count operations is faster than float matrix multiplication.

#include <iostream>
using std::cout; using std::endl;
#include <vector>
    using std::vector;
#include <chrono>
#include <Eigen/Dense>
    using Eigen::Map; using Eigen::Matrix; using Eigen::MatrixXf;
#include <random>
    using std::random_device; using std::mt19937; using std::uniform_int_distribution;
#include <bitset>
    using std::bitset;

#include <cmath>
using std::floor;

const int NROW = 1000;
const int NCOL = 20000;

const float DENSITY = 0.4;
const float DENOMINATOR = 10.0 - (10*DENSITY);

void fill_random(vector<float>& vec) {
    random_device rd;
    mt19937 eng(rd());
    uniform_int_distribution<> distr(0, 10);
    for (int i = 0; i < NROW*NCOL; ++i)
        vec.push_back(floor(distr(eng)/DENOMINATOR));
}

void matmul(vector<float>& vec){
    float *p = vec.data();
    MatrixXf A = Eigen::Map<Eigen::Matrix<float, NROW, NCOL, Eigen::RowMajor>>(p);
    cout << "Eigen matrix has " << A.rows() << " rows and " << A.cols() << " columns." << endl;
    cout << "Total non-zero values : " << A.sum() << endl;
    cout << "The density of non-zero values is " <<  A.sum() * 1.0 / (A.cols()*A.rows()) << endl;

    auto start = std::chrono::steady_clock::now();
    MatrixXf B = A.transpose() * A;
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "Mat mul took " << end << " ms"<< endl;

    // Just to make sure the operation is not skipped by compiler
    cout << "Eigen coo ";
    for (int i=0; i<10; ++i)
        cout << B(0,i) << " ";
    cout << endl;
}


void bitset_op(vector<float>& vec) {
    // yeah it's not a great idea to set size at compile time but have to
    vector<bitset<NROW>> col_major(NCOL);

    // right, multiple par for isn't a good idea, maybe in a parallel block
    // Doing this for simplicity to profile second loop timing 
    // converting row major float vec to col major bool vec
    #pragma omp parallel for
    for (int j=0; j < NCOL; ++j) {
        for (int i=0; i < NROW; ++i) {
            col_major[j].set(i, vec[i*NCOL + j] && 1);
        }
    }

    auto start = std::chrono::steady_clock::now();
    vector<int> coo;
    coo.assign(NCOL*NCOL, 0);
    #pragma omp parallel for
    for (int j=0; j < NCOL; ++j) {
        for (int k=0; k<NCOL; ++k) {
            coo[j*NCOL + k] = (col_major[j]&col_major[k]).count();
        }
    }
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "bitset intersection took " << end << " ms"<< endl;

    // Just to make sure the operation is not skipped by compiler
    cout << "biset coo ";
    for (int i=0; i<10; ++i)
        cout << coo[i] << " ";
    cout << endl;
}


int main() {
    // Saving to float instead of int to speed up matmul
    vector<float> vec;
    fill_random(vec);
    matmul(vec);
    bitset_op(vec);
}

Running this with:

g++ -O3 -fopenmp -march=native -I. -std=c++11 code.cpp -o code

I get:

Eigen matrix has 1000 rows and 20000 columns.
Total non-zero values : 9.08978e+06
The density of non-zero values is 0.454489
Mat mul took 1849 ms
Eigen coo 458 206 208 201 224 205 204 199 217 210
bitset intersection took 602 ms
bitset coo 458 206 208 201 224 205 204 199 217 210

As you can see, the matmul expressed as a set of bitset operations is about 3x faster than Eigen's float matmul, which makes sense.

I want to emphasize that I need to perform this operation over 100K times (on HPC or cloud machines), so even a millisecond of average performance improvement would make a difference.

I'm not bound to any specific library, C++ standard, etc. So please feel free to answer with any solution that you think is faster, other than ones using a GPU, as I can't use one for a number of reasons.


1 Answer


I'm not sure how much, if any, you can get the compiler to do for you without manually vectorizing with intrinsics or a C++ vector-class wrapper (like Agner Fog's VCL, if your project's license is compatible with the GPL). There are some non-GPLed wrappers, too.

Cache-blocking a matrix multiply is a fine art (and will be important here), and it would be really nice if you could use Eigen's existing templates but with a different class that uses bitwise and on integers, instead of multiply on floats. I'm not sure if this is possible.

I did some searching, and most of the literature about binary matrices is about producing a boolean result (including SO questions like this). A vector inner-product is done with AND as the multiply, but with XOR or OR as the addition, not popcount. Maybe there's a search-term I'm missing that describes "normal" matrices that just happen to be (0,1) matrices, but where the product won't be.

Since every millisecond matters, you're probably going to have to manually vectorize this.


It's not that vector-integer stuff is slow in general, it's just vector-integer multiply that's slow, especially compared to vector-float FMA on recent x86 hardware (especially Intel, which has FP FMA throughput of 2x 256b vectors per clock on Haswell and later).

Since you don't need an actual multiply with boolean elements, just an AND (3 vectors per clock throughput), that's not a problem for you. The efficiency-gain from doing many more elements per vector should more than make up for any extra cost per-vector.

Of course, this assumes an integer matmul implementation using all the same cache-blocking and other optimizations as an equivalent FP matmul, and that's where the problem lies if you don't want to (or don't know how to) write it yourself, and can't find a library that will do it for you.

I'm just answering the question of how efficient it could be, with an optimal implementation. The answer to the title question is a very definite yes: it's a massive waste of time to use actual multiplication, especially with 32-bit elements.


Storage format options:

one element (0/1) per byte:

  • 4x the density of float (cache footprint / memory bandwidth / elements per vector)
  • easy to transpose with byte shuffles
  • vertical ADD is easy, in case that matters (e.g. for vectorizing over an outer loop, and working on multiple rows or multiple columns at once. Can be good (avoiding horizontal sums at the end) if you have your data interleaved in a way that makes this work without extra shuffling.)

4 elements per byte, packed into the low nibble:

  • 4x the density of separate bytes
  • very efficient to popcount with AVX2 vpshufb. With inputs hot in L1D cache, you could load/AND/accumulate-a-popcount with a throughput of 128 AND-result elements per clock cycle (per core), in theory. 4 fused-domain uops per clock saturates SKL/HSW front-end issue bandwidth of 4 per clock, and doesn't bottleneck on the 3 vector ALU ports, because one of the uops is a pure load. (The other load micro-fuses with the vpand). When bottlenecked on L2 bandwidth (~one 32B load per cycle), runs at 64 elements per clock. See below.
  • slower to create from integer or packed-bitmap (but not bad if you put bits into vectors in an interleaved order for efficient pack/unpack to in-order bytes, rather than forcing bits to be in order).
  • hard to transpose (maybe worse than fully packed)

packed bits:

  • 8x the density of separate bytes, 256 elements per AVX2 vector.
  • can be created from vectors with pmovmskb for a non-interleaved storage order. (not very useful for creation on the fly, though, since that puts the result in an integer reg, not a vector. An interleaved bit-order is probably best, esp. for unpacking during a transpose).
  • fairly efficient to popcount with AVX2: mask / shift+mask / 2xvpshufb. (9 fused-domain uops (8 vector-ALU uops) to AND + accumulate popcount for 256 elements (from 2 row/column vectors), vs. 8 uops (6 vector-ALU uops) for the 4-per-byte strategy (from 4 row/column vectors).) ALU port bottlenecks limit this to 96 elements per clock from L1D or L2. So this has about 1.5x the inner-product throughput of the pack4 strategy when it bottlenecks on L2 bandwidth, or 3/4 the throughput for data hot in L1D, in theory, counting just the inner loop. This is just the inner-product part, not accounting for different pack/unpack costs.
  • hard to transpose (but maybe not horrible with pmovmskb to extract 1 bit from each byte and make them contiguous).

6 elements per byte, 0xxx0xxx (probably no advantages for this problem on HSW/SKL, but interesting to consider):

  • 6x the density of separate bytes
  • fairly easy to create from 0/1 bytes in an interleaved way, by shifting/ORing, same as the 4 bits per byte format.
  • optimized for efficient popcount with AVX2 vpshufb. No need to mask before 2xvpshufb, just 1 right-shift. (vpshufb zeros the byte if the high bit is set, otherwise it uses the low nibble as an index; this is why the fully packed format needs the masking.) Right shifting this format by 4 (vpsrld ymm0,4) will still leave a zero in the high bit of every byte. Load+AND -> accumulate popcount is 7 fused-domain uops per vector (vmovdqa/vpand ymm,[mem]/vpsrld ymm,4/2xvpshufb/2xvpaddb), only 6 of which need ALU ports. So HSW/SKL throughput is in theory 1 vector (of 192 elements) per 2 clocks, or 96 elements per clock. This requires an average load throughput of one 256b vector per clock, so it's right up against the L2 bandwidth bottleneck.

    In theory it's the same as fully packed, but in practice it might be slightly faster or slower depending on which one schedules better (fewer AND/ADD uops stealing port 5 from shuffles, for example). Fully packed is probably more likely to come close to theoretical speed, because more of its uops can run on multiple ports. Out-of-order scheduling imperfections are less likely.

  • The pmovmskb transpose trick doesn't work cleanly.
  • Could be useful if we just needed popcount(A[]) instead of popcount(A[] & B[]). Or for a different microarchitecture where ALU vs. load throughput was different.

Another variation on this: 7 elements per byte could be popcounted with a single AVX512VBMI (Cannonlake?) vpermi2b (_mm512_permutex2var_epi8), where each index byte selects one of 128 bytes from the concatenation of two other registers. A shuffle that wide will probably be slow, but it will hopefully have better throughput than an AVX512 vpshufb separate-nibble thing.

To count packed-8 with AVX512VBMI (but without AVX512VPOPCNTDQ), you could maybe use vpermi2b to count the low 7, then shift+mask the top bit and just add it. (popcount of a single bit = that bit).


uint8_t elements are easier to shuffle efficiently (since there are byte shuffles like vpshufb), so may be worth considering if you have to transpose on the fly. Or only pack down to bits on the fly while transposing?

32-bit integers are also an option, but not a good option. Fewer elements per vector means fewer shuffle instructions in a transpose, but not by a factor of 4. The number of shuffles in a transpose may scale with something like log2(elements per vector).

This is also a big deal for cache footprint / memory bandwidth. The factor of 8 size difference can mean that doing a whole row or column only takes part of L1, instead of overflowing L1. So it can make cache-blocking easier / less important.

10k * 20k / 8 = 23.84 MiB per matrix, using packed-bit elements. That's much larger than L2 cache (256kiB on Haswell, 1MiB on Skylake-AVX512), but will fit in L3 on many-core Xeon CPUs. But L3 is competitively shared by all cores (including other VMs in a cloud environment), and is much slower than L2. (Many-core Xeons like you will be running on in HPC / cloud systems have lower per-core memory bandwidth than quad-core desktops, because of higher latency to L3 cache with no increase in concurrency (see the "latency-bound platforms" section of this answer). It takes more cores to drive the same amount of memory bandwidth on a Xeon, even though the total throughput is higher. But if you can have each core mostly working out of its private L2, you gain a LOT.)


Adding up the AND results: You've arranged your loops so you need to reduce a single run of booleans to a count of the non-zeros. This is a good thing.

With 8-bit integer 0/1 elements, you can do up to 255 vpaddb before an element could overflow. It has good throughput: 2 per clock on Haswell, 3 per clock on Skylake. With multiple accumulators, that covers a lot of vectors of AND results. Use vpsadbw against an all-zero vector to horizontally add the bytes in a vector into 64-bit integers. Then combine your accumulators with vpaddq, then horizontally sum it.
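A minimal sketch of that fold and the final horizontal sum, assuming AVX2 (the helper names are illustrative, not from the answer):

#include <immintrin.h>
#include <cstdint>

// Hypothetical helper: fold a vector of byte counts into a 64-bit accumulator vector,
// using vpsadbw against an all-zero vector to sum each group of 8 bytes, then vpaddq.
static inline __m256i flush_bytes_to_qwords(__m256i byte_acc, __m256i qword_acc) {
    __m256i sums = _mm256_sad_epu8(byte_acc, _mm256_setzero_si256()); // 4x 64-bit partial sums
    return _mm256_add_epi64(qword_acc, sums);
}

// Final horizontal sum of the four 64-bit lanes.
static inline uint64_t hsum_epi64(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    __m128i s  = _mm_add_epi64(lo, hi);
    return (uint64_t)_mm_cvtsi128_si64(s) + (uint64_t)_mm_extract_epi64(s, 1);
}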

With packed bits, you just want to popcount the vectors of AND results. With AVX2 and your data already in vectors, you definitely want to use a VPSHUFB-based bit-slicing popcount. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)
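For example, here is a hedged sketch of that nibble-LUT (vpshufb) popcount applied to one 256-bit AND result, following the standard technique from the linked article (names are illustrative):

#include <immintrin.h>

// Hypothetical sketch: per-byte popcount of (a & b) for fully packed bits, AVX2.
// mask low nibble / shift + mask high nibble / two vpshufb lookups / add.
static inline __m256i popcount_bytes_of_and(__m256i a, __m256i b) {
    const __m256i lut = _mm256_setr_epi8(          // popcount of each 4-bit index, per 128-bit lane
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i v  = _mm256_and_si256(a, b);
    __m256i lo = _mm256_and_si256(v, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
    return _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                           _mm256_shuffle_epi8(lut, hi));   // counts 0..8 per byte
}

The per-byte counts it returns would then feed the vpaddb / vpsadbw accumulation sketched above.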

You could consider packing your data 4 bits per byte, in the low nibble. That would mean one vpshufb could count the bits in each byte of an AND result, without needing any shifting / masking. Inside the inner loop, you'd have 2 loads, vpand, vpshufb, vpaddb. With proper unrolling, that should keep up with L1D load bandwidth of 2x 32B per clock, and saturate all three vector execution ports (on Haswell or Skylake). Break out of that every 128 or 255 vectors or something to accumulate the bytes of your accumulator(s) with vpsadbw/vpaddq. (But with cache-blocking, you probably want to break out of that loop anyway.)
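Putting that 4-bits-per-byte strategy together, here is a hedged sketch of such an inner loop. The function name, the 63-vector flush interval (chosen so byte counts of at most 4 per iteration can't overflow 255), and the alignment/size assumptions are mine, not from the answer:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Hypothetical inner loop for the 4-elements-per-byte (low-nibble) format:
// per 32-byte vector of each input: load, vpand, one vpshufb nibble popcount, vpaddb.
// Assumes `a` and `b` are 32-byte aligned and `nbytes` is a multiple of 32.
static inline uint64_t dot_packed4(const uint8_t* a, const uint8_t* b, size_t nbytes) {
    const __m256i lut = _mm256_setr_epi8(          // nibble popcount table, per 128-bit lane
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    __m256i qword_acc = _mm256_setzero_si256();
    __m256i byte_acc  = _mm256_setzero_si256();
    int pending = 0;                               // vectors accumulated since the last flush
    for (size_t i = 0; i < nbytes; i += 32) {
        __m256i v = _mm256_and_si256(_mm256_load_si256((const __m256i*)(a + i)),
                                     _mm256_load_si256((const __m256i*)(b + i)));
        // elements live only in the low nibble, so no shift/mask before the LUT lookup
        byte_acc = _mm256_add_epi8(byte_acc, _mm256_shuffle_epi8(lut, v));
        if (++pending == 63) {                     // each byte gains at most 4 per iteration
            qword_acc = _mm256_add_epi64(qword_acc,
                          _mm256_sad_epu8(byte_acc, _mm256_setzero_si256()));
            byte_acc = _mm256_setzero_si256();
            pending = 0;
        }
    }
    qword_acc = _mm256_add_epi64(qword_acc,
                  _mm256_sad_epu8(byte_acc, _mm256_setzero_si256()));
    // horizontal sum of the four 64-bit lanes
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(qword_acc),
                              _mm256_extracti128_si256(qword_acc, 1));
    return (uint64_t)_mm_cvtsi128_si64(s) + (uint64_t)_mm_extract_epi64(s, 1);
}

With cache-blocking, you'd presumably call this on L1-sized strips of the packed rows/columns rather than on a whole 20K-element row at once.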

