I'm using the Anaconda distribution of Python, together with Numba, and I've written the following Python function that multiplies a sparse matrix A
(stored in a CSR format) by a dense vector x
:
@jit
def csrMult( x, Adata, Aindices, Aindptr, Ashape ):
numRowsA = Ashape[0]
Ax = numpy.zeros( numRowsA )
for i in range( numRowsA ):
Ax_i = 0.0
for dataIdx in range( Aindptr[i], Aindptr[i+1] ):
j = Aindices[dataIdx]
Ax_i += Adata[dataIdx] * x[j]
Ax[i] = Ax_i
return Ax
Here A
is a large scipy
sparse matrix,
>>> A.shape
( 56469, 39279 )
# having ~ 142,258,302 nonzero entries (so about 6.4% )
>>> type( A[0,0] )
dtype( 'float32' )
and x
is a numpy
array. Here is a snippet of code that calls the above function:
x = numpy.random.randn( A.shape[1] )
Ax = A.dot( x )
AxCheck = csrMult( x, A.data, A.indices, A.indptr, A.shape )
Notice the @jit
-decorator that tells Numba to do a just-in-time compilation for the csrMult()
function.
In my experiments, my function csrMult()
is about twice as fast as the scipy
.dot()
method. That is a pretty impressive result for Numba.
However, MATLAB still performs this matrix-vector multiplication about 6 times faster than csrMult()
. I believe that is because MATLAB uses multithreading when performing sparse matrix-vector multiplication.
Question:
How can I parallelize the outer for
-loop when using Numba?
Numba used to have a prange()
function, that made it simple to parallelize embarassingly parallel for
-loops. Unfortunately, Numba no longer has prange()
[actually, that is false, see the edit below]. So what is the correct way to parallelize this for
-loop now, that Numba's prange()
function is gone?
When prange()
was removed from Numba, what alternative did the developers of Numba have in mind?
Edit 1:
I updated to the latest version of Numba, which is .35, and prange()
is back! It was not included in version .33, the version I had been using.
That is good news, but unfortunately I am getting an error message when I attempt to parallelize my for loop using prange()
. Here is a parallel for loop example from the Numba documentation (see section 1.9.2 "Explicit Parallel Loops"), and below is my new code:
from numba import njit, prange
@njit( parallel=True )
def csrMult_numba( x, Adata, Aindices, Aindptr, Ashape):
numRowsA = Ashape[0]
Ax = np.zeros( numRowsA )
for i in prange( numRowsA ):
Ax_i = 0.0
for dataIdx in range( Aindptr[i],Aindptr[i+1] ):
j = Aindices[dataIdx]
Ax_i += Adata[dataIdx] * x[j]
Ax[i] = Ax_i
return Ax
When I call this function, using the code snippet given above, I receive the following error:
AttributeError: Failed at nopython (convert to parfors) 'SetItem'
object has no attribute 'get_targets'
Given
the above attempt to use prange
crashes, my question stands:
What is the correct way ( using prange
or an alternative method ) to parallelize this Python for
-loop?
As noted below, it was trivial to parallelize a similar for loop in C++ and obtain an 8x speedup, having been run on 20-omp-threads. There must be a way to do it using Numba, since the for loop is embarrassingly parallel (and since sparse matrix-vector multiplication is a fundamental operation in scientific computing).
Edit 2:
Here is my C++ version of csrMult()
. Parallelizing the for()
loop in the C++ version makes the code about 8x faster in my tests. This suggests to me that a similar speedup should be possible for the Python version when using Numba.
void csrMult(VectorXd& Ax, VectorXd& x, vector<double>& Adata, vector<int>& Aindices, vector<int>& Aindptr)
{
// This code assumes that the size of Ax is numRowsA.
#pragma omp parallel num_threads(20)
{
#pragma omp for schedule(dynamic,590)
for (int i = 0; i < Ax.size(); i++)
{
double Ax_i = 0.0;
for (int dataIdx = Aindptr[i]; dataIdx < Aindptr[i + 1]; dataIdx++)
{
Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]];
}
Ax[i] = Ax_i;
}
}
}
See Question&Answers more detail:
os