python 3.x - Increase performance of np.where() loop

Question

Welcome To Ask or Share your Answers For Others

python 3.x - Increase performance of np.where() loop

asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

python 3.x - Increase performance of np.where() loop

I am trying to speed up the code for the following script (ideally >4x) w/o multiprocessing. In a future step, I will implement multiprocessing, but the current speed is too slow even if I split it up to 40 cores. Therefore I'm trying to optimize to code first.

import numpy as np

def loop(x,y,q,z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind],y[ind],q,z))
    return matchlist

def find_match(x,y,q,z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A,B)


# N will finally scale up to 10^9
N = 1000
M = 300

X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)

# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)

# convert int32 arrays to str64 arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)

matchlist = loop(X,Y,Q,Z)

Some details:

I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).

I also have two 'reference arrays' (Q and Z) which are identical in length and I want to find the occurrence (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersect between the indexes found for X and Y.

Example output (matchlist; some rows of X/Y have matching indexes in Q/Y, some don't e.g. index 11):

The code works fine so far, but it would take quite long to execute with my final dataset where N=10^9 (in this code example N=1000 to make the tests quicker). 1000 rows of X/Y need about 2.29s to execute on my laptop:

Every find_match() takes about 2.48 ms to execute which is roughly 1/1000 of the final loop.

One first approach would be to combine (x with y) as well as (q with z) and then I only need to run np.where() once, but I couldn't get that to work yet.

What I've tried so far:

loop and lookup within Pandas (.loc()) but this was about 4x slower than np.where() method

The question is closely related to a recent question from philshem (Combine Numpy "where" statements), however, the solutions provided on this question do not work for my approach here.

question from:https://stackoverflow.com/questions/65883951/increase-performance-of-np-where-loop

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-06T19:20:36+0000

Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the indexes.

lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]

If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.

lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True
matchlist = lookup[X, Y]

You typically won't want to use this method to replace the former jagged case, as dense variants (eg. Daniel F's solution) will be memory inefficient and numpy does not support sparse arrays well. However, if more speed is needed then a sparse solution is certainly possible.

Categories

python 3.x - Increase performance of np.where() loop

python 3.x - Increase performance of np.where() loop

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags