Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
201 views
in Technique[技术] by (71.8m points)

python 3.x - Increase performance of np.where() loop

I am trying to speed up the code for the following script (ideally >4x) w/o multiprocessing. In a future step, I will implement multiprocessing, but the current speed is too slow even if I split it up to 40 cores. Therefore I'm trying to optimize to code first.

import numpy as np

def loop(x,y,q,z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind],y[ind],q,z))
    return matchlist

def find_match(x,y,q,z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A,B)


# N will finally scale up to 10^9
N = 1000
M = 300

X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)

# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)

# convert int32 arrays to str64 arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)

matchlist = loop(X,Y,Q,Z)

Some details:

I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).

I also have two 'reference arrays' (Q and Z) which are identical in length and I want to find the occurrence (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersect between the indexes found for X and Y.

Example output (matchlist; some rows of X/Y have matching indexes in Q/Y, some don't e.g. index 11):

Example output of matchlist

The code works fine so far, but it would take quite long to execute with my final dataset where N=10^9 (in this code example N=1000 to make the tests quicker). 1000 rows of X/Y need about 2.29s to execute on my laptop: timeit test of loop()

Every find_match() takes about 2.48 ms to execute which is roughly 1/1000 of the final loop. timeit test of find_match()

One first approach would be to combine (x with y) as well as (q with z) and then I only need to run np.where() once, but I couldn't get that to work yet.

What I've tried so far:

  • loop and lookup within Pandas (.loc()) but this was about 4x slower than np.where() method

The question is closely related to a recent question from philshem (Combine Numpy "where" statements), however, the solutions provided on this question do not work for my approach here.

question from:https://stackoverflow.com/questions/65883951/increase-performance-of-np-where-loop

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the indexes.

lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]

If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.

lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True
matchlist = lookup[X, Y]

You typically won't want to use this method to replace the former jagged case, as dense variants (eg. Daniel F's solution) will be memory inefficient and numpy does not support sparse arrays well. However, if more speed is needed then a sparse solution is certainly possible.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...