python - Why are multiprocessing.sharedctypes assignments so slow?


asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)


Here's a little benchmarking code to illustrate my question:

import numpy as np
import multiprocessing as mp
# allocate memory
%time temp = mp.RawArray(np.ctypeslib.ctypes.c_uint16, int(1e8))
Wall time: 46.8 ms
# assign memory, very slow
%time temp[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 10.3 s
# equivalent numpy assignment, 100X faster
%time a = np.arange(1e8, dtype = np.uint16)
Wall time: 111 ms

Basically I want a numpy array to be shared between multiple processes because it's big and read-only. This method works great: no extra copies are made, and the actual computation time on the processes is good. But the overhead of creating the shared array is immense.

This post offered some great insight into why certain ways of initializing the array are slow (note that the example above already uses the faster method). But the post doesn't describe how to bring the speed up to numpy-like performance.

Does anyone have any suggestions on how to improve the speed? Would some cython code make sense to allocate the array?

I'm working on a Windows 7 x64 system.


1 Answer

深蓝 · Answer 1 · 2021-10-23T19:18:12+0000

This is slow for the reasons given in your second link, and the solution is actually pretty simple: bypass the (slow) RawArray slice assignment code. In this case that code inefficiently reads one raw C value at a time from the source array to create a Python object, converts that object straight back to raw C for storage in the shared array, discards the temporary Python object, and repeats this 1e8 times.
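To make the per-element cost concrete, here is a minimal sketch of roughly what the slice assignment amounts to. It uses the standard library's array.array as a stand-in for the numpy source array (an assumption, to keep the example dependency-free), and the names via_slice and via_loop are hypothetical. The real loop runs in C inside ctypes, but it does the same per-item work of boxing each value into a Python int and unboxing it again:

```python
import array
import ctypes
from multiprocessing import RawArray

n = 1000
# Stand-in for the numpy source array (assumption: array.array instead of numpy)
src = array.array('H', range(n))

via_slice = RawArray(ctypes.c_uint16, n)
via_loop = RawArray(ctypes.c_uint16, n)

# What the question does: slice assignment on the RawArray itself
via_slice[:] = src

# Roughly the same work spelled out: every element becomes a Python int
# and is then converted back to a raw C uint16
for i in range(n):
    via_loop[i] = src[i]
```

Both arrays end up with identical contents; the point is only that the first form does one object round trip per element rather than a single bulk copy.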

But you don't need to do it that way; like most C level things, RawArray implements the buffer protocol, which means you can convert it to a memoryview, a view of the underlying raw memory that implements most operations in C-like ways, using raw memory operations if possible. So instead of doing:

# assign memory, very slow
%time temp[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 9.75 s  # Updated to what my machine took, for valid comparison

use memoryview to manipulate it as a raw bytes-like object and assign that way (the array returned by np.arange already implements the buffer protocol, and memoryview's slice assignment seamlessly uses it):

# C-like memcpy effectively, very fast
%time memoryview(temp)[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 74.4 ms  # Takes 0.76% of original time!!!

Note that the time for the latter is in milliseconds, not seconds: copying through a memoryview, which performs raw memory transfers, takes less than 1% of the time of the plodding way RawArray does it by default!
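One caveat, offered as a hedged sketch rather than part of the original timings: on Python 3, memoryview slice assignment requires the source and destination format strings to match, and a ctypes array typically reports an endian-prefixed format such as '<H' while array.array and numpy report plain 'H', so the direct assignment may raise ValueError. Casting the view to bytes and back to native 'H' should sidestep that. The helper name copy_into_shared is hypothetical, and array.array again stands in for the numpy source (a uint16 numpy array should work the same way as the right-hand side):

```python
import array
import ctypes
from multiprocessing import RawArray


def copy_into_shared(shared, src):
    """Bulk-copy a buffer of uint16 values into a c_uint16 RawArray."""
    # Cast to raw bytes, then to native 'H', so the view's format string
    # matches the 'H' format exported by the source buffer.
    view = memoryview(shared).cast('B').cast('H')
    # Single raw memory copy; no per-element Python objects are created.
    view[:] = src


n = 100_000
# Stand-in for np.arange(n, dtype=np.uint16); values wrap at 65536
src = array.array('H', (i % 65536 for i in range(n)))
shared = RawArray(ctypes.c_uint16, n)
copy_into_shared(shared, src)
```

The source and destination must have the same length and item size, otherwise the assignment raises instead of copying.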
