Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
352 views
in Technique[技术] by (71.8m points)

python - Numpy in-place operation performance

I was comparing numpy array in-place operation with regular operation. And here is what I did (Python version 3.7.3):

    a1, a2 = np.random.random((10,10)), np.random.random((10,10))

In order to make comparison:

    def func1(a1, a2):
        a1 = a1 + a2

    def func2(a1, a2):
        a1 += a2

%timeit func1(a1, a2)
%timeit func2(a1, a2)

Because in-place operation avoid the allocation of memory for each loop. I was expecting func1 to be slower than func2.

However I got this:

In [10]: %timeit func1(a1, a2)
595 ns ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [11]: %timeit func2(a1, a2)
1.38 μs ± 7.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [12]: np.__version__
Out[12]: '1.16.2'

Which suggests func1 is only 1/2 of the time took by func2. Can anyone help to explain why this is the case?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I found this very intriguing and decided to time this myself. But instead of just checking for 10x10 arrays I tested a lot of different array sizes with NumPy 1.16.2:

enter image description here

This clearly shows that for small array sizes the normal addition is faster and only for moderately large array sizes the in-place operation is faster. There is also a weird bump around 100000 elements that I cannot explain (it's close to the page size on my computer, maybe there a different allocation scheme is used).

Allocating a temporary array is expected to be slower because:

  • One has to allocate that memory
  • One has to iterate over 3 arrays do perform the operation instead of 2.

Especially the first point (allocating the memory) is probably not accounted for in the benchmark (not with %timeit not with the simple_benchmark.run). That's because requesting the same memory-size over and over again will be something that is probably very optimized. Which would make the addition with an extra array seem a bit faster than it actually is.

Another point to mention here is that in-place addition probably has a higher constant factor. If you're doing an in-place addition you have do to more code-checks before you can perform the operation, for example for overlapping inputs. That could give in-place addition a higher constant factor.

As a more general advise: Micro-benchmarks can be helpful but they are not always really accurate. You should also benchmark the code that calls it to make more educated statements about the actual performance of your code. Often such micro-benchmarks hit some highly optimized cases (for example repeatedly allocating the same amount of memory and releasing it again), that wouldn't happen (so often) when the code is actually used.

Here is also the code I used for the graph, using my library simple_benchmark:

from simple_benchmark import BenchmarkBuilder, MultiArgument
import numpy as np

b = BenchmarkBuilder()

@b.add_function()
def func1(a1, a2):
    a1 = a1 + a2

@b.add_function()
def func2(a1, a2):
    a1 += a2

@b.add_arguments('array size')
def argument_provider():
    for exp in range(3, 28):
        dim_size = int(1.4**exp)
        a1 = np.random.random([dim_size, dim_size])
        a2 = np.random.random([dim_size, dim_size])
        yield dim_size ** 2, MultiArgument([a1, a2])

r = b.run()
r.plot()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...