Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
752 views
in Technique[技术] by (71.8m points)

python - Pytorch speed comparison - GPU slower than CPU

I was trying to find out if GPU tensor operations are actually faster than CPU ones. So, I wrote this particular code below to implement a simple 2D addition of CPU tensors and GPU cuda tensors successively to see the speed difference:

import torch
import time

###CPU
start_time = time.time()
a = torch.ones(4,4)
for _ in range(1000000):
    a += a
elapsed_time = time.time() - start_time

print('CPU time = ',elapsed_time)

###GPU
start_time = time.time()
b = torch.ones(4,4).cuda()
for _ in range(1000000):
    b += b
elapsed_time = time.time() - start_time

print('GPU time = ',elapsed_time)

To my surprise, the CPU time was 0.93 sec and the GPU time was as high as 63 seconds. Am I doing the cuda tensor operation properly or is the concept of cuda tensors works faster only in very highly complex operations, like in neural networks?

Note: My GPU is NVIDIA 940MX and torch.cuda.is_available() call returns True.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

GPU acceleration works by heavy parallelization of computation. On a GPU you have a huge amount of cores, each of them is not very powerful, but the huge amount of cores here matters.

Frameworks like PyTorch do their to make it possible to compute as much as possible in parallel. In general matrix operations are very well suited for parallelization, but still it isn't always possible to parallelize computation!

In your example you have a loop:

b = torch.ones(4,4).cuda()
for _ in range(1000000):
    b += b

You have 1000000 operations, but due to the structure of the code it impossible to parallelize much of these computations. If you think about it, to compute the next b you need to know the value of the previous (or current) b.

So you have 1000000 operations, but each of these has to be computed one after another. Possible parallelization is limited to the size of your tensor. This size though is not very large in your example:

torch.ones(4,4)

So you only can parallelize 16 operations (additions) per iteration. As the CPU has few, but much more powerful cores, it is just much faster for the given example!

But things change if you change the size of the tensor, then PyTorch is able to parallelize much more of the overall computation. I changed the iterations to 1000 (because I did not want to wait so long :), but you can put in any value you like, the relation between CPU and GPU should stay the same.

Here are the results for different tensor sizes:

#torch.ones(4,4)       - the size you used
CPU time =  0.00926661491394043
GPU time =  0.0431208610534668

#torch.ones(40,40)     - CPU gets slower, but still faster than GPU
CPU time =  0.014729976654052734
GPU time =  0.04474186897277832

#torch.ones(400,400)   - CPU now much slower than GPU
CPU time =  0.9702610969543457
GPU time =  0.04415607452392578

#torch.ones(4000,4000) - GPU much faster then CPU 
CPU time =  38.088677167892456
GPU time =  0.044649362564086914

So as you see, where it is possible to parallelize stuff (here the addition of the tensor elements), GPU becomes very powerful.
GPU time is not changing at all for the given calculations, the GPU can handle much more!
(as long as it doesn't run out of memory :)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...