c++ - Parallel prefix sum - fastest Implementation

Question

Welcome To Ask or Share your Answers For Others

c++ - Parallel prefix sum - fastest Implementation

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - Parallel prefix sum - fastest Implementation

I want to implement the parallel prefix sum algorithm using C++. My program should take the input array x[1....N],and it should display the output in the array y[N]. (Note the maximum value of N is 1000.)

So far, I went through many research papers and even the algorithm in Wikipedia. But my program should also display the output, the steps and also the operations/instructions of each step.

I want the fastest implementation like I want to minimise the number of operations as well as the steps.

For example::

x = {1, 2, 3,  4,   5,   6,   7,  8 } - Input
y = ( 1, 3, 6, 10, 15, 21, 28, 36) - Output

But along with displaying the y array as output, my program should also display the operations of each step. I also refer this thread calculate prefix sum ,but could get much help from it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:48:29+0000

The answer to this question is here: Parallel Prefix Sum (Scan) with CUDA and here: Prefix Sums and Their Applications. The NVidia article provides the best possible implementation using CUDA GPUs, and the Carnegie Mellon University PDF paper explains the algorithm. I also implemented an O(n/p) prefix sum using MPI, which you can find here: In my github repo.

This is the pseudocode for the generic algorithm (platform independent):

Example 3. The Up-Sweep (Reduce) Phase of a Work-Efficient Sum Scan Algorithm (After Blelloch 1990)

 for d = 0 to log2(n) – 1 do 
      for all k = 0 to n – 1 by 2^(d+1) in parallel do 
           x[k +  2^(d+1) – 1] = x[k +  2^d  – 1] + x[k +  2^(d+1) – 1]

Example 4. The Down-Sweep Phase of a Work-Efficient Parallel Sum Scan Algorithm (After Blelloch 1990)

 x[n – 1] = 0
 for d = log2(n) – 1 down to 0 do 
       for all k = 0 to n – 1 by 2^(d+1) in parallel do 
            t = x[k +  2^d  – 1]
            x[k +  2^d  – 1] = x[k +  2^(d+1) – 1]
            x[k +  2^(d+1) – 1] = t +  x[k +  2^(d+1) – 1]

Where x is the input data, n is the size of the input and d is the degree of parallelism (number of CPUs). This is a shared memory computation model, if it used distributed memory you'd need to add communication steps to that code, as I did in the provided Github example.

Categories

c++ - Parallel prefix sum - fastest Implementation

c++ - Parallel prefix sum - fastest Implementation

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags