Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
454 views
in Technique[技术] by (71.8m points)

c++ - Parallel prefix sum - fastest Implementation

I want to implement the parallel prefix sum algorithm using C++. My program should take the input array x[1....N],and it should display the output in the array y[N]. (Note the maximum value of N is 1000.)

So far, I went through many research papers and even the algorithm in Wikipedia. But my program should also display the output, the steps and also the operations/instructions of each step.

I want the fastest implementation like I want to minimise the number of operations as well as the steps.

For example::

x = {1, 2, 3,  4,   5,   6,   7,  8 } - Input
y = ( 1, 3, 6, 10, 15, 21, 28, 36) - Output

But along with displaying the y array as output, my program should also display the operations of each step. I also refer this thread calculate prefix sum ,but could get much help from it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The answer to this question is here: Parallel Prefix Sum (Scan) with CUDA and here: Prefix Sums and Their Applications. The NVidia article provides the best possible implementation using CUDA GPUs, and the Carnegie Mellon University PDF paper explains the algorithm. I also implemented an O(n/p) prefix sum using MPI, which you can find here: In my github repo.

This is the pseudocode for the generic algorithm (platform independent):

Example 3. The Up-Sweep (Reduce) Phase of a Work-Efficient Sum Scan Algorithm (After Blelloch 1990)

 for d = 0 to log2(n) – 1 do 
      for all k = 0 to n – 1 by 2^(d+1) in parallel do 
           x[k +  2^(d+1) – 1] = x[k +  2^d  – 1] + x[k +  2^(d+1) – 1]

Example 4. The Down-Sweep Phase of a Work-Efficient Parallel Sum Scan Algorithm (After Blelloch 1990)

 x[n – 1] = 0
 for d = log2(n) – 1 down to 0 do 
       for all k = 0 to n – 1 by 2^(d+1) in parallel do 
            t = x[k +  2^d  – 1]
            x[k +  2^d  – 1] = x[k +  2^(d+1) – 1]
            x[k +  2^(d+1) – 1] = t +  x[k +  2^(d+1) – 1]

Where x is the input data, n is the size of the input and d is the degree of parallelism (number of CPUs). This is a shared memory computation model, if it used distributed memory you'd need to add communication steps to that code, as I did in the provided Github example.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...