How does OpenMP use atomic instructions inside a reduction? Does it
rely on atomics at all? For instance, is the variable sum in the code
below accumulated with an atomic + operation?
Since the OpenMP standard does not specify how the reduction
clause should (or should not) be implemented (e.g., whether or not it is based
on atomic operations), its implementation may vary across concrete implementations of the OpenMP standard.
Nonetheless, from the OpenMP standard, one can read the following:
The reduction clause can be used to perform some forms of recurrence
calculations (...) in parallel. For parallel and work-sharing constructs, a
private copy of each list item is created, one for each implicit task,
as if the private clause had been used. (...) The private copy is
then initialized as specified above. At the end of the region for
which the reduction clause was specified, the original list item is
updated by combining its original value with the final value of each
of the private copies, using the combiner of the specified
reduction-identifier.
So based on that, one can infer that the variables used in the reduction clause will be private and, consequently, will not be updated atomically. Even if that were not the case, it would be unlikely that a concrete implementation of the OpenMP standard would rely on an atomic
operation (for the statement sum += v[i];
), since (in this case) it would not be the most efficient strategy. For more information on why that is the case, check the following SO threads:
- Why my parallel code using openMP atomic takes a longer time than serial code?;
- Why should I use a reduction rather than an atomic variable?.
Very informally, a more efficient approach than using atomic
would be for each thread to have its own copy of the variable sum,
and at the end of the parallel region
each thread would save its copy into a resource shared among the threads -- now, depending on how the reduction is implemented, atomic
operations might be used to update that shared resource. That resource would then be picked up by the master thread, which would reduce its content and update the original sum
variable accordingly.
More formally, from OpenMP Reductions Under the Hood:
After having revisited parallel reductions in detail you might still
have some open questions about how OpenMP actually transforms your
sequential code into parallel code. In particular, you might wonder
how OpenMP detects the portion in the body of the loop that performs
the reduction. As an example, this or a similar code fragment can
often be found in code samples:
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
x -= some_value;
You could also use - as reduction operator (which is actually
redundant to +). But how does OpenMP isolate the
update step x -= some_value? The discomforting answer is that OpenMP
does not detect the update at all! The compiler treats the body of the
for-loop like this:
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
x = some_expression_involving_x_or_not(x);
As a result, the modification of x could also be hidden behind an opaque function call.
This is a comprehensible decision from the point of view of a compiler
developer. Unfortunately, this means that you have to ensure that all
updates of x are compatible with the operation defined in the
reduction clause.
The overall execution flow of a reduction can be summarized as
follows:
- Spawn a team of threads and determine the set of iterations that each thread j has to perform.
- Each thread declares a privatized variant of the reduction variable x initialized with the neutral element e of the corresponding
monoid.
- All threads perform their iterations, no matter whether or how they involve an update of the privatized variable.
- The result is computed as a sequential reduction over the (local) partial results and the global variable x. Finally, the result is
written back to x.