I just got a "good answer" badge two years after I posted this answer. Thanks for acknowledging its quality. In return, I will substantially enrich the original content. This answer / article is now intended for any R user who wants to try out SIMD. It has the following sections:
- Background: what is SIMD?
- Summary: how can we leverage SIMD for our assembly or compiled code?
- R side: how can R possibly use SIMD?
- Performance: is vector code always faster than scalar code?
- Writing R extensions: write compiled code with OpenMP SIMD
Background: what is SIMD?
Many R programmers may not know SIMD if they don't write assembly or compiled code.
SIMD (single instruction, multiple data) is a data-level parallel processing technology with a very long history. Before personal computers were around, SIMD unambiguously referred to the vector processing in a vector processor, and was the main route to high-performance computing. When personal computers later came onto the market, they didn't have any features resembling those of a vector processor. However, as the demand for processing multimedia data grew higher and higher, they began to have vector registers and corresponding sets of vector instructions to use those registers for vector data load, vector arithmetic and vector data store. The capacity of vector registers keeps getting bigger, and the functionality of vector instruction sets is increasingly versatile. Today, they can do stream load / store, strided load / store, scattered load / store, vector element shuffling, vector arithmetic (including fused arithmetic like fused multiply-add), vector logical operations, masking, etc. So they look more and more like the mini vector processors of the old days.
Although SIMD has been with personal computers for nearly two decades, many programmers are unaware of it. Instead, many are familiar with thread-level parallelism like multi-core computing (which can be referred to as MIMD). So if you are new to SIMD, I highly recommend watching the YouTube video Utilizing the other 80% of your system's performance: Starting with Vectorization by Ulrich Drepper from Red Hat Linux.
Since vector instruction sets are extensions to the original architecture instruction sets, you have to invest extra effort to use them. If you write assembly code, you can call these instructions straightaway. If you write compiled code like C, C++ or Fortran, you have to write inline assembly or use vector intrinsics. A vector intrinsic looks like a function, but it is in fact inline assembly mapping to a vector assembly instruction. These intrinsics (or "functions") are not part of the standard libraries of a compiled language; they are provided by the architecture / machine. To use them, we need to include appropriate header files and pass compiler-specific flags.
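To make this concrete, here is a minimal sketch (my own illustration, not from the original text) of what vector intrinsics look like in C on x86-64 with SSE2; the function name `add_vec`, the file name and the compile command are only illustrative:

```c
/* Minimal SSE2 intrinsics sketch: add two double arrays 2 elements at a time.
 * Hypothetical compile command: gcc -O2 -msse2 -c add_sse2.c */
#include <emmintrin.h>   /* header providing the SSE2 intrinsics */

void add_vec(const double *a, const double *b, double *c, int n)
{
    int i = 0;
    /* process 2 doubles per iteration using 128-bit XMM registers */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* vector load (unaligned) */
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vc = _mm_add_pd(va, vb);    /* vector addition */
        _mm_storeu_pd(c + i, vc);           /* vector store */
    }
    /* scalar tail for the remaining element, if n is odd */
    for (; i < n; i++) c[i] = a[i] + b[i];
}
```

Each `_mm_*` call maps to a single SSE2 instruction, which is why the code reads like assembly wrapped in function syntax.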
Let's first define the following for ease of later discussions:
- Writing assembly or compiled code that does not use vector instruction sets is called "writing scalar code";
- Writing assembly or compiled code that uses vector instruction sets is called "writing vector code".
So these two paths are "write A, get A" and "write B, get B". However, compilers are getting stronger and there is another "write A, get B" path:
- They have the power to translate your scalar compiled code into vector assembly code, a compiler optimization called "auto-vectorization".
Some compilers like GCC consider auto-vectorization part of the highest-level optimization, enabled by the flag `-O3`; other more aggressive compilers like ICC (Intel C++ compiler) and clang enable it at `-O2`. Auto-vectorization can also be controlled directly by specific flags; for example, GCC has `-ftree-vectorize`. When exploiting auto-vectorization, it is advised to further hint the compiler to tailor the vector assembly code to the host machine, for example with `-march=native` for GCC and `-xHost` for ICC. This makes sense, because even within the x86-64 architecture family, later microarchitectures come with more vector instruction sets. For example, sandybridge supports vector instruction sets up to AVX, haswell further supports AVX2 and FMA3, and skylake further supports AVX-512. Without `-march=native`, GCC only generates vector instructions using instruction sets up to SSE2, a much smaller subset common to all x86-64 machines.
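As an illustration (a sketch of my own, not part of the original discussion), the scalar loop below is a typical candidate for auto-vectorization; the function name `axpy` and the compile command are arbitrary choices:

```c
/* A plain scalar loop that GCC can auto-vectorize. Hypothetical build:
 *   gcc -O3 -march=native -fopt-info-vec -c axpy.c
 * -O3 turns on -ftree-vectorize; -march=native lets GCC use AVX / AVX2 / FMA
 * where available; -fopt-info-vec reports which loops were vectorized. */
void axpy(double a, const double *x, double *restrict y, int n)
{
    /* restrict tells the compiler x and y do not alias, which helps vectorization */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```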
Summary: how can we leverage SIMD for our assembly or compiled code?
There are five ways to implement SIMD:
1. Writing machine-specific vector assembly code directly. For example, on x86-64 we use the SSE / AVX instruction sets, and on ARM architectures we use the NEON instruction sets.
   - Pros: Code can be hand-tuned for better performance;
   - Cons: We have to write different versions of assembly code for different machines.
2. Writing vector compiled code by using machine-specific vector intrinsics and compiling it with compiler-specific flags. For example, on x86-64 we use SSE / AVX intrinsics, and for GCC we set `-msse2`, `-mavx`, etc (or simply `-march=native` for auto-detection). A variant of this option is to write compiler-specific inline assembly. For example, an introduction to GCC's inline assembly can be found here.
   - Pros: Writing compiled code is easier than writing assembly code, and the code is more readable hence easier to maintain;
   - Cons: We have to write different versions of code for different machines, and adapt the Makefile for different compilers.
3. Writing vector compiled code by using compiler-specific vector extensions. Some compilers have defined their own vector data types. For example, GCC's vector extensions can be found here (a minimal sketch is given after this list).
   - Pros: We don't need to worry about differences across architectures, as the compiler can generate machine-specific assembly;
   - Cons: We have to write different versions of code for different compilers, and adapt the Makefile likewise.
4. Writing scalar compiled code and using compiler-specific flags for auto-vectorization. Optionally we can insert compiler-specific pragmas into our compiled code to give the compiler more hints on, say, data alignment, loop unrolling depth, etc.
   - Pros: Writing scalar compiled code is easier than writing vector compiled code, and is more readable to a broad audience;
   - Cons: We have to adapt the Makefile for different compilers, and in case we have used pragmas, they also need to be versioned.
5. Writing scalar compiled code and inserting OpenMP pragmas (requiring OpenMP 4.0+) like `#pragma omp simd`.
   - Pros: same as option 4, and additionally, we can use a single version of pragmas as many mainstream compilers support the OpenMP standard;
   - Cons: We have to adapt the Makefile for different compilers as they may have different flags to enable OpenMP and machine-specific tuning.
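To give a flavour of option 3, here is a minimal sketch of my own using GCC's vector extensions (not taken from the official documentation); the type name `v4df` is just an illustrative choice:

```c
/* GCC vector extension sketch: a user-defined vector type of 4 doubles.
 * Hypothetical compile command: gcc -O2 -march=native -c vecext.c */
typedef double v4df __attribute__((vector_size(32)));  /* 4 doubles = 256 bits */

v4df add4(v4df a, v4df b)
{
    /* element-wise addition; GCC chooses suitable SSE / AVX instructions */
    return a + b;
}
```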
From top to bottom, programmers progressively do less and compilers do increasingly more. Implementing SIMD is interesting, but this article unfortunately does not have room for a decent coverage with examples. I will instead provide the most informative references I found.
For options 1 and 2 on x86-64, SSE / AVX intrinsics are definitely the best reference card, but not the right place to start learning these instructions. Where to start is individual-specific. I picked up intrinsics / assembly from BLISLab when I tried to write my own high-performance DGEMM (to be introduced later). After digesting the example code over there I started practising, and posted a few questions on StackOverflow or CodeReview when I got stuck.
For option 4, a good explanation is given by A guide to auto-vectorization with Intel C++ compilers. Although the manual is for ICC, the principle of how auto-vectorization works applies to GCC as well. The official website for GCC's auto-vectorization is rather outdated, and these presentation slides are more useful: GCC auto-vectorization.
For option 5, there is a very good technical report by Oak Ridge National Laboratory: Effective Vectorization with OpenMP 4.5.
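To sketch what option 5 looks like in code (a minimal example of my own, not taken from that report; the function name `dot` and the compile commands are only illustrative):

```c
/* Scalar C code with an OpenMP 4.0 SIMD pragma (option 5).
 * Hypothetical compile commands:
 *   gcc -O2 -fopenmp-simd -c dot.c    (GCC: honour SIMD pragmas only, no threading runtime)
 *   icc -O2 -qopenmp -xHost -c dot.c  (ICC) */
double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    /* ask the compiler to vectorize this loop, treating s as a reduction variable */
    #pragma omp simd reduction(+:s)
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}
```

The pragma itself is portable across compilers; only the enabling flag in the Makefile differs, which is exactly the trade-off described above.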
In terms of portability:
- Options 1 to 3 are not easily portable, because the version of vector code depends on the machine and / or the compiler;
- Option 4 is much better as we get rid of the machine dependency, but we still have a problem with compiler dependency;
- Option 5 is very close to portable, as adapting the Makefile is far easier than adapting code.
In terms of performance, conventionally it is believed that option 1 is the best, and that performance degrades as we move downward. However, compilers are getting better, and newer machines have hardware improvements (for example, the performance penalty for unaligned vector load is smaller). So the outlook for auto-vectorization is very positive. As part of my own DGEMM case study, I found that on an Intel Xeon E5-2650 v2 workstation with a peak performance of 18 GFLOPs per CPU, GCC's auto-vectorization attained 14 ~ 15 GFLOPs, which is rather impressive.
R side: how can R possibly use SIMD?
R can only use SIMD by calling compiled code that exploits SIMD. Compiled code in R has three sources:
- R packages with "base" priority, like `base`, `stats`, `utils`, etc, that come with R's source code;
- Other packages on CRAN that require compilation;
- External scientific libraries like BLAS and LAPACK.
Since R software itself is portable across architectures, platforms and operating systems, and CRAN policy expects an R package to be equally portable, compiled code in sources 1 and 2 cannot be written in assembly code or compiler-dependent compiled code, ruling out options 1 to 3 for SIMD implementation. Auto-vectorization is the only chance left for R to leverage SIMD.
If you have built R with the compiler's auto-vectorization enabled, compiled code from source 1 can exploit SIMD. In fact, although R is written in a portable way, you can tune it for your machine when building it. For example, I would do `icc -xHost`