Vectorization is a mechanism that allows the same instruction to be executed on multiple data items at once. It is commonly referred to as Single Instruction Multiple Data (SIMD). Since these instructions are executed directly by the processor, the set of available operations is quite limited. Generally, these are basic arithmetic operations (addition, subtraction, ...) as well as classic mathematical functions (minimum, maximum, absolute value, ...). Complex mathematical operations (such as logarithm, exponential, ...) are generally not native vector instructions.
Recent processors all support vectorization. However, the vector sizes and possible operations differ from one processor to another.
For example, the following simple loop performs n additions:
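A minimal sketch of such a loop in plain C++ is given here. The function name `addArrays` and the arrays `a`, `b` and `c` are illustrative assumptions and are not part of Arcane.

```cpp
#include <cstddef>
#include <vector>

// Scalar addition: one addition per loop iteration, n additions in total.
// Assumes a and b have the same size.
std::vector<double> addArrays(const std::vector<double>& a,
                              const std::vector<double>& b)
{
  const std::size_t n = a.size();
  std::vector<double> c(n);
  for (std::size_t i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
  return c;
}
```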
With a scalar processor, each register contains a single real number, so an addition instruction operates on only one real number, and n addition instructions are required to perform this calculation. A vector processor has registers containing multiple real numbers. For registers containing P real numbers, the number of addition instructions needed is about n/P. If scalar and vector instructions take the same amount of time, the theoretical speedup factor is therefore P. The larger the registers, the greater the potential benefit of vectorization. In practice, the gain is often smaller, because it also depends on other factors such as memory bandwidth, pipelining, etc.
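As a rough illustration of this instruction count, the helper below (a hypothetical name, not an Arcane function) computes the number of addition instructions for n values and registers of P values. It rounds n/P up, on the assumption that a partially filled last register still costs one instruction.

```cpp
#include <cstddef>

// Number of addition instructions needed for n values when each register
// holds P values: n/P rounded up. Assumes P > 0.
constexpr std::size_t additionCount(std::size_t n, std::size_t P)
{
  return (n + P - 1) / P;
}
```

With P = 1 this gives the scalar count n, and with n = 16 and P = 4 it gives 4, which matches the theoretical speedup factor of P.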
To exploit vectorization, there are two possibilities (which are compatible): letting the compiler vectorize the code automatically, or writing the code explicitly with vector types.
The first solution is the simplest because it does not require changing the code: it only requires the appropriate compiler options. The downside of this simplicity is that it is often difficult for the compiler to detect where vectorization is possible, so the generated code is rarely vectorized. The second method guarantees that vectorization is used but requires rewriting the code. Arcane provides a set of classes to exploit this second method.
The principle is to provide a vector class corresponding to each scalar class. The vector class contains N scalar values, with N depending on the available vectorization type.
Although in theory vectorization can be applied to all simple types (short, int, long, float, ...), Arcane only provides vector classes for Arcane::Real and its derived types (Arcane::Real2, Arcane::Real3).
Currently, Arcane provides the following vector types:
| Scalar type | Vector type | Definition file |
|---|---|---|
| Arcane::Real | Arcane::SimdReal | #include "arcane/utils/Simd.h" |
| Arcane::Real2 | Arcane::SimdReal2 | |
| Arcane::Real3 | Arcane::SimdReal3 | |
| Arcane::Item, Arcane::Cell, Arcane::Face, ... | Arcane::SimdItem, Arcane::SimdCell, Arcane::SimdFace, ... | #include "arcane/core/SimdItem.h" |
Using SIMD classes is similar to scalar usage. It is generally sufficient to change the name of the scalar classes to the corresponding vector name.
The following example shows how to move from scalar code to vector code:
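The sketch below illustrates the idea in standalone C++, without using Arcane itself. `SimdRealSketch` is a hypothetical stand-in for a vector class such as Arcane::SimdReal, and `kineticEnergy` is an invented example function. The point is only that the body of the vector version is identical to the scalar one, with the type names changed.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a vector class: N reals handled together.
// This is NOT Arcane::SimdReal, only an illustration of the principle.
template <std::size_t N>
struct SimdRealSketch
{
  std::array<double, N> v{};

  SimdRealSketch() = default;
  // Broadcast one scalar to every lane.
  SimdRealSketch(double x) { v.fill(x); }

  // Lane-by-lane multiplication.
  friend SimdRealSketch operator*(SimdRealSketch a, const SimdRealSketch& b)
  {
    for (std::size_t i = 0; i < N; ++i)
      a.v[i] *= b.v[i];
    return a;
  }
};

// Scalar version.
double kineticEnergy(double mass, double velocity)
{
  return 0.5 * mass * velocity * velocity;
}

// Vector version: same body, only the type names change.
template <std::size_t N>
SimdRealSketch<N> kineticEnergy(SimdRealSketch<N> mass, SimdRealSketch<N> velocity)
{
  return 0.5 * mass * velocity * velocity;
}
```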
Vectorization works well as long as all elements of the vector perform the same operation. Things get complicated when this is no longer the case. In particular, anything that depends on a condition is difficult to vectorize. There are also cases where a specific operation must be performed for each element inside a vector loop. To handle this situation, it is possible to add sequential sections by iterating over the entities of an Arcane::SimdItem using the ENUMERATE_* macros. For example:
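The pattern of a sequential section inside a vector loop can be sketched in standalone C++ as follows. This does not use the Arcane macros: `Block`, `kBlockSize`, `scale` and `scaleAndClamp` are hypothetical names, and the vector width of 4 is an arbitrary assumption.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockSize = 4; // hypothetical vector width

using Block = std::array<double, kBlockSize>;

// Uniform work: the same multiplication on every lane of the block.
Block scale(const Block& x, double factor)
{
  Block r{};
  for (std::size_t i = 0; i < kBlockSize; ++i)
    r[i] = x[i] * factor;
  return r;
}

// Assumes values.size() is a multiple of kBlockSize.
std::vector<double> scaleAndClamp(const std::vector<double>& values, double factor)
{
  std::vector<double> out(values.size());
  for (std::size_t begin = 0; begin < values.size(); begin += kBlockSize) {
    Block x{};
    for (std::size_t i = 0; i < kBlockSize; ++i)
      x[i] = values[begin + i];

    // Vector section: identical operation on all lanes.
    Block y = scale(x, factor);

    // Sequential section: a per-element decision inside the vector loop.
    for (std::size_t i = 0; i < kBlockSize; ++i) {
      if (y[i] < 0.0)
        y[i] = 0.0;
      out[begin + i] = y[i];
    }
  }
  return out;
}
```

The conditional part is kept out of the uniform section, so only the part that really depends on each element is executed lane by lane.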
Finally, it is possible to know the number of reals in a vector register via the SimdReal::BLOCK_SIZE constant. This allows, for example, iterating over the elements of a vector register:
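A minimal standalone sketch of this kind of iteration is shown below. `SimdBlockSketch` is a hypothetical type that only mimics the presence of a `BLOCK_SIZE` constant; the value 4 and the indexing operator are assumptions for illustration, not the definition of Arcane::SimdReal.

```cpp
#include <array>

// Hypothetical vector type exposing its number of lanes as BLOCK_SIZE.
struct SimdBlockSketch
{
  static constexpr int BLOCK_SIZE = 4; // reals per register (assumed value)
  std::array<double, BLOCK_SIZE> v{};

  double& operator[](int i) { return v[i]; }
  double operator[](int i) const { return v[i]; }
};

// Iterates over every element of the register and sums them.
double horizontalSum(const SimdBlockSketch& x)
{
  double sum = 0.0;
  for (int i = 0; i < SimdBlockSketch::BLOCK_SIZE; ++i)
    sum += x[i];
  return sum;
}
```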
In general, and in particular on x64 processors, vectorization requires data in memory to be aligned more strictly than scalar types require. For SSE, AVX, and AVX512, the minimum alignment equals the size in bytes of the SIMD vector. For example, AVX uses 256-bit vectors, that is 32 bytes, so the minimum alignment is 32 bytes. To simplify vectorization, Arcane guarantees that the following types have the minimum alignment required for vectorization:
Because plain new/delete do not let the caller request a specific alignment (aligned forms of operator new were only added in C++17), Arcane provides the Arccore::AlignedMemoryAllocator class, which can be used with the Arcane::UniqueArray and Arcane::SharedArray classes to guarantee alignment. For example:
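The principle of an aligned allocator can be sketched with the standard library alone. The code below is not Arccore::AlignedMemoryAllocator: `AlignedAllocatorSketch` and `isAligned` are hypothetical names, the sketch is used with `std::vector` rather than the Arcane array classes, and it assumes a C++17 compiler for the aligned form of `operator new`.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator returning memory aligned on Alignment bytes.
// Alignment must be a power of two.
template <typename T, std::size_t Alignment>
struct AlignedAllocatorSketch
{
  using value_type = T;

  // Needed explicitly because of the non-type template parameter.
  template <typename U>
  struct rebind { using other = AlignedAllocatorSketch<U, Alignment>; };

  AlignedAllocatorSketch() = default;
  template <typename U>
  AlignedAllocatorSketch(const AlignedAllocatorSketch<U, Alignment>&) noexcept {}

  T* allocate(std::size_t n)
  {
    // C++17 aligned allocation function.
    return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Alignment)));
  }
  void deallocate(T* p, std::size_t) noexcept
  {
    ::operator delete(p, std::align_val_t(Alignment));
  }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocatorSketch<T, A>&, const AlignedAllocatorSketch<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocatorSketch<T, A>&, const AlignedAllocatorSketch<U, A>&) { return false; }

// True if the address p is a multiple of alignment.
bool isAligned(const void* p, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A container declared as `std::vector<double, AlignedAllocatorSketch<double, 64>>` then stores its elements at an address that is a multiple of 64 bytes, which covers the SSE, AVX and AVX512 requirements described above.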
Vectorization works well when the number of loop elements is a multiple of the SIMD vector size. If this is not the case, the last part of the loop needs special handling. To provide an identical mechanism for all vectorization types, Arcane duplicates the last valid value in the SIMD vector. For example, suppose the following code:
Here cells is a group of cells containing 11 elements. If we assume the vector size is 8, the previous loop performs two iterations. In the first iteration, simd_cell holds cells[0] through cells[7].
In the second iteration, since cells contains only 11 elements, the last valid value is repeated: simd_cell holds cells[8], cells[9] and cells[10], followed by five more copies of cells[10].
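The padding rule can be reproduced with a small standalone sketch that works on plain indices instead of Arcane cells. `IndexBlock`, `kWidth` and `makePaddedBlocks` are hypothetical names, and the width of 8 matches the assumption made in the example.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kWidth = 8; // vector size assumed in the example

using IndexBlock = std::array<std::size_t, kWidth>;

// Splits the indices [0, n) into blocks of kWidth entries. Lanes beyond the
// end of the range repeat the last valid index.
std::vector<IndexBlock> makePaddedBlocks(std::size_t n)
{
  std::vector<IndexBlock> blocks;
  for (std::size_t begin = 0; begin < n; begin += kWidth) {
    IndexBlock block{};
    for (std::size_t lane = 0; lane < kWidth; ++lane) {
      const std::size_t index = begin + lane;
      block[lane] = (index < n) ? index : n - 1;
    }
    blocks.push_back(block);
  }
  return blocks;
}
```

For n = 11 this produces two blocks: the indices 0 to 7, then 8, 9, 10 followed by five repetitions of 10.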
This mechanism works well as long as the operations performed are truly vectorizable. If this is not the case, it is possible to iterate only over the valid values as follows:
With the previous example, the inner loop performs only 3 iterations for the last part of cells (for the cells cells[8], cells[9] and cells[10]).
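The iteration over valid values only can be sketched on plain indices as follows. The names `kLanes`, `validLaneCount` and `visitValidLanes` are hypothetical and do not correspond to Arcane functions; the sketch only shows how the number of valid lanes of the last block is obtained.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 8; // vector size assumed in the example

// Number of valid lanes in the block starting at index begin, for a range
// of n elements. Assumes begin < n.
std::size_t validLaneCount(std::size_t begin, std::size_t n)
{
  return std::min(kLanes, n - begin);
}

// Returns the indices visited when the inner loop only covers valid lanes.
std::vector<std::size_t> visitValidLanes(std::size_t n)
{
  std::vector<std::size_t> visited;
  for (std::size_t begin = 0; begin < n; begin += kLanes) {
    const std::size_t nb_valid = validLaneCount(begin, n);
    // Padded lanes (index >= nb_valid) are skipped.
    for (std::size_t lane = 0; lane < nb_valid; ++lane)
      visited.push_back(begin + lane);
  }
  return visited;
}
```

For n = 11, the first block has 8 valid lanes and the second has 3, so each of the 11 elements is visited exactly once.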
The mathematical operations supported by Arcane's vector classes are defined in the SimdMathUtils.h file:
For the Arcane::SimdReal, Arcane::SimdReal2, and Arcane::SimdReal3 vector classes, Arcane provides the same operations as those available in MathUtils.h for the scalar versions, with the exception of min and max.
In version 2.2, Arcane only supports vectorization for x86 architecture processors.
For these processors, there are (currently) three generations of vectorization: SSE (128-bit registers), AVX (256-bit registers), and AVX512 (512-bit registers).
Depending on the platform, several mechanisms may be available. Intel processors are backward compatible: those that support AVX512 also support AVX and SSE, and those that support AVX also support SSE.
By default, Arcane uses the most advanced vectorization mechanism available. The Arcane::SimdInfo, Arcane::SimdReal, and Arcane::SimdReal3 types are therefore typedefs that depend on the platform.
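A platform-dependent default of this kind can be sketched with the macros that GCC and Clang predefine for the targeted instruction set (`__AVX512F__`, `__AVX__`, `__SSE2__`). This is not Arcane's actual implementation: `kSimdBytes`, `kRealsPerRegister` and `DefaultSimdRealSketch` are hypothetical names, and the sketch assumes double-precision reals.

```cpp
#include <cstddef>

// Pick the widest register size the compiler was told to target.
// These macros are predefined by GCC and Clang depending on the -m... or
// -march options; other compilers may use different conventions.
#if defined(__AVX512F__)
constexpr std::size_t kSimdBytes = 64; // 512-bit registers
#elif defined(__AVX__)
constexpr std::size_t kSimdBytes = 32; // 256-bit registers
#elif defined(__SSE2__)
constexpr std::size_t kSimdBytes = 16; // 128-bit registers
#else
constexpr std::size_t kSimdBytes = sizeof(double); // scalar fallback
#endif

// Number of double-precision reals per register.
constexpr std::size_t kRealsPerRegister = kSimdBytes / sizeof(double);

// Hypothetical default vector type: its alignment equals the register size.
struct alignas(kSimdBytes) DefaultSimdRealSketch
{
  double v[kRealsPerRegister];
};
```

The same source therefore yields a 2-, 4- or 8-lane type depending on the compilation options, without any change in the calling code.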
Arcane also defines macros indicating the available mechanisms: