In this chapter, an accelerator is a dedicated co-processor, distinct from the main processor, that is used to execute computation code. In the current version of Arcane, accelerators are GPGPUs.
The Arcane API for managing accelerators is inspired by libraries such as RAJA or Kokkos but is restricted to the specific needs of Arcane.
The current implementation supports only NVIDIA GPUs (via CUDA) and AMD GPUs (via ROCm) as accelerators.
The Arcane accelerator API meets the following objectives:
The operating principle is the execution of offloaded compute kernels. The code is executed by default on the CPU (the host) and certain parts of the calculation are offloaded to the accelerators. This offloading is done via specific calls.
To use accelerators, Arcane must have been compiled with CUDA or ROCm support. See the chapter Compilation for more information.
All types used for accelerator management are in the Arcane::Accelerator namespace. There are two components for managing accelerators:
The main classes for managing accelerators are:
There are two ways to use accelerators in Arcane:
To run a calculation on an accelerator, you must instantiate an execution queue. The RunQueue class manages such a queue. The makeQueue() function allows creating such a queue. Execution queues can be temporary or persistent but cannot be copied. The makeQueueRef() method allows creating a reference to a queue that can be copied.
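The difference between a queue that cannot be copied and a copyable reference to it can be pictured with plain C++. The sketch below is only an analogy: DemoQueue, DemoQueueRef and makeDemoQueueRef() are hypothetical names that are not part of Arcane, and std::shared_ptr is an assumption used to model reference semantics.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for an execution queue: it can be moved but not copied.
class DemoQueue
{
 public:
  explicit DemoQueue(int device_index)
  : m_device_index(device_index)
  {
    assert(device_index >= 0);
  }
  DemoQueue(const DemoQueue&) = delete;
  DemoQueue& operator=(const DemoQueue&) = delete;
  DemoQueue(DemoQueue&&) = default;
  DemoQueue& operator=(DemoQueue&&) = default;

  int deviceIndex() const { return m_device_index; }

 private:
  int m_device_index;
};

// Hypothetical copyable handle: every copy refers to the same queue.
using DemoQueueRef = std::shared_ptr<DemoQueue>;

inline DemoQueueRef makeDemoQueueRef(int device_index)
{
  return std::make_shared<DemoQueue>(device_index);
}

// Copies a handle and reports how many handles share the single queue.
inline long nbHandleAfterCopy()
{
  DemoQueueRef first = makeDemoQueueRef(0);
  DemoQueueRef second = first; // copies the handle, not the queue
  return second.use_count();
}
```

In this model, a handle can be stored in several objects while the underlying queue exists only once.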
Any module can retrieve an implementation of the IAcceleratorMng interface via the method AbstractModule::acceleratorMng(). The following code example shows how to use accelerators from an entry point:
It is possible to create multiple instances of the Runner object.
An instance of this class is associated with an execution policy whose possible values are given by the enumeration eExecutionPolicy. By default, the execution policy is eExecutionPolicy::Sequential, which means that the compute kernels will be executed sequentially.
It is also possible to automatically initialize an instance of this class based on command-line arguments:
Arcane provides integration to compile with accelerator support via CMake. Those who use another build system must manage this support similarly.
To be able to use compute kernels on an accelerator, you generally need to use a specific compiler. For example, the current implementation of Arcane via CUDA uses NVIDIA's nvcc compiler for this. This compiler is responsible for compiling the part associated with the accelerator. The part associated with the CPU is compiled with the same compiler as the rest of the code.
It is necessary to specify in the CMakeLists.txt that you want to use accelerators as well as the files that will be compiled for the accelerators. Only files using commands (RUNCOMMAND_LOOP or RUNCOMMAND_ENUMERATE) need to be compiled for the accelerators. For this, Arcane defines the following CMake functions:
If Arcane is compiled in a CUDA environment, the CMake variable ARCANE_HAS_CUDA is defined. If Arcane is compiled in a HIP/ROCm environment, then ARCANE_HAS_HIP is defined.
The choice of the default execution environment (IAcceleratorMng::runner()) is determined by the command line:
Once you have an instance of RunQueue, it is possible to create a command that can be offloaded to the accelerator. Commands are always loops that can take the following forms:
Chapter Using lambdas describes the syntax of these loops.
The following code example shows how to use accelerators from an entry point:
Accelerators generally have their own memory, which is different from the host's memory. It is therefore necessary to specify how the data will be used to manage potential transfers between memories. For this, Arcane provides a mechanism called a view, which allows specifying for a variable or an array whether it will be used as input, output, or both.
Arcane offers views on variables (VariableRef) and on the NumArray class (the page Usage of the NumArray class describes this class in more detail).
Regardless of the associated container, the declaration of views is the same and uses the methods viewIn(), viewOut() or viewInOut().
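To make the input/output distinction concrete, here is a minimal sketch in standard C++ of what a read-only view and a writable view over the same kind of container might look like. DemoInView, DemoOutView and the helper functions are hypothetical names chosen for illustration; they are not Arcane types, and the sketch does not model memory transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical read-only view: elements can be read but not modified.
template <typename T>
class DemoInView
{
 public:
  DemoInView(const T* data, std::size_t size) : m_data(data), m_size(size) {}
  const T& operator[](std::size_t i) const
  {
    assert(i < m_size);
    return m_data[i];
  }
  std::size_t size() const { return m_size; }

 private:
  const T* m_data;
  std::size_t m_size;
};

// Hypothetical output view: elements can be assigned through the view.
template <typename T>
class DemoOutView
{
 public:
  DemoOutView(T* data, std::size_t size) : m_data(data), m_size(size) {}
  T& operator[](std::size_t i) const
  {
    assert(i < m_size);
    return m_data[i];
  }
  std::size_t size() const { return m_size; }

 private:
  T* m_data;
  std::size_t m_size;
};

// Doubles every element of 'input' into a new array, using only the views.
inline std::vector<double> doubleAll(const std::vector<double>& input)
{
  std::vector<double> output(input.size());
  DemoInView<double> in(input.data(), input.size());
  DemoOutView<double> out(output.data(), output.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = 2.0 * in[i]; // writing 'in[i] = ...' would not compile
  return output;
}
```

Declaring the intent in the type lets the compiler reject accidental writes to input data, and gives a runtime the information it needs to decide which copies are required.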
By default, Arcane uses the allocator returned by MemoryUtils::getDefaultDataAllocator() for the NumArray type as well as for all variables (VariableRef), entity groups (ItemGroup) and connectivities.
When using accelerators, Arcane requires that this allocator allocates memory that is accessible both on the host and the accelerator. This means that the data corresponding to these objects is accessible both on the host (CPU) and on the accelerators. For this, Arcane uses unified memory (eMemoryResource::UnifiedMemory) by default.
With unified memory, the accelerator automatically manages potential memory transfers between the accelerator and the host. These transfers can be time-consuming if they are frequent, but if a piece of data is only used on the CPU or on the accelerator, there will be no memory transfers and thus performance will not be impacted.
Starting from version 3.14.12 of Arcane, it is possible to change the default memory resource used via the environment variable ARCANE_DEFAULT_DATA_MEMORY_RESOURCE. On accelerators where the memory eMemoryResource::Device is directly accessible from the host (for example MI250X, MI300A, GH200), this allows avoiding transfers that unified memory might cause.
In all cases, it is possible to specify a specific allocator for UniqueArray and NumArray via the methods MemoryUtils::getAllocator() or MemoryUtils::getAllocationOptions().
Arcane provides mechanisms for passing hints that help optimize this memory management. These mechanisms depend on the accelerator type and may not be available everywhere. They are accessible via the method Runner::setMemoryAdvice().
Starting from version 3.10 of Arcane and with NVIDIA accelerators, Arcane offers features to detect memory transfers between the CPU and the accelerator. The page Integration with CUPTI (Cuda Profiling Tools Interface) describes this functionality.
The following example shows how to modify the iteration range so that it does not start from zero:
Regardless of the macro (RUNCOMMAND_ENUMERATE(), RUNCOMMAND_LOOP(), ...) used for the loop, the code that follows the macro must be a C++11 lambda function. It is this lambda function that may be offloaded to the accelerator.
Arcane uses the operator<< to "send" the loop to a command (RunCommand), which allows writing the code similarly to a classic C++ loop (or an ENUMERATE_() loop in the case of mesh entities) with the following few modifications:
For example:
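As an illustration of the mechanism only, the following self-contained sketch shows how operator<< can be overloaded so that a lambda is "sent" to a command together with an iteration range. DemoCommand, DemoRange and DemoBoundLoop are hypothetical names and the loop simply runs on the host; this is not how Arcane's RunCommand or its macros are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical iteration range [begin, end); it does not have to start at zero.
struct DemoRange
{
  std::size_t begin;
  std::size_t end;
};

// Hypothetical command; in this sketch it carries no state.
struct DemoCommand
{
};

// Intermediate object that binds a command to a range.
struct DemoBoundLoop
{
  DemoCommand& command;
  DemoRange range;
};

// 'command << range' remembers the range to iterate over.
inline DemoBoundLoop operator<<(DemoCommand& command, DemoRange range)
{
  return DemoBoundLoop{ command, range };
}

// '(command << range) << lambda' calls the lambda once per index, on the host.
template <typename Lambda>
void operator<<(const DemoBoundLoop& loop, const Lambda& body)
{
  for (std::size_t i = loop.range.begin; i < loop.range.end; ++i)
    body(i);
}

// Adds two arrays element by element using the syntax above.
inline std::vector<double> addArrays(const std::vector<double>& a,
                                     const std::vector<double>& b)
{
  assert(a.size() == b.size());
  std::vector<double> c(a.size());
  const double* in_a = a.data();
  const double* in_b = b.data();
  double* out_c = c.data();
  DemoCommand command;
  // The lambda captures plain pointers by value, never the containers themselves.
  command << DemoRange{ 0, a.size() } << [=](std::size_t i) {
    out_c[i] = in_a[i] + in_b[i];
  };
  return c;
}
```

Because the loop body is an ordinary callable object, the same source code can be handed either to a host loop, as here, or to a backend that launches it elsewhere.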
When a compute kernel is offloaded to the accelerator, no other part of the code may access the memory associated with its views while the kernel is running; doing so may crash the program. In practice, this can only happen when the RunQueue is asynchronous. For example:
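The ordering problem can be modeled on the host with a queue that only stores the submitted kernels and runs them when a barrier is reached. This is a deliberately simplified sketch: DemoDeferredQueue and its methods are hypothetical names, not Arcane's API, and a real asynchronous queue may run the kernel concurrently, in which case an early access is a data race rather than a merely stale read.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical asynchronous queue: submit() only records the kernel,
// and the recorded kernels run when barrier() is called.
class DemoDeferredQueue
{
 public:
  void submit(std::function<void()> kernel)
  {
    assert(static_cast<bool>(kernel));
    m_pending.push_back(std::move(kernel));
  }
  void barrier()
  {
    for (std::function<void()>& kernel : m_pending)
      kernel();
    m_pending.clear();
  }

 private:
  std::vector<std::function<void()>> m_pending;
};

struct DemoObservation
{
  int before_barrier;
  int after_barrier;
};

// Reads the last element of the output array before and after the barrier.
inline DemoObservation observeLastValue()
{
  std::vector<int> data(4, 0);
  int* out = data.data();
  DemoDeferredQueue queue;
  queue.submit([out]() {
    for (int i = 0; i < 4; ++i)
      out[i] = i + 1;
  });
  DemoObservation obs;
  obs.before_barrier = data[3]; // too early: the kernel has not completed
  queue.barrier();
  obs.after_barrier = data[3]; // safe: the kernel has completed
  return obs;
}
```

The safe pattern is therefore to wait for the queue to finish before touching any memory that a submitted kernel reads or writes.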
The compilation mechanisms and memory management on accelerators impose restrictions on the use of classic C++ lambdas.
In a lambda intended to be offloaded to the accelerator, you can only call:
It is not possible to call external functions defined in other compilation units (for example, in other libraries).
You must not use a class field directly in a lambda: the field is accessed through the captured this pointer, which amounts to capturing it by reference. This will cause a crash due to invalid memory access on the accelerator. To avoid this problem, declare a local copy of the field you wish to use and use that copy within the lambda. In the following example, the function f1() will cause a crash while f2() will work correctly.
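The capture rule can be observed in standard C++ on the host, where nothing crashes but the difference is still visible. In the sketch below, DemoModule and its members are hypothetical; f1() and f2() follow the naming used in the text, and the capture of this is written explicitly, whereas a [=] lambda that uses a field captures it implicitly.

```cpp
#include <cassert>
#include <functional>

class DemoModule
{
 public:
  explicit DemoModule(double coef) : m_coef(coef) {}

  // f1(): the lambda reaches m_coef through the captured 'this' pointer.
  // On an accelerator, 'this' would point to host memory.
  std::function<double(double)> f1()
  {
    return [this](double x) { return m_coef * x; };
  }

  // f2(): the lambda owns a copy of the value and does not depend on 'this'.
  std::function<double(double)> f2()
  {
    const double coef = m_coef;
    return [coef](double x) { return coef * x; };
  }

  void setCoef(double coef) { m_coef = coef; }

 private:
  double m_coef;
};

// Builds the f1() lambda, then changes the field before calling it.
inline double valueSeenThroughThis()
{
  DemoModule module(2.0);
  std::function<double(double)> kernel = module.f1();
  module.setCoef(10.0);
  return kernel(1.0);
}

// Same scenario with the f2() lambda.
inline double valueSeenThroughCopy()
{
  DemoModule module(2.0);
  std::function<double(double)> kernel = module.f2();
  module.setCoef(10.0);
  return kernel(1.0);
}

inline void runCaptureDemo()
{
  assert(valueSeenThroughThis() == 10.0); // follows the object: captured via 'this'
  assert(valueSeenThroughCopy() == 2.0); // independent of the object: captured by value
}
```

The first lambda follows later changes to the object, which shows that it holds a pointer to it rather than the value; only the second one is self-contained.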
Starting from version 3.10, Arcane supports "Accelerator Aware" MPI libraries. In this case, the buffer used for variable synchronizations is allocated directly on the accelerator. If a variable is used on the accelerator, this avoids unnecessary copies between the host and the accelerator. Shared memory message exchange mode also supports this mechanism.
If problems occur, this support can be disabled by setting the environment variable ARCANE_DISABLE_ACCELERATOR_AWARE_MESSAGE_PASSING to a non-zero value.
When a subdomain is created, Arcane associates an instance of Runner with it (accessible via ISubDomain::acceleratorMng()). When a machine has multiple accelerators, Arcane by default chooses the first one in the list of available accelerators. This behavior can be changed by setting the environment variable ARCANE_ACCELERATOR_PARALLELMNG_RANK_FOR_DEVICE to a strictly positive value. This value is the modulus applied to the subdomain rank (returned by IParallelMng::commRank() of ISubDomain::parallelMng()) to obtain the accelerator index in the list of accelerators. For example, if this environment variable is 8, then the subdomain of rank N will be associated with the accelerator of index (N % 8). For this mechanism to work, the value of this environment variable must therefore not exceed the number of accelerators available on the machine.
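The rank-to-accelerator rule is simple modular arithmetic. The helper below is a hypothetical illustration of that rule, not a function provided by Arcane.

```cpp
#include <cassert>

// Hypothetical helper: index of the accelerator associated with a subdomain
// rank when the environment variable holds the value 'modulo'.
inline int deviceIndexForRank(int rank, int modulo)
{
  assert(rank >= 0);
  assert(modulo > 0); // the variable must be strictly positive
  return rank % modulo;
}
```

With a value of 8, ranks 0 to 7 map to accelerators 0 to 7, rank 8 maps back to accelerator 0, and rank 10 maps to accelerator 2.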
When multiple accelerators are available on the same machine, there is generally a "current" accelerator for each thread (for example, with CUDA it can be retrieved with the cudaGetDevice() function and changed with the cudaSetDevice() function). Memory allocated on the accelerator is allocated on this "current" accelerator and will not be available on the other accelerators. An instance of RunQueue is associated with a given accelerator, so you must ensure that the memory regions used by a command are accessible from that accelerator. If this is not the case, an error occurs during execution (for example, with CUDA, this is error 400, whose message is "invalid resource handle").
If the "current" accelerator has been changed, for example by a call to an external library, it can be set back to the one associated with a Runner by calling the method Runner::setAsCurrentDevice().
For performance reasons, mesh connectivity is accessed differently on the accelerator than on the CPU. Specifically, it is not possible to use classic entities (Cell, Node, ...). Instead, you must use local identifiers such as CellLocalId or NodeLocalId.
The UnstructuredMeshConnectivityView class allows access to connectivity information. It is possible to define an instance of this class and keep it during the calculation. To initialize the instance, you must call the method UnstructuredMeshConnectivityView::setMesh().
To access generic entity information, such as type or owner, you must use the ItemGenericInfoListView view.
The following example shows how to access cell nodes and mesh information. It iterates over all cells and calculates the barycenter for those that are in our subdomain and are hexahedrons.
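The computation itself can be sketched in standard C++ on a mock mesh that is addressed only through integer local identifiers. Everything in this sketch is an assumption made for illustration: DemoPoint3 and DemoMesh are hypothetical types, ownership is a plain rank array, and a hexahedron is recognized by its eight nodes rather than by an item type, so this is not the Arcane connectivity API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct DemoPoint3
{
  double x;
  double y;
  double z;
};

// Hypothetical mesh storage indexed by local identifiers.
struct DemoMesh
{
  std::vector<DemoPoint3> node_coords; // indexed by node local id
  std::vector<std::vector<int>> cell_nodes; // node local ids of each cell
  std::vector<int> cell_owner; // owning rank of each cell
};

// Computes the barycenter of the cells owned by 'my_rank' that have 8 nodes.
// The other cells keep a zero barycenter.
inline std::vector<DemoPoint3> ownHexaBarycenters(const DemoMesh& mesh, int my_rank)
{
  assert(mesh.cell_owner.size() == mesh.cell_nodes.size());
  std::vector<DemoPoint3> centers(mesh.cell_nodes.size(), DemoPoint3{ 0.0, 0.0, 0.0 });
  for (std::size_t cell = 0; cell < mesh.cell_nodes.size(); ++cell) {
    const std::vector<int>& nodes = mesh.cell_nodes[cell];
    if (mesh.cell_owner[cell] != my_rank || nodes.size() != 8)
      continue;
    DemoPoint3 sum{ 0.0, 0.0, 0.0 };
    for (int node : nodes) {
      const DemoPoint3& p = mesh.node_coords[static_cast<std::size_t>(node)];
      sum.x += p.x;
      sum.y += p.y;
      sum.z += p.z;
    }
    centers[cell] = DemoPoint3{ sum.x / 8.0, sum.y / 8.0, sum.z / 8.0 };
  }
  return centers;
}

// Builds a mesh made of a single unit cube owned by rank 0.
inline DemoMesh makeUnitCubeMesh()
{
  DemoMesh mesh;
  mesh.node_coords = {
    DemoPoint3{ 0.0, 0.0, 0.0 }, DemoPoint3{ 1.0, 0.0, 0.0 },
    DemoPoint3{ 1.0, 1.0, 0.0 }, DemoPoint3{ 0.0, 1.0, 0.0 },
    DemoPoint3{ 0.0, 0.0, 1.0 }, DemoPoint3{ 1.0, 0.0, 1.0 },
    DemoPoint3{ 1.0, 1.0, 1.0 }, DemoPoint3{ 0.0, 1.0, 1.0 }
  };
  mesh.cell_nodes.push_back(std::vector<int>{ 0, 1, 2, 3, 4, 5, 6, 7 });
  mesh.cell_owner.push_back(0);
  return mesh;
}
```

Working with integer identifiers and flat arrays, rather than with entity objects, is what makes this kind of loop suitable for offloading.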
The doAtomic method allows performing atomic operations. The supported operation types are defined by the eAtomicOperation enumeration. For example:
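As a host-side analogy of an atomic operation selected by an enumeration, the following sketch uses std::atomic and a compare-and-exchange loop. DemoAtomicOp and demoAtomic() are hypothetical names; the sketch shows neither the signature of doAtomic nor the actual values of eAtomicOperation.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical enumeration standing in for the kind of atomic operation.
enum class DemoAtomicOp
{
  Add,
  Min,
  Max
};

// Combines the current value with the new one according to the operation.
template <DemoAtomicOp Op>
int combine(int current, int value)
{
  if (Op == DemoAtomicOp::Add)
    return current + value;
  if (Op == DemoAtomicOp::Min)
    return std::min(current, value);
  return std::max(current, value);
}

// Applies the operation atomically and returns the value seen before the update.
template <DemoAtomicOp Op>
int demoAtomic(std::atomic<int>& target, int value)
{
  int observed = target.load();
  // On failure, compare_exchange_weak refreshes 'observed' and the loop retries.
  while (!target.compare_exchange_weak(observed, combine<Op>(observed, value))) {
  }
  return observed;
}

inline int sumWithAtomicAdd(const std::vector<int>& values)
{
  std::atomic<int> total(0);
  for (int v : values)
    demoAtomic<DemoAtomicOp::Add>(total, v);
  return total.load();
}

inline int maxWithAtomicMax(const std::vector<int>& values)
{
  assert(!values.empty());
  std::atomic<int> best(values.front());
  for (int v : values)
    demoAtomic<DemoAtomicOp::Max>(best, v);
  return best.load();
}
```

Atomic updates of this kind are what allow many loop iterations to accumulate into the same memory location without losing contributions.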
Arcane offers several classes for performing more advanced algorithms. On the accelerator, these algorithms generally use libraries provided by the vendor (CUB for NVIDIA and rocprim for AMD). The algorithms offered by Arcane therefore have the same limitations as the underlying vendor implementation.
The available classes are:
It is possible to use Arcane's accelerator mode without support for high-level objects such as meshes or subdomains.
In this mode, it is possible to use the Arcane accelerator API directly from the main() function, for example. To use this mode, simply use the class method ArcaneLauncher::createStandaloneAcceleratorMng() after initializing Arcane:
The launcher instance must remain valid for as long as you wish to use the accelerator API. It is therefore preferable to define it in the code's main(). The StandaloneAcceleratorMng class uses reference semantics, so a reference to the instance can be kept anywhere in the code if necessary.
The 'standalone_accelerator' example shows such usage. For example, the following code allows offloading the sum of two arrays a and b into an array c on the accelerator.