Arcane  4.1.12.0
User documentation
Utilization of Accelerators (GPU)

In this chapter, an accelerator is a dedicated co-processor, distinct from the main processor, that is used to execute the calculation code. In the current version of Arcane, accelerators are GPGPUs.

The Arcane API for managing accelerators is inspired by libraries such as RAJA or Kokkos but is restricted to the specific needs of Arcane.

Note
The Arcane accelerator API can be used independently of the mechanisms associated with simulation codes such as modules, mesh, or services. For an example of standalone operation, refer to chapter Standalone Accelerator Mode.

The current implementation supports only NVIDIA GPUs (via CUDA) and AMD GPUs (via ROCm) as accelerators.

The Arcane accelerator API meets the following objectives:

  • unify the behavior between sequential CPU, multi-threaded CPU, and accelerator.
  • have a single executable and be able to dynamically choose where the code will be executed: CPU or accelerator (or both at once).
  • have source code independent of the compiler, so we do not use mechanisms such as #pragma as in OpenMP or OpenACC standards.
Note
If you wish to use Arcane on both GPU and CPU in a CUDA environment, it is strongly recommended to use clang rather than nvcc as the compiler, because nvcc generates slower code on the CPU side. This is due to the use of std::function to encapsulate the lambdas used in Arcane (see New Compiler Features in CUDA 8 for more information).

The operating principle is the execution of offloaded compute kernels. The code is executed by default on the CPU (the host) and certain parts of the calculation are offloaded to the accelerators. This offloading is done via specific calls.

To use the accelerators, it is necessary to have compiled Arcane with CUDA or ROCm. More information is in chapter Compilation.

Usage in Arcane

All types used for accelerator management are in the Arcane::Accelerator namespace. There are two components for managing accelerators:

  • arcane_accelerator_core whose header files are included via #include <arcane/accelerator/core>. This component contains classes independent of the accelerator type.
  • arcane_accelerator whose header files are included via #include <arcane/accelerator>. This component contains classes that allow offloading compute kernels to a specific accelerator.

The main classes for managing accelerators are:

  • IAcceleratorMng, which gives access to the default execution environment.
  • Runner, which represents an execution environment.
  • RunQueue, which represents an execution queue.
  • RunCommand, which represents a command (a compute kernel) associated with an execution queue.

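There are two ways to use accelerators in Arcane: through the IAcceleratorMng instance that Arcane provides (see Usage in modules), or through a Runner instance that you create yourself (see Specific Runner Instance).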

To run a calculation on an accelerator, you must instantiate an execution queue, which is represented by the RunQueue class. The makeQueue() function creates such a queue. Execution queues can be temporary or persistent but cannot be copied. If you need a handle that can be copied, use makeQueueRef(), which creates a reference to a queue.

Note
By default, creating a RunQueue from a Runner is not thread-safe, for performance reasons. If you want to create execution queues concurrently from the same Runner instance, you must first call the method Runner::setConcurrentQueueCreation(true).

Usage in modules

Any module can retrieve an implementation of the IAcceleratorMng interface via the method AbstractModule::acceleratorMng(). The following code example shows how to use accelerators from an entry point:

// Files to always include
#include "arcane/accelerator/core/IAcceleratorMng.h"
#include "arcane/accelerator/core/RunQueue.h"
// File to include to have RUNCOMMAND_ENUMERATE
#include "arcane/accelerator/RunCommandEnumerate.h"
// File to include to have RUNCOMMAND_LOOP
#include "arcane/accelerator/RunCommandLoop.h"
using namespace Arcane;
using namespace Arcane::Accelerator;
class MyModule
: public Arcane::BasicModule
{
 public:
  void myEntryPoint()
  {
    // Queue of the default execution environment
    RunQueue queue = acceleratorMng()->queue();
    // Loop over cells offloaded to accelerator
    auto command1 = makeCommand(queue);
    command1 << RUNCOMMAND_ENUMERATE(Cell,vi,allCells()){
    };
    // Classic 1D loop offloaded to accelerator
    auto command2 = makeCommand(queue);
    command2 << RUNCOMMAND_LOOP1(iter,5){
    };
  }
};

Specific Runner Instance

It is possible to create multiple instances of the Runner object.

An instance of this class is associated with an execution policy whose possible values are given by the enumeration eExecutionPolicy. By default, the execution policy is eExecutionPolicy::Sequential, which means that the compute kernels will be executed sequentially.

Note
When creating an instance of Runner on an accelerator, it is possible to specify an accelerator other than the default accelerator (if multiple are available). This significantly complicates memory management. Chapter Multi-accelerator Management explains how to handle this.

It is also possible to automatically initialize an instance of this class based on command-line arguments:

#include "arcane/accelerator/RunQueue.h"
using namespace Arcane;
using namespace Arcane::Accelerator;
Runner runner;
ITraceMng* tm = ...;
IApplication* app = ...;
initializeRunner(runner,tm,app->acceleratorRuntimeInitialisationInfo());

Compilation

Arcane provides CMake integration for compiling with accelerator support. Projects that use another build system must provide equivalent support themselves.

To be able to use compute kernels on an accelerator, you generally need to use a specific compiler. For example, the current implementation of Arcane via CUDA uses NVIDIA's nvcc compiler for this. This compiler is responsible for compiling the part associated with the accelerator. The part associated with the CPU is compiled with the same compiler as the rest of the code.

It is necessary to specify in the CMakeLists.txt that you want to use accelerators as well as the files that will be compiled for the accelerators. Only files using commands (RUNCOMMAND_LOOP or RUNCOMMAND_ENUMERATE) need to be compiled for the accelerators. For this, Arcane defines the following CMake functions:

  • arcane_accelerator_enable() which must be called before other functions to detect the compiler environment for the accelerator
  • arcane_accelerator_add_source_files(file1.cc [file2.cc] ...) to indicate the source files that must be compiled on accelerators
  • arcane_accelerator_add_to_target(mytarget) to indicate that the target mytarget requires the accelerator environment.

If Arcane is compiled in a CUDA environment, the CMake variable ARCANE_HAS_CUDA is defined. If Arcane is compiled in a HIP/ROCm environment, then ARCANE_HAS_HIP is defined.

Execution

The choice of the default execution environment (IAcceleratorMng::runner()) is determined by the command line:

  • If the AcceleratorRuntime option is specified, that runtime is used. Currently, the only possible values are cuda or hip. For example:
    MyExec -A,AcceleratorRuntime=cuda data.arc
  • Otherwise, if multi-threading is enabled via the -T option (see Launching a Calculation), then the compute kernels are distributed across multiple threads,
  • Otherwise, the compute kernels are executed sequentially.
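
The priority order of these three rules can be summarized by the following sketch. It is written in plain C++ and does not use the Arcane API: the enumeration SketchPolicy and the function choosePolicy() are hypothetical names introduced for illustration only (the enumeration actually used by Arcane is eExecutionPolicy).

```cpp
#include <string>

// Hypothetical enumeration for this sketch (not part of Arcane).
enum class SketchPolicy { Sequential, Thread, Accelerator };

// Applies the three rules above in priority order.
// 'accelerator_runtime' is the value of the AcceleratorRuntime option
// (empty if the option is absent); 'multi_threading' tells whether
// multi-threading was requested with the -T option.
SketchPolicy choosePolicy(const std::string& accelerator_runtime, bool multi_threading)
{
  if (!accelerator_runtime.empty())
    return SketchPolicy::Accelerator; // for example "cuda" or "hip"
  if (multi_threading)
    return SketchPolicy::Thread;
  return SketchPolicy::Sequential;
}
```

With this ordering, an accelerator runtime given on the command line takes precedence even when multi-threading is also requested.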

Compute Kernels (RunCommand)

Once you have an instance of RunQueue, you can create a command that can be offloaded to the accelerator. Commands are always loops, for example a loop over mesh entities (RUNCOMMAND_ENUMERATE) or a classic loop over an index range (RUNCOMMAND_LOOP).

Chapter Using lambdas describes the syntax of these loops.

The following code example shows how to use accelerators from an entry point:

// Files to always include
#include "arcane/accelerator/core/IAcceleratorMng.h"
#include "arcane/accelerator/core/RunQueue.h"
// File to include to have RUNCOMMAND_ENUMERATE
#include "arcane/accelerator/RunCommandEnumerate.h"
// File to include to have RUNCOMMAND_LOOP
#include "arcane/accelerator/RunCommandLoop.h"
using namespace Arcane;
using namespace Arcane::Accelerator;
class MyModule
{
 public:
  void myEntryPoint()
  {
    RunQueue queue = ...;
    // Loop over cells offloaded to accelerator
    auto command1 = makeCommand(queue);
    command1 << RUNCOMMAND_ENUMERATE(Cell,vi,allCells()){
    };
    // Classic 1D loop offloaded to accelerator
    auto command2 = makeCommand(queue);
    command2 << RUNCOMMAND_LOOP1(iter,5){
    };
  }
};

Usage of Views

Accelerators generally have their own memory, which is different from the host's memory. It is therefore necessary to specify how the data will be used to manage potential transfers between memories. For this, Arcane provides a mechanism called a view, which allows specifying for a variable or an array whether it will be used as input, output, or both.

Warning
A view is a TEMPORARY object. It is always associated with a command (RunCommand) and a container (Arcane variable or array), and it must not be used after the associated command has finished or after the associated container has been modified.

Arcane offers views on variables (VariableRef) and on the NumArray class (the page Usage of the NumArray class describes this class in more detail).

Regardless of the associated container, the declaration of views is the same and uses the methods viewIn(), viewOut() or viewInOut().

// To have NumArray
#include "arcane/utils/NumArray.h"
// To have views on variables
#include "arcane/accelerator/VariableViews.h"
// To have views on NumArray
#include "arcane/accelerator/NumArrayViews.h"
// 1D arrays
NumArray<Real,MDDim1> a;
NumArray<Real,MDDim1> b;
// 1D variable on cells
VariableCellReal var_c = ...;
// 'command' is a RunCommand previously created with makeCommand()
// Input view (read-only)
auto in_a = viewIn(command,a);
// Input/output view
auto inout_b = viewInOut(command,b);
// Output view (write-only) on the variable 'var_c'
auto out_c = viewOut(command,var_c);

Memory Management of Data Managed by Arcane

By default, Arcane uses the allocator returned by MeshUtils::getDefaultDataAllocator() for the NumArray type as well as all variables (VariableRef), entity groups (ItemGroup) and connectivities.

When using accelerators, Arcane requires that this allocator allocates memory that is accessible both on the host and the accelerator. This means that the data corresponding to these objects is accessible both on the host (CPU) and on the accelerators. For this, Arcane uses unified memory (eMemoryResource::UnifiedMemory) by default.

With unified memory, the accelerator automatically manages potential memory transfers between the accelerator and the host. These transfers can be time-consuming if they are frequent, but if a piece of data is only used on the CPU or on the accelerator, there will be no memory transfers and thus performance will not be impacted.

Starting from version 3.14.12 of Arcane, it is possible to change the default memory resource used via the environment variable ARCANE_DEFAULT_DATA_MEMORY_RESOURCE. On accelerators where the memory eMemoryResource::Device is directly accessible from the host (for example MI250X, MI300A, GH200), this allows avoiding transfers that unified memory might cause.

In all cases, it is possible to specify a specific allocator for UniqueArray and NumArray via the methods MemoryUtils::getAllocator() or MemoryUtils::getAllocationOptions().

Arcane provides mechanisms for supplying hints that help optimize this memory management. These mechanisms depend on the accelerator type and may not be available everywhere. They are accessible via the method Runner::setMemoryAdvice().

Starting from version 3.10 of Arcane and with NVIDIA accelerators, Arcane offers features to detect memory transfers between the CPU and the accelerator. The page Integration with CUPTI (Cuda Profiling Tools Interface) describes this functionality.

Example of using a complex loop

The following example shows how to modify the iteration range so that it does not start from zero:

using namespace Arcane;
using namespace Arcane::Accelerator;
{
  auto queue = makeQueue(runner);
  auto command = makeCommand(queue);
  auto out_t1 = viewOut(command,t1);
  Int64 base = 300;
  Int64 s1 = 400;
  auto b = makeLoopRanges({base,s1},n2,n3,n4);
  command << RUNCOMMAND_LOOP(iter,b)
  {
    auto [i, j, k, l] = iter();
    out_t1(i,j,k,l) = _getValue(i,j,k,l);
  };
}
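
To make the effect of a range that does not start at zero concrete, here is a minimal sketch in plain C++ that does not use the Arcane API. It assumes that the pair {base,s1} is interpreted as (lower bound, number of iterations), so that the first index runs over [300,700[. This interpretation is an assumption and should be checked against the reference documentation of makeLoopRanges(). The names SketchRange and visitedIndices() are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one dimension of an iteration range.
// ASSUMPTION: a range is described by its lower bound and its size.
struct SketchRange
{
  std::int64_t lower; // first index visited
  std::int64_t size;  // number of indices visited
};

// Returns the indices visited for this dimension:
// lower, lower+1, ..., lower+size-1.
std::vector<std::int64_t> visitedIndices(SketchRange r)
{
  std::vector<std::int64_t> out;
  for (std::int64_t i = r.lower; i < r.lower + r.size; ++i)
    out.push_back(i);
  return out;
}
```

Under that assumption, visitedIndices({300, 400}) contains 400 indices, from 300 to 699.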

Using lambdas

Regardless of the macro used for the loop (RUNCOMMAND_ENUMERATE(), RUNCOMMAND_LOOP(), ...), the code that follows it must be the body of a C++11 lambda function. It is this lambda function that will eventually be offloaded to the accelerator.

Arcane uses the operator<< to "send" the loop to a command (RunCommand), which allows writing the code similarly to a classic C++ loop (or an ENUMERATE_() loop in the case of mesh entities) with the following few modifications:

  • curly braces ({ and }) are mandatory
  • a ; must be added after the last brace.
  • the body of a lambda is a function, not a loop. Consequently, it is not possible to use keywords such as continue or break. The keyword return is available and therefore will have the same effect as continue in a loop.

For example:

// 1D loop of 'nb_value' with 'iter' the iterator
command << RUNCOMMAND_LOOP1(iter,nb_value)
{
  // Code executed on accelerator
};
// Loop over the cells of the group 'my_group' with 'cid' the index of
// the current cell (of type Arcane::CellLocalId)
command << RUNCOMMAND_ENUMERATE(Cell,cid,my_group)
{
  // Code executed on accelerator
};
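
The remark that return plays the role of continue can be checked with standard C++ alone. The sketch below does not use Arcane: forEachIndex() is a hypothetical stand-in for a loop macro that calls a lambda once per index, and doubleEvenEntries() is a hypothetical example body.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loop macro: calls 'body(i)' for i in [0,n[.
template <typename Body>
void forEachIndex(std::size_t n, Body body)
{
  for (std::size_t i = 0; i < n; ++i)
    body(i);
}

// Doubles the even entries of 'values' and leaves the odd ones unchanged.
std::vector<int> doubleEvenEntries(std::vector<int> values)
{
  int* data = values.data();
  forEachIndex(values.size(), [=](std::size_t i) {
    if (data[i] % 2 != 0)
      return; // leaves the lambda for this index only, like 'continue'
    data[i] *= 2;
  });
  return values;
}
```

The return statement ends only the current call of the lambda, so the remaining indices are still processed. There is no equivalent of break: every index of the range is always visited.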

While a compute kernel is running on the accelerator, you must not access the memory associated with its views from another part of the code, otherwise the program may crash. In practice, this can only happen when the RunQueue is asynchronous. For example:

#include "arcane/accelerator/Views.h"
using namespace Arcane::Accelerator;
queue.setAsync(true);
auto in_a = viewIn(command,a);
auto out_b = viewOut(command,b);
// Copy A into B
command << RUNCOMMAND_LOOP1(iter,nb_value)
{
  auto [i] = iter();
  out_b(i) = in_a(i);
};
// The command is running as long as the barrier() method
// has not been called
// HERE you MUST NOT use 'a' or 'b' or 'in_a' or 'out_b'
queue.barrier();
// HERE you can use 'a' or 'b' (BUT NOT 'in_a' or 'out_b' because the
// command is finished)

Limitation of C++ lambdas on accelerators

The compilation mechanisms and memory management on accelerators impose restrictions on the use of classic C++ lambdas.

Calling other functions in lambdas

In a lambda intended to be offloaded to the accelerator, you can only call:

  • class methods that are public
  • functions that are inline
  • functions or methods that have the ARCCORE_HOST_DEVICE or ARCCORE_DEVICE attribute or constexpr methods

It is not possible to call external functions defined in other compilation units (for example, in other libraries).

Using fields of a class instance

You must not use a class field directly in a lambda, because the field is not copied: it is reached through the instance, which is captured by reference. This causes a crash due to invalid memory access on the accelerator. To avoid this problem, declare a local copy of the field you wish to use and use that copy in the lambda. In the following example, the function f1() will cause a crash while f2() will work correctly.

class A
{
 public:
  void f1();
  void f2();
  int my_value;
};
void A::f1()
{
  // Output view: the lambda writes into 'a'
  auto out_a = viewOut(command,a);
  command << RUNCOMMAND_LOOP1(iter,100){
    out_a(iter) = my_value+5; // BAD !!
  };
}
void A::f2()
{
  auto out_a = viewOut(command,a);
  // Local copy of the field, captured by value by the lambda
  int v = my_value;
  command << RUNCOMMAND_LOOP1(iter,100){
    out_a(iter) = v+5; // GOOD !!
  };
}
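
The difference between the two functions can be observed on the host with standard C++ only. In the following sketch, which does not use Arcane, SketchA and demoCapture() are hypothetical names. A lambda that uses a field reads it through the this pointer each time it is called, whereas a lambda that uses a local copy owns its value.

```cpp
#include <functional>
#include <utility>

struct SketchA
{
  int my_value = 1;

  // The lambda stores the 'this' pointer: 'my_value' is read through
  // that pointer when the lambda is called.
  std::function<int()> viaThis()
  {
    return [this] { return my_value + 5; };
  }

  // The lambda stores its own copy of the value, taken at creation time.
  std::function<int()> viaLocalCopy()
  {
    int v = my_value;
    return [v] { return v + 5; };
  }
};

// Returns the results of both lambdas after the field has been changed.
std::pair<int, int> demoCapture()
{
  SketchA a;
  auto f = a.viaThis();
  auto g = a.viaLocalCopy();
  a.my_value = 10; // modified after both lambdas were created
  return { f(), g() };
}
```

The first lambda sees the modified field, which shows that it dereferences the instance. On an accelerator, that instance lives in host memory, hence the invalid access. The second lambda is unaffected because it carries its own value.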

Using the message exchange mechanism

Starting from version 3.10, Arcane supports "Accelerator Aware" MPI libraries. In this case, the buffer used for variable synchronizations is allocated directly on the accelerator. If a variable is used on the accelerator, this avoids unnecessary copies between the host and the accelerator. Shared memory message exchange mode also supports this mechanism.

If problems occur, this support can be disabled by setting the environment variable ARCANE_DISABLE_ACCELERATOR_AWARE_MESSAGE_PASSING to a non-zero value.

Multi-accelerator Management

Arcane associates an instance of Runner (accessible via ISubDomain::acceleratorMng()) with each subdomain when it is created. When a machine has multiple accelerators, Arcane by default chooses the first one in the list of available accelerators. This behavior can be changed by setting the environment variable ARCANE_ACCELERATOR_PARALLELMNG_RANK_FOR_DEVICE to a strictly positive value. This value is the modulus applied to the subdomain rank (returned by IParallelMng::commRank() of ISubDomain::parallelMng()) to obtain the accelerator index in the list of accelerators. For example, if this environment variable is 8, then the subdomain of rank N will be associated with the accelerator of index (N % 8). For this mechanism to work, the value of this environment variable must therefore not exceed the number of accelerators available on the machine.
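
The mapping between ranks and accelerators can be sketched as follows. This is plain C++ that does not call Arcane. The helpers deviceIndexForRank() and isModuloValid() are hypothetical and only restate the arithmetic: the indices produced range from 0 to modulo-1, so each of them designates an existing accelerator only if the modulus is at most the number of accelerators.

```cpp
#include <cassert>

// Hypothetical helper: index of the accelerator associated with the
// subdomain of rank 'rank' when the environment variable is 'modulo'.
int deviceIndexForRank(int rank, int modulo)
{
  assert(rank >= 0 && modulo > 0);
  return rank % modulo;
}

// Hypothetical helper: true if every index in [0,modulo[ designates
// one of the 'nb_device' accelerators of the machine.
bool isModuloValid(int modulo, int nb_device)
{
  return modulo > 0 && modulo <= nb_device;
}
```

For example, with a value of 8, the subdomain of rank 10 uses the accelerator of index 2, and the setting is only usable on a machine with at least 8 accelerators.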

Memory Management

When multiple accelerators are available on the same machine, there is generally a "current" accelerator for each thread (for example, with CUDA it can be retrieved with the cudaGetDevice() function and changed with the cudaSetDevice() function). Memory allocated on the accelerator is allocated on this "current" accelerator, and this memory will not be available on other accelerators. An instance of RunQueue is associated with a given accelerator, so you must ensure that the memory regions used by a command are accessible from that accelerator. If this is not the case, an error occurs during execution (for example, with CUDA, this is error 400, whose message is "invalid resource handle").

If the "current" accelerator has been changed, for example by a call to an external library, it can be set back to the accelerator associated with a Runner by calling the method Runner::setAsCurrentDevice().

Managing Connectivity and Entity Information

Accessing mesh connectivity is done differently on the accelerator than on the CPU for performance reasons. Specifically, it is not possible to use classic entities (Cell, Node, ...). Instead, you must use local identifiers such as CellLocalId or NodeLocalId.

The UnstructuredMeshConnectivityView class allows access to connectivity information. It is possible to define an instance of this class and keep it during the calculation. To initialize the instance, you must call the method UnstructuredMeshConnectivityView::setMesh().

Warning
Like all views, the instance is invalidated when the mesh changes. Therefore, you must call UnstructuredMeshConnectivityView::setMesh() again after modifying the mesh.

To access generic entity information, such as type or owner, you must use the ItemGenericInfoListView view.

The following example shows how to access cell nodes and mesh information. It iterates over all cells and calculates the barycenter for those that are in our subdomain and are hexahedrons.

class TestConnectivity
{
 public:
  TestConnectivity(IMesh* mesh, const RunQueue& queue)
  : m_mesh(mesh)
  , m_queue(queue)
  , m_cells_center(VariableBuildInfo(mesh, "CellsCenterTest"))
  {
    m_connectivity_view.setMesh(mesh);
  }
 public:
  void computeCenterForOwnHexa()
  {
    VariableNodeReal3& nodes_coord_var(m_mesh->nodesCoordinates());
    auto command = makeCommand(m_queue);
    auto in_node_coord = viewIn(command, nodes_coord_var);
    auto out_cells_center = viewOut(command, m_cells_center);
    // Cell->Node connectivity
    Arcane::IndexedCellNodeConnectivityView cnc = m_connectivity_view.cellNode();
    // Generic information for cells
    Arcane::ItemGenericInfoListView cells_infos(m_mesh->cellFamily());
    command << RUNCOMMAND_ENUMERATE (Cell, cid, m_mesh->allCells())
    {
      // Skip cells that do not belong to our subdomain
      if (!cells_infos.isOwn(cid))
        return;
      // Skip cells that are not hexahedrons
      if (cells_infos.typeId(cid) != IT_Hexaedron8)
        return;
      Real3 center;
      // Iterate on nodes of Cell 'cid'
      for (NodeLocalId node : cnc.nodes(cid))
        center += in_node_coord[node];
      // A hexahedron has 8 nodes
      out_cells_center[cid] = center / 8.0;
    };
  }
 private:
  Arcane::IMesh* m_mesh = nullptr;
  Arcane::Accelerator::RunQueue m_queue;
  Arcane::UnstructuredMeshConnectivityView m_connectivity_view;
  Arcane::VariableCellReal3 m_cells_center;
};

Atomic Operations

The doAtomic method allows performing atomic operations. The supported operation types are defined by the eAtomicOperation enumeration. For example:

using namespace Arcane;
namespace ax = Arcane::Accelerator;
auto command = makeCommand(queue);
auto inout_a = viewInOut(command, v_sum);
Real v_to_add = 2.1;
constexpr auto Add = ax::eAtomicOperation::Add;
command << RUNCOMMAND_LOOP1(iter, 100)
{
  // atomically add 'v_to_add' to 'inout_a(iter)'
  ax::doAtomic<Add>(inout_a(iter), v_to_add);
};

Advanced Algorithms: Reductions, Scan, Filtering, Partitioning, and Sorting

Arcane offers several classes for performing more advanced algorithms. On the accelerator, these algorithms generally use libraries provided by the vendor (CUB for NVIDIA and rocprim for AMD). The algorithms offered by Arcane therefore have the same limitations as the underlying vendor implementation.

The available classes are:

Standalone Accelerator Mode

It is possible to use Arcane's accelerator mode without support for high-level objects such as meshes or subdomains.

In this mode, it is possible to use the Arcane accelerator API directly from the main() function, for example. To use this mode, simply use the class method ArcaneLauncher::createStandaloneAcceleratorMng() after initializing Arcane:

Arcane::CommandLineArguments args(&argc, &argv);
Arcane::ArcaneLauncher::init(args);
Arcane::StandaloneAcceleratorMng launcher(Arcane::ArcaneLauncher::createStandaloneAcceleratorMng());

The launcher instance must remain valid as long as you wish to use the accelerator API. It is therefore preferable to define it in the code's main(). The StandaloneAcceleratorMng class uses reference semantics, so it is possible to keep a reference to the instance anywhere in the code if necessary.

The 'standalone_accelerator' example shows such usage. For example, the following code allows offloading the sum of two arrays a and b into an array c on the accelerator.

#include <arcane/launcher/ArcaneLauncher.h>
#include "arcane/utils/NumArray.h"
#include "arcane/utils/Exception.h"
#include "arcane/accelerator/core/IAcceleratorMng.h"
#include "arcane/accelerator/NumArrayViews.h"
#include "arcane/accelerator/RunQueue.h"
#include "arcane/accelerator/RunCommandLoop.h"
#include <iostream>
void _testStandaloneLauncher()
{
  using namespace Arcane;
  // Create a standalone instance of Arcane::IAcceleratorMng.
  // IMPORTANT: this instance must remain valid during
  // the whole execution of the program.
  StandaloneAcceleratorMng launcher(ArcaneLauncher::createStandaloneAcceleratorMng());
  Arcane::IAcceleratorMng* acc_mng = launcher.acceleratorMng();
  constexpr int nb_value = 10000;
  // Test the sum of two arrays 'a' and 'b' into an array 'c'.
  // Define two 1D arrays 'a' and 'b' and initialize them on the CPU
  NumArray<Int64, MDDim1> a(nb_value);
  NumArray<Int64, MDDim1> b(nb_value);
  for (int i = 0; i < nb_value; ++i) {
    a(i) = i + 2;
    b(i) = i + 3;
  }
  // Define the 1D array 'c' that will contain the sum of 'a' and 'b'
  NumArray<Int64, MDDim1> c(nb_value);
  {
    // Compute kernel offloaded to the accelerator.
    auto command = makeCommand(acc_mng->defaultQueue());
    // Declare 'a' and 'b' as inputs and 'c' as output
    auto in_a = viewIn(command, a);
    auto in_b = viewIn(command, b);
    auto out_c = viewOut(command, c);
    // Compute the sum on the accelerator
    command << RUNCOMMAND_LOOP1(iter, nb_value)
    {
      out_c(iter) = in_a(iter) + in_b(iter);
    };
  }
  // Check the result
  Int64 total = 0;
  for (int i = 0; i < nb_value; ++i)
    total += c(i);
  std::cout << "TOTAL=" << total << "\n";
  Int64 expected_total = 100040000;
  if (total != expected_total)
    ARCANE_FATAL("Bad value for sum={0} (expected={1})", total, expected_total);
}
int main(int argc, char* argv[])
{
  auto func = [&]
  {
    // Initialize Arcane before using the accelerator API
    Arcane::CommandLineArguments args(&argc, &argv);
    Arcane::ArcaneLauncher::init(args);
    _testStandaloneLauncher();
  };
  // Run 'func' and convert any exception into a return code
  return Arcane::arcaneCallFunctionAndCatchException(func);
}
