Skip to content

Latest commit

 

History

History
955 lines (717 loc) · 36.4 KB

utilization.md

File metadata and controls

955 lines (717 loc) · 36.4 KB
title
Utilisation cheat sheet for Kokkos

Kokkos utilization cheat sheet

Warning Only for Kokkos 4.2 and more, for older verison look at the doc.

  1. title: Utilisation cheat sheet for Kokkos
  2. Header
  3. Initialization
    1. Initialize and finalize
    2. Scope guard
  4. Kokkos concepts
    1. Execution spaces
    2. Memory spaces
      1. Generic memory spaces
      2. Specific memory spaces
  5. Memory management
    1. View
      1. Create
      2. Manage 1. Resize and preserve content 2. Reallocate and do not preserve content
    2. Memory Layouts
    3. Memory trait
    4. Deep copy
    5. Mirror view
      1. Create and always allocate on host
      2. Create and allocate on host if source view is not in host space
      3. Create, allocate and synchronize if source view is not in same space as destination view
    6. Subview
    7. Scatter view (experimental)
      1. Specific header
      2. Create
      3. Scatter operation
      4. Scatter
      5. Compute
      6. Gather
  6. Parallelism patterns
    1. For loop
    2. Reduction
    3. Fences
      1. Global fence
      2. Execution space fence
      3. Team barrier
  7. Execution policy
    1. Create
    2. Ranges
      1. One-dimensional range
      2. Multi-dimensional (dimension 2)
    3. Hierarchical parallelism
      1. Team policy
      2. Team vector level (2-level hierarchy)
        1. One-dimensional range
        2. Multi-dimensional range (dimension 2)
      3. Team thread vector level (3-level hierarchy)
        1. One-dimensional range
        2. Multi-dimensional range (dimension 2)
  8. Scratch memory
    1. Scratch memory space
    2. Create and populate
  9. Atomics
    1. Atomic operations
    2. Atomic exchanges
  10. Mathematics
    1. Math functions
    2. Complex numbers
      1. Create
      2. Manage
  11. Utilities
    1. Code interruption
    2. Print inside a kernel
    3. Timer
      1. Create
      2. Manage
    4. Manage parallel environment
  12. Macros
    1. Essential macros
    2. Extra macros

Header

#include <Kokkos_Core.hpp>

Initialization

Initialize and finalize

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    { /* ... */ }
    Kokkos::finalize();
    return 0;
}

Scope guard

int main(int argc, char* argv[]) {
    Kokkos::ScopeGuard kokkos(argc, argv);
    /* ... */
    return 0;
}

Kokkos concepts

Execution spaces

Execution space Device backend Host backend
Kokkos::DefaultExecutionSpace On device On host
Kokkos::DefaultHostExecutionSpace On host On host

Doc https://kokkos.org/kokkos-core-wiki/API/core/execution_spaces.html

Memory spaces

Doc https://kokkos.org/kokkos-core-wiki/API/core/memory_spaces.html

Generic memory spaces

Memory space Device backend Host backend
Kokkos::DefaultExecutionSpace::memory_space On device On host
Kokkos::DefaultHostExecutionSpace::memory_space On host On host

Specific memory spaces

Memory space Description
Kokkos::HostSpace Accessible by the host but not directly by the device
Kokkos::SharedSpace Accessible by the host and the device; copy managed by the driver
Kokkos::ScratchMemorySpace Accessible by the team or the thread that created it and nothing else
Examples
// Host space
Kokkos::View<double*, Kokkos::HostSpace> hostView("hostView", numberOfElements);

// Shared space
Kokkos::View<double*, Kokkos::SharedSpace> sharedView("sharedView", numberOfElements);

// Scratch memory space
Kokkos::parallel_for(
    Kokkos::TeamPolicy<>(leagueSize, teamSize),
    KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& team) {
        // Allocate scratch memory for each team
        Kokkos::View<double*, Kokkos::ScratchMemorySpace> scratchView(team.team_scratch(1), scratchSize);
    }
);

Memory management

View

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html

Create

Kokkos::View<DataType, LayoutType, MemorySpace, MemoryTraits> view("label", numberOfElementsAtRuntimeI, numberOfElementsAtRuntimeJ);
Template argument Description
DataType ScalarType for the data type, followed by a * for each runtime dimension, then by a [numberOfElements] for each compile time dimension
LayoutType See memory layouts
MemorySpace See memory spaces
MemoryTraits See memory traits

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html#constructors

Examples
// A 1D view of doubles
Kokkos::View<double*> view1D("view1D", 100);

// Const view of doubles
Kokkos::View<const double*> constView1D("constView1D", 100) = view1D;

// A 2D view of integers
Kokkos::View<int**> view2D("view2D", 50, 50);

// 3D view with 2 runtime dimensions and 1 compile time dimension
Kokkos::View<double**[25]> view3D("view3D", 50, 42, 25);

Manage

Method Description
(i, j...) Returns and sets the value at index i, j, etc.
size() Returns the total number of elements in the view
rank() Returns the number of dimensions
layout() Returns the layout of the view
extent(dim) Returns the number of elements in the requested dimension
data() Returns a pointer to the underlying data

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html#data-access-functions

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html#data-layout-dimensions-strides

Resize and preserve content
Kokkos::resize(view, newNumberOfElementsI, newNumberOfElementsJ...);

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/resize.html

Reallocate and do not preserve content
Kokkos::realloc(view, newNumberOfElementsI, newNumberOfElementsJ...);

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/realloc.html

Memory Layouts

Layout Description Default
Kokkos::LayoutRight Strides increase from the right most to the left most dimension, also known as row-major or C-like CPU
Kokkos::LayoutLeft Strides increase from the left most to the right most dimension, also known as column-major or Fortran-like GPU
Kokkos::LayoutStride Strides can be arbitrary for each dimension

By default, a layout suited for loops on the high frequency index is used.

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html#data-layout-dimensions-strides

Example
// 2D view with LayoutRight
Kokkos::View<double**, Kokkos::LayoutRight> view2D("view2D", 50, 50);

Memory trait

Memory traits are indicated with Kokkos::MemoryTraits<> and are combined with the | (pipe) operator.

Memory trait Description
Kokkos::Unmanaged The allocation has to be managed manually
Kokkos::Atomic All accesses to the view are atomic
Kokkos::RandomAccess Hint that the view is used in a random access manner; if the view is also const this may trigger more efficient load operations on GPUs
Kokkos::Restrict There is no aliasing of the view by other data structures in the current scope

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#access-traits

Examples
// Unmanaged view on CPU
double* data = new double[numberOfElements];
Kokkos::View<double*, Kokkos::MemoryTraits<Kokkos::Unmanaged>> unmanagedView(data, numberOfElements);

// Unmanaged view on GPU using CUDA
double* data;
cudaMalloc(&data, numberOfElements * sizeof(double));
Kokkos::View<double*, Kokkos::MemoryTraits<Kokkos::Unmanaged>, Kokkos::CudaSpace> unmanagedView(data, numberOfElements);

// Atomic view
Kokkos::View<double*, Kokkos::MemoryTraits<Kokkos::Atomic>> atomicView("atomicView", numberOfElements);

// Random access with constant data
// first, allocate non constant view
Kokkos::View<int*> nonConstView ("data", numberOfElements);
// then, make it constant
Kokkos::View<const int*, Kokkos::MemoryTraits<Kokkos::RandomAccess>> randomAccessView = nonConstView;

// Unmanaged, atomic, random access view on GPU using CUDA
double* data;
cudaMalloc(&data, numberOfElements* sizeof(double));
Kokkos::View<const double*,  Kokkos::CudaSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged | Kokkos::Atomic | Kokkos::RandomAccess>> unmanagedView(data, numberOfElements);

Deep copy

Warning Copying or assigning a view does a shallow copy, data are not synchronized in this case.

Kokkos::deep_copy(dest, src);

The views must have the same dimensions, data type, and reside in the same memory space (mirror views can be deep copied on different memory spaces).

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/deep_copy.html

Code

Example
Kokkos::View<double*> view1("view1", numberOfElements);
Kokkos::View<double*> view2("view2", numberOfElements);

// Deep copy of view1 to view2
Kokkos::deep_copy(view2, view1);

Mirror view

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/create_mirror.html

Code

Create and always allocate on host

auto mirrorView = Kokkos::create_mirror(view);

Create and allocate on host if source view is not in host space

auto mirrorView = Kokkos::create_mirror_view(view);

Create, allocate and synchronize if source view is not in same space as destination view

auto mirrorView = Kokkos::create_mirror_view_and_copy(ExecutionSpace(), view);

Subview

A subview has the same reference count as its parent view, so the parent view won't be deallocated before all subviews go away.

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/subview.html

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Subviews.html

auto subview = Kokkos::subview(view, Kokkos::ALL, Kokkos::pair(rangeFirst, rangeLast), value);
Subset selection Description
Kokkos::ALL All elements in this dimension
Kokkos::pair Range of elements in this dimension
value Specific element in this dimension

Scatter view (experimental)

Doc https://kokkos.org/kokkos-core-wiki/API/containers/ScatterView.html

Training https://github.com/kokkos/kokkos-tutorials/blob/main/LectureSeries/KokkosTutorial_03_MDRangeMoreViews.pdf

Code

Specific header

#include <Kokkos_ScatterView.hpp>

Create

ScatterView<DataType, Operation, ExecutionSpace, Layout, Contribution> scatter(targetView);
Template argument Description
DataType Scalar type of the view and its dimensionality
Operation See scatter operation; defaults to Kokkos::Experimental::ScatterSum
ExecutionSpace See execution spaces; defaults to Kokkos::DefaultExecutionSpace
Layout See layouts
Duplication Whether to duplicate the grid or not; defaults to Kokkos::Experimental::ScatterDuplicated, other option is Kokkos::Experimental::ScatterNonDuplicated
Contribution Whether to contribute to use atomics; defaults to Kokkos::Experimental::ScatterAtomic, other option is Kokkos::Experimental::ScatterNonAtomic

Scatter operation

Operation Description
Kokkos::Experimental::ScatterSum Sum
Kokkos::Experimental::ScatterProd Product
Kokkos::Experimental::ScatterMin Minimum value
Kokkos::Experimental::ScatterMax Maximum value

Scatter

auto access = scatter.access();

Compute

access(index) += value;

Gather

Kokkos::Experimental::contribute(targetView, scatter);
Full example
#include<Kokkos_ScatterView.hpp>

// Compute histogram of values in view1D
KOKKOS_INLINE_FUNCTION int getIndex(double pos) { /* ... */ }
KOKKOS_INLINE_FUNCTION double compute(double weight) { /* ... */ }

// List of elements to process
Kokkos::View<double*> positions("positions", 100);
Kokkos::View<double*> weight("weight", 100);

// Historgram of N bins
Kokkos::View<double*> histogram("bar", N);

Kokkos::Experimental::ScatterView<double*> scatter(histogram);
Kokkos::parallel_for(
    100,
    KOKKOS_LAMBDA(const int i) {
        // scatter
        auto access = scatter.access();

        // compute
        auto index = getIndex(positions(i);
        auto contribution = compute(weight(i);
        access(index) += contribution;
    }
);

// gather
Kokkos::Experimental::contribute(histogram, scatter);

Parallelism patterns

For loop

Kokkos::parallel_for(
    "label",
    ExecutionPolicy</* ... */>(/* ... */),
    KOKKOS_LAMBDA (/* ... */) { /* ... */ }
);

Reduction

ScalarType result;
Kokkos::parallel_reduce(
    "label",
    ExecutionPolicy</* ... */>(/* ... */),
    KOKKOS_LAMBDA (/* ... */, ScalarType& resultLocal) { /* ... */ },
    Kokkos::ReducerConcept<ScalarType>(result)
);

With Kokkos::ReducerConcept being one of the following:

Reducer Operation Description
Kokkos::BAnd & Binary and
Kokkos::BOr | Binary or
Kokkos::LAnd && Logical and
Kokkos::LOr || Logical or
Kokkos::Max std::max Maximum
Kokkos::MaxLoc std::max_element Maximum and associated index
Kokkos::Min std::min Minimum
Kokkos::MinLoc std::min_element Minimum and associated index
Kokkos::MinMax std::minmax Minimun and maximum
Kokkos::MinMaxLoc std::minmax_element Minimun and maximun and associated indices
Kokkos::Prod * Product
Kokkos::Sum + Sum

The reducer class can be omitted for Kokkos::Sum.

Doc https://kokkos.org/kokkos-core-wiki/API/core/parallel-dispatch/parallel_reduce.html

Doc https://kokkos.org/kokkos-core-wiki/API/core/builtin_reducers.html

Fences

Global fence

Kokkos::fence();

Doc https://kokkos.org/kokkos-core-wiki/API/core/parallel-dispatch/fence.html

Execution space fence

ExecutionSpace().fence();

Team barrier

Kokkos::TeamPolicy<>::member_type().team_barrier();

Doc https://kokkos.org/kokkos-core-wiki/API/core/execution_spaces.html#functionality

Execution policy

Create

ExecutionPolicy<ExecutionSpace, Schedule, IndexType, LaunchBounds, WorkTag> policy(/* ... */);
Template argument Description
ExecutionSpace See execution spaces; defaults to Kokkos::DefaultExecutionSpace
Schedule How to schedule work items; defaults to machine and backend specifics
IndexType Integer type to be used for the index; defaults to int64_t
LaunchBounds Hints for CUDA and HIP launch bounds
WorkTag Empty tag class to call the functor

Doc https://kokkos.org/kokkos-core-wiki/API/core/Execution-Policies.html

Ranges

One-dimensional range

Kokkos::RangePolicy<ExecutionSpace, Schedule, IndexType LaunchBounds, WorkTag> policy(first, last);

If the range starts at 0 and uses default parameters, can be replaced by just the number of elements.

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/RangePolicy.html

Multi-dimensional (dimension 2)

Kokkos::MDRangePolicy<ExecutionSpace, Schedule, IndexType, LaunchBounds, WorkTag, Kokkos::Rank<2>> policy({firstI, firstJ}, {lastI, lastJ});

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/MDRangePolicy.html

Hierarchical parallelism

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html

Team policy

Kokkos::TeamPolicy<ExecutionSpace, Schedule, IndexType, LaunchBounds, WorkTag> policy(leagueSize, teamSize);

Usually, teamSize is replaced by Kokkos::AUTO to let Kokkos determine it. A kernel running in a team policy has a Kokkos::TeamPolicy<>::member_type argument:

Method Description
league_size() Number of teams in the league
league_rank() Index of the team withing the league
team_size() Number of threads in the team
team_rank() Index of the thread within the team

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamPolicy.html

Team vector level (2-level hierarchy)

Kokkos::parallel_for(
    "label",
    Kokkos::TeamPolicy<>(numberOfElementsI, Kokkos::AUTO),
    KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& teamMember) {
        const int i = teamMember.team_rank();

        Kokkos::parallel_for(
            Kokkos::TeamVectorRange(teamMember, firstJ, lastJ),
            [=] (const int j) { /* ... */ }
        );
    }
);
One-dimensional range
Kokkos::TeamVectorRange range(teamMember, firstJ, lastJ);

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamVectorRange.html

Multi-dimensional range (dimension 2)
Kokkos::TeamVectorMDRange<Kokkos::Rank<2>, Kokkos::TeamPolicy<>::member_type> range(teamMember, numberOfElementsJ, numberOfElementsK);

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamVectorMDRange.html

Team thread vector level (3-level hierarchy)

Kokkos::parallel_for(
    "label",
    Kokkos::TeamPolicy<>(numberOfElementsI, Kokkos::AUTO),
    KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& teamMember) {
        const int i = teamMember.team_rank();

        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(teamMember, firstJ, lastJ),
            [=] (const int j) {
                Kokkos::parallel_for(
                    Kokkos::ThreadVectorRange(teamMember, firstK, lastK),
                    [=] (const int k) { /* ... */ }
                );
            }
        );
    }
);
One-dimensional range
Kokkos::TeamThreadRange range(teamMember, firstJ, lastJ);
Kokkos::ThreadVectorRange range(teamMember, firstK, lastK);

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamThreadRange.html

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/ThreadVectorRange.html

Multi-dimensional range (dimension 2)
Kokkos::TeamThreadMDRange<Kokkos::Rank<2>, Kokkos::TeamPolicy<>::member_type> range(teamMember, numberOfElementsJ, numberOfElementsK);
Kokkos::ThreadVectorMDRange<Kokkos::Rank<2>, Kokkos::TeamPolicy<>::member_type> range(teamMember, numberOfElementsL, numberOfElementsM);

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamThreadMDRange.html

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/ThreadVectorMDRange.html

Scratch memory

Each team has access to a scratch memory pad, which has the team's lifetime, and is only accessible by the team's threads.

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html#team-scratch-pad-memory

Scratch memory space

Level Memory size Access speed
0 Limited (tens of kilobytes) Fast
1 Larger (few gigabytes) Medium

Create and populate

// Define a scratch memory view type
using ScratchPadView = View<double*, ExecutionSpace::scratch_memory_space, MemoryUnmanaged>;

// Compute how much scratch memory (in bytes) is needed
size_t bytes = ScratchPadView::shmem_size(vectorSize);

Kokkos::parallel_for(
    Kokkos::TeamPolicy<ExecutionSpace>(leagueSize, teamSize).set_scratch_size(spaceLevel, Kokkos::PerTeam(bytes)),
    KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& teamMember) {
        const int i = teamMember.team_rank();

        // Create a view for the scratch pad
        ScratchPadView scratch(teamMember.team_scratch(spaceLevel), vectorSize);

        // Initialize it
        Kokkoss::parallel_for(
            Kokkos::ThreadVectorRange(teamMember, vectorSize),
            [=] (const int j) { scratch(j) = view(i, j); }
        );

        // Synchronize
        teamMember.team_barrier();
    }
);

Atomics

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Atomic-Operations.html

Doc https://kokkos.org/kokkos-core-wiki/API/core/atomics.html

Atomic operations

Operation Replaces
Kokkos::atomic_add +=
Kokkos::atomic_and &=
Kokkos::atomic_assign =
Kokkos::atomic_decrement --
Kokkos::atomic_increment ++
Kokkos::atomic_max std::max on previous and new value
Kokkos::atomic_min std::min on previous and new value
Kokkos::atomic_or |=
Kokkos::atomic_sub -=
Example
Kokkos::parallel_for(
    numberOfElements,
    KOKKOS_LAMBDA (const int i) {
        const int value = /* ... */;
        const int bucketIndex = computeBucketIndex (value);
        Kokkos::atomic_increment(&histogram(bucketIndex));
    }
);

Atomic exchanges

Operation Description
Kokkos::atomic_exchange Assign destination to new value and return old value
Kokkos::atomic_compare_exchange_strong Assign destination to new value if old value equals a comparison value
Example
// Assign destination to new value and return old value
int new = 20;
int destination = 10;
int old = atomic_exchange(&destination, new);

// Assign destination to new value if old value equals a comparison value
int new = 20;
int destination = 10;
int comparison = 10;
bool success = atomic_compare_exchange_strong(&destination, comparison, new);

Mathematics

Math functions

Function type List of functions (prefixed by Kokkos::)
Basic operations abs, fabs, fmod, remainder, fma, fmax, fmin, fdim, nan
Exponential exp, exp2, expm1, log, log2, log10, log1p
Power pow, sqrt, cbrt, hypot
Trigonometric sin, cos, tan, asin, acos, atan, atan2
Hyperbolic sinh, cosh, tanh, asinh, acosh, atanh
Error and gamma erf, erfc, tgamma, lgamma
Nearest ceil, floor, trunc, round, nearbyint
Floating point logb, nextafter, copysign
Comparisons isfinite, isinf, isnan, signbit

Note that not all C++ standard math functions are available.

Doc https://kokkos.org/kokkos-core-wiki/API/core/numerics/mathematical-functions.html?highlight=math

Complex numbers

Create

Kokkos::complex<double> complex(realPart, imagPart);

Manage

Method Description
real() Returns or sets the real part
imag() Returns or sets the imaginary part

Doc https://kokkos.org/kokkos-core-wiki/API/core/utilities/complex.html

Utilities

Doc https://kokkos.org/kokkos-core-wiki/API/core/Utilities.html

Code interruption

Kokkos::abort("message");

Print inside a kernel

Kokkos::printf("format string", arg1, arg2);

Similar to std::printf.

Timer

Create

Kokkos::Timer timer;

Manage

Method Description
seconds() Returns the time in seconds since construction or last reset
reset() Resets the timer to zero

Manage parallel environment

Function Description
Kokkos::device_id() Returns the device ID of the current device
Kokkos::num_devices() Returns the number of devices available to the current execution space

Macros

Doc https://kokkos.org/kokkos-core-wiki/API/core/Macros.html

Essential macros

Macro Description
KOKKOS_LAMBDA Replaces capture argument for lambdas
KOKKOS_INLINE_FUNCTION Inlined functor attribute
KOKKOS_FUNCTION Functor attribute

Extra macros

Macro Description
KOKKOS_VERSION Kokkos full version
KOKKOS_VERSION_MAJOR Kokkos major version
KOKKOS_VERSION_MINOR Kokkos minor version
KOKKOS_VERSION_PATCH Kokkos patch level
KOKKOS_ENABLE_* Any equivalent CMake option passed when building Kokkos, see installation cheat sheet
KOKKOS_ARCH_* Any equivalent CMake option passed when building Kokkos, see installation cheat sheet