Algorithm and Protocol Tuner for AWS

Raghu Raja edited this page Aug 16, 2024 · 8 revisions

The v1.8.0-aws release added a new tuner component to the plugin. Currently, this tuner is only built for the AWS platform.

NCCL computes the cost of each algorithm/protocol combination from three inputs: empirically derived costs for different aspects of data movement, the latency and bandwidth reported by the network plugin in use, and other algorithm-specific coefficients. At runtime, it uses these costs to pick an algorithm and protocol based on the size of the communicator and the message size of the operation. The default coefficients and empirically derived costs are not optimal for AWS's network.
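To make the idea of cost-based selection concrete, here is a minimal sketch of a latency-plus-bandwidth estimate. It is not NCCL's actual cost model: the `Candidate` type, the `estimated_time_us` and `pick` helpers, and every coefficient below are hypothetical values chosen for illustration, and the sketch ignores the communicator size entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One algorithm/protocol pair with made-up (hypothetical) coefficients."""
    algorithm: str
    protocol: str
    latency_us: float       # fixed per-operation cost, in microseconds
    bandwidth_gb_s: float   # sustained bandwidth, in gigabytes per second


def estimated_time_us(candidate, nbytes):
    """Estimate the operation time as fixed latency plus transfer time."""
    # 1 GB/s moves 1e3 bytes per microsecond.
    return candidate.latency_us + nbytes / (candidate.bandwidth_gb_s * 1e3)


def pick(candidates, nbytes):
    """Return the candidate with the lowest estimated time for this size."""
    return min(candidates, key=lambda c: estimated_time_us(c, nbytes))


# Illustrative numbers only, not measured on any network.
CANDIDATES = [
    Candidate("tree", "LL", latency_us=10.0, bandwidth_gb_s=2.0),
    Candidate("ring", "Simple", latency_us=50.0, bandwidth_gb_s=20.0),
]

small = pick(CANDIDATES, 1024)              # latency dominates
large = pick(CANDIDATES, 64 * 1024 * 1024)  # bandwidth dominates
print(small.algorithm, small.protocol)
print(large.algorithm, large.protocol)
```

With these made-up coefficients the low-latency candidate wins for small messages and the high-bandwidth candidate wins for large ones. Coefficients that do not match the real network move that crossover point to the wrong message size, which is the kind of mismatch the paragraph above describes.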

The nccl_ofi_tuner component replaces NCCL's internal tuning mechanism with a custom tuner that uses a region-based approach to choose the best algorithm and protocol for a given combination of number of nodes, number of ranks, and data size. For each algorithm and protocol combination, the tuner holds a set of vertices defining a 2-D polygonal region in the (data size, number of ranks) space where, according to experimental results, that combination should be preferred. When the tuner is queried with a data size and number of ranks, it treats (data size, number of ranks) as a point in that 2-D space and runs a ray-tracing algorithm to find the region that contains the point, and thus the algorithm and protocol combination to select.

Regions are allowed to overlap, and the tuner chooses the first region found to contain the requested point, so regions are scanned in a specific order that checks preferred regions first. Additionally, different values of the number of ranks per node have different sets of regions. If a tuner query finds no region that contains the requested inputs, NCCL falls back to its original internal tuner.
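The region lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: the point-in-region check uses a standard even-odd ray-casting test, and the `Region` type, the `contains` and `choose` helpers, and the example regions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A polygon in (data size, number of ranks) space for one combination."""
    algorithm: str
    protocol: str
    vertices: tuple  # polygon corners as (nbytes, nranks) pairs, in order


def contains(vertices, point):
    """Even-odd ray casting: count edges crossed by a ray going in +x."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def choose(regions_by_ppn, ranks_per_node, nbytes, nranks):
    """Return (algorithm, protocol) of the first matching region, or None."""
    # Regions are scanned in list order, so preferred regions go first.
    for region in regions_by_ppn.get(ranks_per_node, ()):
        if contains(region.vertices, (nbytes, nranks)):
            return region.algorithm, region.protocol
    return None  # the caller would fall back to the default tuner


# Hypothetical, overlapping regions for 8 ranks per node.
REGIONS = {
    8: [
        Region("tree", "LL", ((0, 0), (65536, 0), (65536, 256), (0, 256))),
        Region("ring", "Simple",
               ((32768, 0), (1e9, 0), (1e9, 256), (32768, 256))),
    ],
}

print(choose(REGIONS, 8, 1024, 16))       # inside the first region only
print(choose(REGIONS, 8, 40000, 16))      # inside both; first one wins
print(choose(REGIONS, 8, 1_000_000, 16))  # inside the second region only
print(choose(REGIONS, 8, 1024, 512))      # outside every region
```

Keeping the regions in an ordered list per ranks-per-node value makes the "first match wins" rule explicit: reordering the list changes which combination is selected where regions overlap.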

With NCCL v2.20.3 and newer, the tuner can be loaded by setting the following environment variable:

NCCL_TUNER_PLUGIN=$PATH_TO_AWS_OFI_NCCL_PLUGIN_INSTALL/lib/libnccl-ofi-tuner.so

To confirm that the tuner has been loaded, enable tuner logging:

NCCL_DEBUG_SUBSYS=INIT,TUNING

and verify you see the equivalent of the following entries in the log:

hostname:46348 [7] NCCL INFO NCCL_TUNER_PLUGIN set to /home/user/aws-ofi-nccl/install/libnccl-ofi-tuner.so
hostname:46279:46348 [7] NCCL INFO Opened tuner: 'nccl_ofi_tuner'
hostname:46279:46348 [7] NCCL INFO Using tuner plugin: 'nccl_ofi_tuner'
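When jobs are launched from a script, the settings and the log check above can be automated. The sketch below makes several assumptions: `tuner_env` and `loaded_tuner` are hypothetical helper names, the install prefix is a placeholder, and `NCCL_DEBUG=INFO` is added because NCCL's subsystem filter applies to INFO-level output (this page does not state that requirement). The exact log wording may differ between NCCL versions, so the pattern may need adjusting.

```python
import os
import re

# Matches the "Using tuner plugin" entry shown in the sample log above.
TUNER_LOADED = re.compile(r"NCCL INFO Using tuner plugin: '([^']+)'")


def tuner_env(install_prefix, base_env=None):
    """Return a copy of the environment with the tuner settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_TUNER_PLUGIN"] = os.path.join(
        install_prefix, "lib", "libnccl-ofi-tuner.so")
    env["NCCL_DEBUG"] = "INFO"  # assumption: needed for INFO-level log lines
    env["NCCL_DEBUG_SUBSYS"] = "INIT,TUNING"
    return env


def loaded_tuner(log_text):
    """Return the tuner name from the first matching log line, or None."""
    match = TUNER_LOADED.search(log_text)
    return match.group(1) if match else None


# "/opt/aws-ofi-nccl" is a placeholder prefix, not a documented default.
env = tuner_env("/opt/aws-ofi-nccl", base_env={})

sample_log = (
    "hostname:46279:46348 [7] NCCL INFO Opened tuner: 'nccl_ofi_tuner'\n"
    "hostname:46279:46348 [7] NCCL INFO Using tuner plugin: 'nccl_ofi_tuner'\n"
)
print(env["NCCL_TUNER_PLUGIN"])
print(loaded_tuner(sample_log))
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the job, and `loaded_tuner` can then be applied to the captured output.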

Note: NCCL v2.20.3 has some known issues when an external tuner is loaded and a single process manages multiple GPUs. These have been resolved in newer versions of NCCL.
