Releases · NVIDIA/cloudai

Refactor SlurmCommandGenStrategy (_write_sbatch_script) by @TaekyungHeo in #253
Refactor JaxToolboxSlurmCommandGenStrategy unit tests by @TaekyungHeo in #259
Handle node allocation errors gracefully, log details, and exit on failure by @TaekyungHeo in #264

Full Changelog: v0.9.beta18...v0.9.beta19

Contributors

TaekyungHeo

14 Oct 15:30

amaslenn

v0.9.beta18

0af4cd9

v0.9.beta18 Pre-release

What's Changed

Cleanup docs from mentioning --mode option by @amaslenn in #260
Improve verify modes by @amaslenn in #262

Full Changelog: v0.9.beta17...v0.9.beta18

Contributors

amaslenn

11 Oct 15:28

TaekyungHeo

v0.9.beta17

ae67c34

v0.9.beta17 Pre-release

What's Changed

Refactor JaxToolboxSlurmCommandGenStrategy by @TaekyungHeo in #254
Move JaxToolbox-related test definitions to CloudAI by @TaekyungHeo in #257

Full Changelog: v0.9.beta16...v0.9.beta17

Contributors

TaekyungHeo

11 Oct 06:51

amaslenn

v0.9.beta16

156ccf9

v0.9.beta16 Pre-release

Highlights

Use subcommands instead of --mode <value> by @amaslenn in #194

The new help message looks like this:

> cloudai --help
usage: cloudai [-h] [--log-file LOG_FILE] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
               {uninstall,install,dry-run,run,generate-report,verify-systems,verify-tests,verify-test-scenarios} ...

Cloud AI

optional arguments:
  -h, --help            show this help message and exit
  --log-file LOG_FILE   The name of the log file (default: debug.log).
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: INFO).

modes:
  {uninstall,install,dry-run,run,generate-report,verify-systems,verify-tests,verify-test-scenarios}
    uninstall           Remove the installed dependencies.
    install             Prepare execution by setting up env and dependencies for the tests to run.
    dry-run             Perform a dry-run of the test scenarios without executing them.
    run                 Execute the test scenarios.
    generate-report     Generate a report based on the test results.
    verify-systems      Verify the system configurations.
    verify-tests        Verify the test configurations.
    verify-test-scenarios
                        Verify the test scenario configurations.

Each command (a.k.a. mode) has its own help message.

Each command also has a unique set of required and optional arguments. Many commands share the same options, but others differ considerably, for example:

> cloudai run --help
usage: cloudai run [-h] --system-config SYSTEM_CONFIG --tests-dir TESTS_DIR --test-scenario TEST_SCENARIO [--output-dir OUTPUT_DIR]

optional arguments:
  -h, --help            show this help message and exit
  --system-config SYSTEM_CONFIG
                        Path to the system configuration file.
  --tests-dir TESTS_DIR
                        Path to the test configuration directory.
  --test-scenario TEST_SCENARIO
                        Path to the test scenario file.
  --output-dir OUTPUT_DIR
                        Path to the output directory.

> cloudai verify-tests --help
usage: cloudai verify-tests [-h] test_configs

positional arguments:
  test_configs  Path to the test configuration file or directory.

optional arguments:
  -h, --help    show this help message and exit
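
A subcommand layout like the one in the help output above can be sketched with Python's standard argparse module. This is only an illustration under assumptions, not the actual cloudai implementation: the function name build_parser and the dest name "mode" are hypothetical, and only two of the listed commands are wired up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with global options and per-command arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="cloudai", description="Cloud AI")
    parser.add_argument("--log-file", default="debug.log", help="The name of the log file.")
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Set the logging level.",
    )

    # Each subcommand gets its own parser, and therefore its own --help output.
    subparsers = parser.add_subparsers(dest="mode", title="modes", required=True)

    run = subparsers.add_parser("run", help="Execute the test scenarios.")
    run.add_argument("--system-config", required=True, help="Path to the system configuration file.")
    run.add_argument("--tests-dir", required=True, help="Path to the test configuration directory.")
    run.add_argument("--test-scenario", required=True, help="Path to the test scenario file.")
    run.add_argument("--output-dir", help="Path to the output directory.")

    verify = subparsers.add_parser("verify-tests", help="Verify the test configurations.")
    verify.add_argument("test_configs", help="Path to the test configuration file or directory.")

    return parser


# The file names below are placeholders, not files shipped with cloudai.
args = build_parser().parse_args(
    ["run", "--system-config", "system.toml", "--tests-dir", "tests", "--test-scenario", "scenario.toml"]
)
print(args.mode, args.system_config)
```

Because each subparser declares its own arguments, "run" can require three options while "verify-tests" takes a single positional argument, which matches the difference visible in the two help messages.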

What's Changed

Refactor NeMoLauncherSlurmCommandGenStrategy unit tests by @TaekyungHeo in #252
Refactor JaxToolboxSlurmCommandGenStrategy by @TaekyungHeo in #249

Full Changelog: v0.9.beta15...v0.9.beta16

Contributors

amaslenn and TaekyungHeo

09 Oct 13:56

TaekyungHeo

v0.9.beta15

2b6181b

v0.9.beta15 Pre-release

What's Changed

Remove assigning null when the value is null (NeMo launcher) by @TaekyungHeo in #250

Full Changelog: v0.9.beta14...v0.9.beta15

Contributors

TaekyungHeo

09 Oct 13:23

amaslenn

v0.9.beta14

7455f42

v0.9.beta14 Pre-release

What's Changed

Fix bug in violating Kubernetes naming rules by @TaekyungHeo in #244
Add unit tests for SlurmCommandGenStrategy by @TaekyungHeo in #247
Fix missing 'output_path' in cmd_args by @amaslenn in #251

Full Changelog: v0.9.beta13...v0.9.beta14

Contributors

amaslenn and TaekyungHeo

09 Oct 09:53

amaslenn

v0.9.beta13

e54f4c1

v0.9.beta13 Pre-release

What's Changed

Update Sleep to ensure implementation consistency by @TaekyungHeo in #234
Update USER_GUIDE.md and README.md by @TaekyungHeo in #235
Remove duplicated _format_env_vars calls by @TaekyungHeo in #233
Rename test definitions by @TaekyungHeo in #237
Remove unnecessary arg from generate_test_command by @TaekyungHeo in #238
Spin-off cmd_args validation logic for SlurmCommandGenStrategy by @TaekyungHeo in #236
Expect SlurmSystem in respective cmd_gen and installer classes by @amaslenn in #239
Move more fields from Test to TestRun by @amaslenn in #240
Make TestDefinition a part of Test by @amaslenn in #241
Minor refactoring on SlurmCommandGenStrategy by @TaekyungHeo in #246
Break down test_slurm_command_gen_strategy into smaller tests by @TaekyungHeo in #245
Resolve K8s Comments (Part 1) by @TaekyungHeo in #242
Fix race condition during docker images caching by @amaslenn in #248

Full Changelog: v0.9.beta12...v0.9.beta13

Contributors

amaslenn and TaekyungHeo

07 Oct 17:38

TaekyungHeo

v0.9.beta12

c40e92c

v0.9.beta12 Pre-release

Highlights

We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for Test Scenario configs. This is a continuation of #145.

Tests becomes an array. This helps make case names more expressive:
before:

[Tests.1]
# ...

now:

[[Tests]]
id = "any-name.you_want" # before it was just "1"

The id field is mandatory, must be unique, and is used to specify dependencies:

[[Tests]]
id = "Tests.1"
# ...

[[Tests]]
id = "Tests.2"
# ...
  [[Tests.dependencies]]
  id = "Tests.1"
  # ...

name (under the list of tests) is renamed to test_name to better reflect its meaning. It still references a test defined in a separate TOML file.

Dependencies are converted to a list to support multiple dependencies of the same type.
before:

# ...

[Tests.2]
name = "ucc_test_alltoall"
  [Tests.2.dependencies]
  start_post_comp = { name = "Tests.1", time = 0 }  # only one dependency of this type is allowed

now:

# ...

[[Tests]]
id = "Tests.3"
test_name = "ucc_test_alltoall"
# ...
  [[Tests.dependencies]]
  type = "start_post_comp"
  id = "Tests.1"

  [[Tests.dependencies]]
  type = "start_post_comp"
  id = "Tests.2"
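
For existing scenario files, the before/after mapping described above can be sketched as a small conversion over already-parsed data. This is an illustration under assumptions, not a tool shipped with cloudai: the function name migrate_tests is hypothetical, the new id is assumed to be "Tests." plus the old table key, and the old time field is dropped because the new-format example does not show it (whether the new schema accepts it is not stated here).

```python
def migrate_tests(old_tests: dict) -> list[dict]:
    """Convert the old [Tests.N] table layout into the new [[Tests]] list layout."""
    new_tests = []
    for key, old in old_tests.items():
        # Assumption: the old key "1" was referenced as "Tests.1", so reuse that as the id.
        new = {"id": f"Tests.{key}"}
        for field, value in old.items():
            if field == "name":
                # "name" was renamed to "test_name".
                new["test_name"] = value
            elif field == "dependencies":
                # Old: one table entry per dependency type. New: a list of {type, id}.
                new["dependencies"] = [
                    {"type": dep_type, "id": dep["name"]}
                    for dep_type, dep in value.items()
                ]
            else:
                new[field] = value
        new_tests.append(new)
    return new_tests


# Old-style data as it would look after parsing the "before" TOML.
old = {
    "1": {"name": "ucc_test_alltoall"},
    "2": {
        "name": "ucc_test_alltoall",
        "dependencies": {"start_post_comp": {"name": "Tests.1", "time": 0}},
    },
}

for test in migrate_tests(old):
    print(test)
```

Since the new dependencies field is a list, a test can carry several entries with the same type, which the old one-key-per-type table could not express.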

What's Changed

Cover wrong python bin path in exec script bug by @amaslenn in #232
Pydantic for Test Scenario by @amaslenn in #205

Full Changelog: v0.9.beta11...v0.9.beta12

Contributors

amaslenn

07 Oct 15:57

TaekyungHeo

v0.9.beta11

d7f4235

v0.9.beta11 Pre-release

What's Changed

Pass TestRun to gen_exec_command and gen_json by @TaekyungHeo in #228
Bug fix for incorrect py_bin in NeMoLauncher by @TaekyungHeo in #231

Full Changelog: v0.9.beta10...v0.9.beta11

Contributors

TaekyungHeo
