
Commit

fix overwrite of neuron devices when efa devices are also specified (#1656)

* fix overwrite of neuron devices when efa devices are also specified

* fix nodeOverrides on job submit error by capitalizing neuron device permissions in job definition

* black formatting
emattia authored Feb 6, 2024
1 parent 3bbde92 commit aaf70e2
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions metaflow/plugins/aws/batch/batch_client.py
@@ -271,7 +271,7 @@ def _register_job_definition(
     {
         "containerPath": "/dev/neuron{}".format(i),
         "hostPath": "/dev/neuron{}".format(i),
-        "permissions": ["read", "write"],
+        "permissions": ["READ", "WRITE"],
     }
 )
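
The hunk above changes only the casing of the device permissions. AWS Batch lists `READ`, `WRITE`, and `MKNOD` as the valid values for a device's `permissions`, and the commit message reports that the lowercase values caused an error on job submit with nodeOverrides. A minimal sketch of building one neuron device entry with that constraint checked up front follows; the helper name `neuron_device` and the validation step are hypothetical and not part of Metaflow.

```python
# Hypothetical helper for illustration; Metaflow builds this dict inline.
VALID_PERMISSIONS = {"READ", "WRITE", "MKNOD"}  # values accepted by AWS Batch


def neuron_device(index, permissions=("READ", "WRITE")):
    """Return a device mapping for /dev/neuron<index> with uppercase permissions."""
    perms = [p.upper() for p in permissions]
    invalid = [p for p in perms if p not in VALID_PERMISSIONS]
    if invalid:
        raise ValueError("Invalid device permissions: {}".format(invalid))
    path = "/dev/neuron{}".format(index)
    return {"containerPath": path, "hostPath": path, "permissions": perms}


device = neuron_device(0)
```

Normalizing the casing in one place is only one possible design; the commit itself simply hard-codes the uppercase strings.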

@@ -344,7 +344,15 @@ def _register_job_definition(
         "Invalid efa value: ({}) (should be 0 or greater)".format(efa)
     )
 else:
-    job_definition["containerProperties"]["linuxParameters"]["devices"] = []
+    if "linuxParameters" not in job_definition["containerProperties"]:
+        job_definition["containerProperties"]["linuxParameters"] = {}
+    if (
+        "devices"
+        not in job_definition["containerProperties"]["linuxParameters"]
+    ):
+        job_definition["containerProperties"]["linuxParameters"][
+            "devices"
+        ] = []
     if (num_parallel or 0) > 1:
         # Multi-node parallel jobs require the container path and permissions explicitly specified in Job definition
         for i in range(int(efa)):
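
The second hunk replaces an unconditional `devices = []` assignment, which discarded any neuron devices registered earlier, with guarded initialization. The same behavior can be sketched with `dict.setdefault`; this is an illustrative equivalent rather than the code in the commit, and the helper name `ensure_devices` and the EFA device path used below are assumptions.

```python
def ensure_devices(job_definition):
    """Return the devices list, creating missing levels without overwriting."""
    container = job_definition["containerProperties"]
    linux_params = container.setdefault("linuxParameters", {})
    return linux_params.setdefault("devices", [])


# A job definition that already carries a neuron device.
job_definition = {
    "containerProperties": {
        "linuxParameters": {
            "devices": [
                {
                    "containerPath": "/dev/neuron0",
                    "hostPath": "/dev/neuron0",
                    "permissions": ["READ", "WRITE"],
                }
            ]
        }
    }
}

devices = ensure_devices(job_definition)
# Assumed EFA device path, for illustration only.
devices.append(
    {
        "containerPath": "/dev/infiniband/uverbs0",
        "hostPath": "/dev/infiniband/uverbs0",
        "permissions": ["READ", "WRITE", "MKNOD"],
    }
)

# A definition without linuxParameters gets an empty list to append to.
fresh = {"containerProperties": {}}
fresh_devices = ensure_devices(fresh)
```

Because `setdefault` returns the existing list when one is present, the neuron entry stays first and the EFA entry is appended after it.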
