fix-66 #74

MichaelClifford · 2024-10-08T13:34:51Z

Closes #66

These changes have been tested on the MOC cluster and the training run completes as part of a Kubeflow pipeline.

This PR changes:

The image for model training has been changed from a custom model to the RHEL AI 1.1 image
Replaced the custom script for running training with a direct torchrun call

Signed-off-by: Michael Clifford <mcliffor@redhat.com>

Shreyanand

LGTM. The PR adds the torchrun command to the PyTorch job component. The command calls the instructlab main_ds training script directly, which eliminates the need for additional scripts such as run_main_ds.py that were previously used.
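
For readers unfamiliar with the pattern, here is a minimal sketch of how a direct torchrun invocation might be assembled, assuming a multi-node job where each worker knows its rank and the rendezvous endpoint. It is not the command from this PR: the helper name `build_torchrun_command`, the script path, and the example values are hypothetical. Only the torchrun launcher flags (`--nnodes`, `--nproc_per_node`, `--node_rank`, `--rdzv_endpoint`) are standard PyTorch options, and any arguments for the training script are passed through untouched.

```python
import shlex
from typing import List, Sequence


def build_torchrun_command(
    nnodes: int,
    nproc_per_node: int,
    node_rank: int,
    rdzv_endpoint: str,
    training_script: str,
    training_args: Sequence[str] = (),
) -> List[str]:
    """Assemble a torchrun argv list (hypothetical helper, not from the PR)."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of nodes in the job
        f"--nproc_per_node={nproc_per_node}",  # worker processes (usually GPUs) per node
        f"--node_rank={node_rank}",            # rank of this node within the job
        f"--rdzv_endpoint={rdzv_endpoint}",    # host:port used for rendezvous
        training_script,                       # path to the training entry point
        *training_args,                        # script arguments, passed through as-is
    ]


if __name__ == "__main__":
    # All values below are placeholders chosen for illustration only.
    cmd = build_torchrun_command(
        nnodes=2,
        nproc_per_node=4,
        node_rank=0,
        rdzv_endpoint="master-host:29500",
        training_script="path/to/main_ds.py",
    )
    # Print a shell-safe string that could be used as a container command.
    print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps quoting correct if a path or argument contains spaces, which is one reason to prefer this over string concatenation in a pipeline component.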

MichaelClifford requested a review from Shreyanand October 8, 2024 13:46

MichaelClifford force-pushed the fix-66 branch 3 times, most recently from b4d58d2 to 0d53b31 Compare October 8, 2024 23:24

MichaelClifford mentioned this pull request Oct 8, 2024

Switch to FSDP #51

Open

MichaelClifford force-pushed the fix-66 branch 3 times, most recently from d7ea4ad to b807ceb Compare October 9, 2024 02:06

WIP: fix-66

9b0fd55

Signed-off-by: Michael Clifford <mcliffor@redhat.com>

MichaelClifford force-pushed the fix-66 branch from b807ceb to 9b0fd55 Compare October 9, 2024 13:00

MichaelClifford changed the title ~~WIP: fix-66~~ fix-66 Oct 9, 2024

MichaelClifford marked this pull request as ready for review October 9, 2024 14:13

Shreyanand approved these changes Oct 9, 2024

Shreyanand merged commit 249dfd3 into redhat-et:main Oct 9, 2024
1 check passed
