feat: remove dependency on KFP lib + eval serving endpoint support + ci #76

leseb · 2024-10-08T20:42:42Z

195862b feat: remove dependency on KFP lib
7d8d314 feat: add serving endpoint details for eval
6852064 ci: add a check to ensure the pipeline is up to date
ca03343 fix: clarify the structure of the initial tarball
2b9d01e bulk commit
2eb8e1c fix: eval, do not use external deps
8a3cef9 misc: use the correct images

commit 195862b
Author: Sébastien Han seb@redhat.com
Date: Tue Oct 8 22:41:10 2024 +0200

feat: remove dependency on KFP lib

We now use a generic python executor to call the eval and any DSL
component functions. They are almost identical, only the signature is
updated to not use KFP Classes.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 7d8d314
Author: Sébastien Han seb@redhat.com
Date: Wed Oct 9 09:54:26 2024 +0200

feat: add serving endpoint details for eval

During the initial MT-Bench eval, we must connect to a serving endpoint.
The script has been updated with new required flags:

* `--eval-serving-endpoint`: Serving endpoint for evaluation. e.g:
  http://serving.kubeflow.svc.cluster.local:8080/v1 - **Required**
* `--eval-serving-model-name`: The name of the model to use for evaluation. **Required**
* `--eval-serving-model-api-key`: The API key for the model to evaluate. `EVAL_SERVING_MODEL_API_KEY`
  environment variable can be used as well. **Required**

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 6852064
Author: Sébastien Han seb@redhat.com
Date: Wed Oct 9 09:48:25 2024 +0200

ci: add a check to ensure the pipeline is up to date

If someone pushes code changes and the pipeline.yaml has not been
updated accordingly, the CI will fail and ask to update it and push
again.

Also, you can now generate a pipeline with `make pipeline`.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit ca03343
Author: Sébastien Han seb@redhat.com
Date: Wed Oct 9 14:40:27 2024 +0200

fix: clarify the structure of the initial tarball

The tarball located on S3 must contains both SDG data and the model to
train.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 2b9d01e
Author: Sébastien Han seb@redhat.com
Date: Wed Oct 9 23:14:44 2024 +0200

bulk commit

Sorry, I went really far with this one but I can confirm that:

* sdg-data-fetch is working
* data processing works
* training phase 1 is working
* training phase 2 is working

Also:

* remove backtick from the code since it breaks the shell that runs the
  python executor
* only use a single PVC for everything: sdg data, model, trained model
* --force-pull: to force pulling from the object store again if the data
  are already present

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 2eb8e1c
Author: Sébastien Han seb@redhat.com
Date: Thu Oct 10 12:02:14 2024 +0200

fix: eval, do not use external deps

The pipeline excepts to have all the functions at its disposal to run.
So helper packages cannot be used.
In this case, helpers only works because the current eval is using a
custom image.
Let's move all the logic inside the component.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 8a3cef9
Author: Sébastien Han seb@redhat.com
Date: Thu Oct 10 12:49:27 2024 +0200

misc: use the correct images

Signed-off-by: Sébastien Han <seb@redhat.com>

sallyom · 2024-10-09T14:41:18Z

standalone/README.md

+  --namespace my-namespace \
+  --eval-serving-endpoint http://serving.kubeflow.svc.cluster.local:8080/v1 \
+  --eval-serving-model-name my-model \
+  --eval-serving-model-api-key ***** \


canthis be judge-serving instead of eval-serving? ilab uses the term judge_model and --judge-model (CLI doesn't yet support external endpoint or api_key though)

--judge-serving-model-name --judge-serving-endpoint --judge-serving-model-api-key

We now use a generic python executor to call the eval and any DSL component functions. They are almost identical, only the signature is updated to not use KFP Classes. Signed-off-by: Sébastien Han <seb@redhat.com>

During the initial MT-Bench eval, we must connect to a serving endpoint. The script has been updated with new required flags: * `--eval-serving-endpoint`: Serving endpoint for evaluation. e.g: http://serving.kubeflow.svc.cluster.local:8080/v1 - **Required** * `--eval-serving-model-name`: The name of the model to use for evaluation. **Required** * `--eval-serving-model-api-key`: The API key for the model to evaluate. `EVAL_SERVING_MODEL_API_KEY` environment variable can be used as well. **Required** Signed-off-by: Sébastien Han <seb@redhat.com>

If someone pushes code changes and the pipeline.yaml has not been updated accordingly, the CI will fail and ask to update it and push again. Also, you can now generate a pipeline with `make pipeline`. Signed-off-by: Sébastien Han <seb@redhat.com>

The tarball located on S3 must contains both SDG data and the model to train. Signed-off-by: Sébastien Han <seb@redhat.com>

standalone/standalone.tpl

astefanutti · 2024-10-10T07:14:20Z

standalone/standalone.py

+                  export TRITON_CACHE_DIR=/tmp
+                  export HF_HOME=/tmp
+                  export TRANSFORMERS_CACHE=/tmp
+                  torchrun --nnodes {nnodes} \


That doesn't seem fully equivalent to running the run_main_ds.py, e.g. for options validation. Maybe the logic from the run_training in instructlab.training.main_ds could be used?

I'm deferring this to @MichaelClifford who originally implemented this code.

standalone/standalone.py

Sorry, I went really far with this one but I can confirm that: * sdg-data-fetch is working * data processing works * training phase 1 is working * training phase 2 is working Also: * remove backtick from the code since it breaks the shell that runs the python executor * only use a single PVC for everything: sdg data, model, trained model * --force-pull: to force pulling from the object store again if the data are already present Signed-off-by: Sébastien Han <seb@redhat.com>

The pipeline excepts to have all the functions at its disposal to run. So helper packages cannot be used. In this case, helpers only works because the current eval is using a custom image. Let's move all the logic inside the component. Signed-off-by: Sébastien Han <seb@redhat.com>

Signed-off-by: Sébastien Han <seb@redhat.com>

MichaelClifford

LGTM!

Thanks @leseb! I've tested this successfully through SDG downloading and training. Going to merge this now. Any other changes we can make in a follow up PR 😃

leseb requested a review from MichaelClifford October 8, 2024 20:42

leseb force-pushed the sdg-to-s3 branch from e04c4e0 to 85e8522 Compare October 9, 2024 07:55

leseb changed the title ~~feat: remove dependency on KFP lib~~ feat: remove dependency on KFP lib + eval serving endpoint support + ci Oct 9, 2024

leseb force-pushed the sdg-to-s3 branch 4 times, most recently from ffc54d0 to 42bcfb4 Compare October 9, 2024 14:01

leseb mentioned this pull request Oct 9, 2024

Endpoint field in SDG S3 Secret is optional #78

Merged

astefanutti mentioned this pull request Oct 9, 2024

Remove training PVC from SDG Job #79

Closed

sallyom reviewed Oct 9, 2024

View reviewed changes

leseb added 4 commits October 9, 2024 23:17

feat: remove dependency on KFP lib

195862b

We now use a generic python executor to call the eval and any DSL component functions. They are almost identical, only the signature is updated to not use KFP Classes. Signed-off-by: Sébastien Han <seb@redhat.com>

fix: clarify the structure of the initial tarball

ca03343

The tarball located on S3 must contains both SDG data and the model to train. Signed-off-by: Sébastien Han <seb@redhat.com>

leseb force-pushed the sdg-to-s3 branch 2 times, most recently from 3612fe7 to a645a1e Compare October 9, 2024 21:39

MichaelClifford reviewed Oct 9, 2024

View reviewed changes

standalone/standalone.tpl Outdated Show resolved Hide resolved

leseb force-pushed the sdg-to-s3 branch 2 times, most recently from d5de14c to 3e5d1fc Compare October 10, 2024 07:01

astefanutti reviewed Oct 10, 2024

View reviewed changes

leseb force-pushed the sdg-to-s3 branch 4 times, most recently from 36ba1f9 to 8a1f082 Compare October 10, 2024 08:07

leseb force-pushed the sdg-to-s3 branch 2 times, most recently from a257f06 to 41990c7 Compare October 10, 2024 10:20

leseb force-pushed the sdg-to-s3 branch from 41990c7 to 2eb8e1c Compare October 10, 2024 10:24

misc: use the correct images

8a3cef9

Signed-off-by: Sébastien Han <seb@redhat.com>

MichaelClifford approved these changes Oct 10, 2024

View reviewed changes

MichaelClifford merged commit 5c6483d into redhat-et:main Oct 10, 2024
1 check passed

leseb deleted the sdg-to-s3 branch October 10, 2024 11:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: remove dependency on KFP lib + eval serving endpoint support + ci #76

feat: remove dependency on KFP lib + eval serving endpoint support + ci #76

leseb commented Oct 8, 2024 •

edited

Loading

sallyom Oct 9, 2024 •

edited

Loading

leseb Oct 10, 2024

leseb Oct 10, 2024

astefanutti Oct 10, 2024

leseb Oct 10, 2024

MichaelClifford left a comment

feat: remove dependency on KFP lib + eval serving endpoint support + ci #76

feat: remove dependency on KFP lib + eval serving endpoint support + ci #76

Conversation

leseb commented Oct 8, 2024 • edited Loading

sallyom Oct 9, 2024 • edited Loading

Choose a reason for hiding this comment

leseb Oct 10, 2024

Choose a reason for hiding this comment

leseb Oct 10, 2024

Choose a reason for hiding this comment

astefanutti Oct 10, 2024

Choose a reason for hiding this comment

leseb Oct 10, 2024

Choose a reason for hiding this comment

MichaelClifford left a comment

Choose a reason for hiding this comment

leseb commented Oct 8, 2024 •

edited

Loading

sallyom Oct 9, 2024 •

edited

Loading