Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: remove dependency on KFP lib + eval serving endpoint support + ci #76

Merged
merged 7 commits into from
Oct 10, 2024

Conversation

leseb
Copy link
Collaborator

@leseb leseb commented Oct 8, 2024

195862b feat: remove dependency on KFP lib
7d8d314 feat: add serving endpoint details for eval
6852064 ci: add a check to ensure the pipeline is up to date
ca03343 fix: clarify the structure of the initial tarball
2b9d01e bulk commit
2eb8e1c fix: eval, do not use external deps
8a3cef9 misc: use the correct images

commit 195862b
Author: Sébastien Han seb@redhat.com
Date: Tue Oct 8 22:41:10 2024 +0200

feat: remove dependency on KFP lib

We now use a generic python executor to call the eval and any DSL
component functions. They are almost identical, only the signature is
updated to not use KFP Classes.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 7d8d314
Author: Sébastien Han seb@redhat.com
Date: Wed Oct 9 09:54:26 2024 +0200

feat: add serving endpoint details for eval

During the initial MT-Bench eval, we must connect to a serving endpoint.
The script has been updated with new required flags:

* `--eval-serving-endpoint`: Serving endpoint for evaluation. e.g:
  http://serving.kubeflow.svc.cluster.local:8080/v1 - **Required**
* `--eval-serving-model-name`: The name of the model to use for evaluation. **Required**
* `--eval-serving-model-api-key`: The API key for the model to evaluate. `EVAL_SERVING_MODEL_API_KEY`
  environment variable can be used as well. **Required**

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 6852064
Author: Sébastien Han seb@redhat.com
Date: Wed Oct 9 09:48:25 2024 +0200

ci: add a check to ensure the pipeline is up to date

If someone pushes code changes and the pipeline.yaml has not been
updated accordingly, the CI will fail and ask to update it and push
again.

Also, you can now generate a pipeline with `make pipeline`.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit ca03343
Author: Sébastien Han seb@redhat.com
Date: Wed Oct 9 14:40:27 2024 +0200

fix: clarify the structure of the initial tarball

The tarball located on S3 must contains both SDG data and the model to
train.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 2b9d01e
Author: Sébastien Han seb@redhat.com
Date: Wed Oct 9 23:14:44 2024 +0200

bulk commit

Sorry, I went really far with this one but I can confirm that:

* sdg-data-fetch is working
* data processing works
* training phase 1 is working
* training phase 2 is working

Also:

* remove backtick from the code since it breaks the shell that runs the
  python executor
* only use a single PVC for everything: sdg data, model, trained model
* --force-pull: to force pulling from the object store again if the data
  are already present

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 2eb8e1c
Author: Sébastien Han seb@redhat.com
Date: Thu Oct 10 12:02:14 2024 +0200

fix: eval, do not use external deps

The pipeline excepts to have all the functions at its disposal to run.
So helper packages cannot be used.
In this case, helpers only works because the current eval is using a
custom image.
Let's move all the logic inside the component.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit 8a3cef9
Author: Sébastien Han seb@redhat.com
Date: Thu Oct 10 12:49:27 2024 +0200

misc: use the correct images

Signed-off-by: Sébastien Han <seb@redhat.com>

@leseb leseb changed the title feat: remove dependency on KFP lib feat: remove dependency on KFP lib + eval serving endpoint support + ci Oct 9, 2024
@leseb leseb force-pushed the sdg-to-s3 branch 4 times, most recently from ffc54d0 to 42bcfb4 Compare October 9, 2024 14:01
--namespace my-namespace \
--eval-serving-endpoint http://serving.kubeflow.svc.cluster.local:8080/v1 \
--eval-serving-model-name my-model \
--eval-serving-model-api-key ***** \
Copy link
Collaborator

@sallyom sallyom Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

canthis be judge-serving instead of eval-serving? ilab uses the term judge_model and --judge-model (CLI doesn't yet support external endpoint or api_key though)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--judge-serving-model-name
--judge-serving-endpoint
--judge-serving-model-api-key

We now use a generic python executor to call the eval and any DSL
component functions. They are almost identical, only the signature is
updated to not use KFP Classes.

Signed-off-by: Sébastien Han <seb@redhat.com>
During the initial MT-Bench eval, we must connect to a serving endpoint.
The script has been updated with new required flags:

* `--eval-serving-endpoint`: Serving endpoint for evaluation. e.g:
  http://serving.kubeflow.svc.cluster.local:8080/v1 - **Required**
* `--eval-serving-model-name`: The name of the model to use for evaluation. **Required**
* `--eval-serving-model-api-key`: The API key for the model to evaluate. `EVAL_SERVING_MODEL_API_KEY`
  environment variable can be used as well. **Required**

Signed-off-by: Sébastien Han <seb@redhat.com>
If someone pushes code changes and the pipeline.yaml has not been
updated accordingly, the CI will fail and ask to update it and push
again.

Also, you can now generate a pipeline with `make pipeline`.

Signed-off-by: Sébastien Han <seb@redhat.com>
The tarball located on S3 must contains both SDG data and the model to
train.

Signed-off-by: Sébastien Han <seb@redhat.com>
@leseb leseb force-pushed the sdg-to-s3 branch 2 times, most recently from 3612fe7 to a645a1e Compare October 9, 2024 21:39
@leseb leseb force-pushed the sdg-to-s3 branch 2 times, most recently from d5de14c to 3e5d1fc Compare October 10, 2024 07:01
export TRITON_CACHE_DIR=/tmp
export HF_HOME=/tmp
export TRANSFORMERS_CACHE=/tmp
torchrun --nnodes {nnodes} \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't seem fully equivalent to running the run_main_ds.py, e.g. for options validation. Maybe the logic from the run_training in instructlab.training.main_ds could be used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm deferring this to @MichaelClifford who originally implemented this code.

standalone/standalone.py Outdated Show resolved Hide resolved
standalone/standalone.py Outdated Show resolved Hide resolved
@leseb leseb force-pushed the sdg-to-s3 branch 4 times, most recently from 36ba1f9 to 8a1f082 Compare October 10, 2024 08:07
Sorry, I went really far with this one but I can confirm that:

* sdg-data-fetch is working
* data processing works
* training phase 1 is working
* training phase 2 is working

Also:

* remove backtick from the code since it breaks the shell that runs the
  python executor
* only use a single PVC for everything: sdg data, model, trained model
* --force-pull: to force pulling from the object store again if the data
  are already present

Signed-off-by: Sébastien Han <seb@redhat.com>
@leseb leseb force-pushed the sdg-to-s3 branch 2 times, most recently from a257f06 to 41990c7 Compare October 10, 2024 10:20
The pipeline excepts to have all the functions at its disposal to run.
So helper packages cannot be used.
In this case, helpers only works because the current eval is using a
custom image.
Let's move all the logic inside the component.

Signed-off-by: Sébastien Han <seb@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Copy link
Collaborator

@MichaelClifford MichaelClifford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Thanks @leseb! I've tested this successfully through SDG downloading and training. Going to merge this now. Any other changes we can make in a follow up PR 😃

@MichaelClifford MichaelClifford merged commit 5c6483d into redhat-et:main Oct 10, 2024
1 check passed
@leseb leseb deleted the sdg-to-s3 branch October 10, 2024 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants