-
Notifications
You must be signed in to change notification settings - Fork 765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[@parallel][jobsets] bug fixes and refactor #1900
Conversation
626b61e
to
a406b64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: make separate PRs for anything in the core for examples files in parallel_decorator
and runtime.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: add a property to current
for signifying if it's inside a @parallel
step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixme: remove the current.parallel
property from the metaflow_current.py
file and instead dynamically inject the property inside the @parallel
decorator. [Can be done at a later date. At least add a comment in the code. ] [DONE}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: validate this with @Batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: make a PR that has all k8s changes. and another PR which has all the core changes.
run_id=None, | ||
step_name=None, | ||
task_id=None, | ||
flow_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: at some point re-unify the code bases for Jobset and native jobs.
step_cli.replace( | ||
METAFLOW_PARALLEL_STEP_CLI_OPTIONS_TEMPLATE, | ||
"--ubf-context $UBF_CONTEXT --split-index %s --task-id %s" | ||
% (index, _tskid), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: try of .format
instead of .replace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.format
doesnt work as stated earlier for the following reason:
The point in time we are running the .format
, we already have constructed the step_cli with more format strings; For example at this point the step_cli looks like :
python -u fail_flow.py --quiet --metadata service --environment local --datastore s3 --event-logger nullSidecarLogger --monitor nullSidecarMonitor --datastore-root s3://mybcket/metaflow/ --package-suffixes .py,.R,.RDS --pylint step train --run-id 5075 --input-paths ${METAFLOW_INPUT_PATHS_0} --retry-count 0 --max-user-code-retries 2 --namespace valay {METAFLOW_PARALLEL_STEP_CLI_OPTIONS_TEMPLATE}
At this point the step_cli already contains the ${METAFLOW_INPUT_PATHS_0}
which is a shell variable that will be replaced at runtime. At this point if we use .format
then python will crash complaining that we haven't given any value for format string METAFLOW_INPUT_PATHS_0
. So we use replace instead of format.
c57d925
to
0ff7f9a
Compare
- fix tests for interplay of @Secrets and @parallel - Local runtime will allow secrets only on worker tasks - Secrets will be set on all kinds of tasks when run remotely (control/worker) - fix tests for ubf based on new changes to core. - fix tests for tag-catch for ubf based on new changes to core. - internal ubf decorator has a internal_task_type set to it. - [feedback] register metadata in parallel decorator - [feedback]@parallel inject in current: - move `current.parallel` from `metaflow_current` to `parallel_decorator` - [feedback] appropariately setting task-metadata for parallel stuff - The 'world size' metadata will be set in the @parallel decorator. - The 'node-index' metadata, however, varies depending on the type of computing environment executing the task so it will be set in the appropriate compute decorators. - One reason to specify 'node-index' within compute decorators is that the parallel implementation in the compute decorator might not directly set the required environment variables (`MF_PARALLEL_*`). Instead, these values may be established during the `task_pre_step` phase of the compute decorator using other environment variables set during the implementation. - adding some aws batch changes - [feedback] safety check for _parallel_buf_iter in task_pre_step for @parallel - [feedback] set `is_parallel` in current to denote a step is running under an `@parallel` decorator.
- Some changes stemming from Netflix#1854 - reorder imports - changes to env var prefixes + id generation - Abstraction over Jobset Spec - Create general abstraction and copy style of how we create jobspec for native k8s jobs. - simplified the implementation. - similar pattern/trend to follow for argo - remove older implementation - @parallel metadata from k8s decorator - added attempt to MetaDatum tags - We didn't have this earlier
0ff7f9a
to
896474f
Compare
Review commits individually to see core changes.