-
Notifications
You must be signed in to change notification settings - Fork 766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a uid for the kubernetes job creation #1588
Conversation
This is quite serious. Running only 25000 jobs within a week results in a 95% chance of error due to job name collisions. We have run into this problem many times already, and it is super hard to debug because the join-step that fails does not tell you why. Foot notes:
n_possible_job_names = 36**5
n_jobs = 25000
1 - np.prod([(n_possible_job_names - j) / n_possible_job_names for j in range(n_jobs)])
Out[]: 0.9943079470622059 |
Testing[528] @ 22ba73f |
Testing[528] @ 22ba73f had 6 FAILUREs. |
@@ -178,7 +180,7 @@ def create_job( | |||
job = ( | |||
KubernetesClient() | |||
.job( | |||
generate_name="t-", | |||
generate_name="t-{entropy}".format(entropy=secrets.token_hex(2)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create_job
receives secrets
as a parameter, so this won't work. Small change to directly import and use token_hex
should fix it.
from secrets import token_hex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work either if Metaflow still plans on supporting python 3.5.x
Running
mamba create -p /tmp/test 'python==3.5.5'
mamba activate /tmp/test
python -c "import secrets"
Results in module not found
Went back to UUID to maintain backwards compatibility with python 3.5. Using 3 characters since UUID doesn't have as large a character set as secrets |
If I am not mistaken, the character set is the same. Both things are hexadecimal. |
I really really appreciate that all of you are pushing this forward. High five 🙌 Apologies, if you find me annoying, I am really just trying to help and avoid future Github issues. As I say, I appreciate what you are doing. |
I was wondering the same, to be fair. It is still not clear to me why a unix timestamp wouldn't just solve the issue, to be honest, If you don't want to take that approach though, surely using more characters from the UID would be preferable? |
@elgehelge Not finding you annoying at all! Looks like I made 2 missteps when pushing my latest changes.
It was actually my attempt to increase rather than decrease when using 3 characters
I spoke with @savingoyal yesterday and now understand that changes to the codebase need to be carefully considered in order to maintain the stability of the project, particularly because Metaflow interacts with so many external systems (AWS Batch, Kubernetes, local machines, etc.) An example of a thorny problem when it comes to UID length is an active issue with Argo Workflows: #1521. A timestamp would bump us up to 10 characters, and it would limit our character set to numbers (10 possibilities vs 16 per character.) Looks like @saikonen bumped the UID up to 8 characters yesterday, so we should be good to go with entropy |
Fixes #1554
When running large for-each branches (3000+) Metaflow will sometimes cause job name collisions on kubernetes. This adds a UUID to the prefix which should add sufficient entropy to prevent this from happening.