GPU constraint of 0 added to Kubernetes step spec, making it unsatisfiable #2005
Comments
The issue is relevant and the PR solving this is going in the right direction. I'm currently looking into why the default gpu value for resources is 0. One thing that is bothering me is that when I tested on EKS, the 0 gpu resource requests did not make the pod unschedulable. Do you happen to know if these resources are treated differently on GKE for whatever reason?
The other thing speaking for that is the discussion on it here.
My guess would be that it's not part of whatever (de facto) spec there is for Kubernetes, because it's a nonsensical request, and AWS and GCP happen to have implemented different behaviors.
Concur that's the better fix, but there are more moving parts, and I was less confident I could execute on it safely.
Here's a short example flow to illustrate the issue:
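The original example flow was not preserved in this copy of the issue. A minimal sketch of a flow that reproduces the problem might look like the following (the class and step names are hypothetical; the key point is that `@resources` is used without an explicit `gpu` argument, so the default applies):

```python
from metaflow import FlowSpec, resources, step


class GpuDefaultFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    # gpu is not set here, so it silently defaults to 0,
    # which ends up as an extended-resource request of 0 on Kubernetes.
    @resources(cpu=1, memory=4096)
    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    GpuDefaultFlow()
```

Run on Kubernetes with something like `python gpu_default_flow.py run --with kubernetes`.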
When this is run on Kubernetes, the `end` step can never be allocated resources. The run will never make any progress against GKE, nor will it report a failure mode.
The unsatisfiable constraint is specified in the portion of the step's Spec that is communicated to Kubernetes, shown below (with the default value of `KUBERNETES_GPU_VENDOR`):
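The exact Spec fragment from the issue was not preserved here; the relevant portion presumably looks something like this reconstruction, assuming the default GPU vendor of `nvidia`:

```yaml
# Illustrative container resources fragment: with gpu defaulting to 0,
# a zero-valued extended-resource request is emitted anyway.
resources:
  limits:
    nvidia.com/gpu: "0"
  requests:
    nvidia.com/gpu: "0"
```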
The root cause turns out to be a combination of a default value of `0` in the `ResourcesDecorator` and the fact that the `KubernetesDecorator` only filters out `None`, which is not the default value. There is a workaround of explicitly setting `gpu=None` in the `@resources` decorator, but forgetting to do this makes for an unpleasant footgun that gives no hints as to the underlying problem. Because the behavior of the `KubernetesDecorator` can be corrected without considering what the impact on the `BatchDecorator` would be (which would be the case if the default for `gpu` in the `ResourcesDecorator` were changed), this seems like an easy win for usability.
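The mismatch between the two decorators can be sketched in plain Python (the dictionary and filters below are illustrative, not Metaflow's actual internals):

```python
# ResourcesDecorator-style defaults: gpu defaults to 0, not None.
resources = {"cpu": 1, "memory": 4096, "gpu": 0}

# A KubernetesDecorator-style filter that only drops None keeps gpu=0,
# so a zero-valued nvidia.com/gpu request is emitted downstream.
kept = {k: v for k, v in resources.items() if v is not None}
print("gpu" in kept)   # True: the zero request leaks through

# Also treating 0 as "not requested" avoids emitting the constraint.
fixed = {k: v for k, v in resources.items() if v not in (None, 0)}
print("gpu" in fixed)  # False
```

This is why filtering only for `None` misses the default, while the workaround of `gpu=None` happens to hit the filter.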