GPU constraint of 0 added to Kubernetes step spec, making it unsatisfiable #2005

dmcguire81 · 2024-08-30T06:09:01Z

Here's a short example flow to illustrate the issue:

from metaflow import FlowSpec, step, resources


class NonGPUFlow(FlowSpec):
    @step
    def start(self):
        print("Starting flow")
        self.next(self.end)

    # this step hangs indefinitely without any feedback,
    # and no indication of error such as PodUnschedulable,
    # because nothing can statisfy the constraint gpu=0
    @resources(cpu=2, memory=2048)
    @step
    def end(self):
        print("Flow completed")


if __name__ == "__main__":
    NonGPUFlow()

When this is run on Kubernetes, as follows, the end step can never be allocated resources:

$ python NonGPUFlow.py run --with kubernetes
Metaflow 2.12.10+netflix-ext(1.2.1) executing NonGPUFlow for user:davidmcguire
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-08-29 22:35:30.205 Workflow starting (run-id 49):
2024-08-29 22:35:34.857 [49/start/240 (pid 6056)] Task is starting.
2024-08-29 22:35:37.488 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task is starting (Pod is pending)...
2024-08-29 22:37:06.836 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Setting up task environment.
2024-08-29 22:37:32.401 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Downloading code package...
2024-08-29 22:37:33.310 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Code package downloaded.
2024-08-29 22:37:33.411 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task is starting.
2024-08-29 22:37:35.124 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Starting flow
2024-08-29 22:37:40.147 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task finished with exit code 0.
2024-08-29 22:37:41.789 [49/start/240 (pid 6056)] Task finished successfully.
2024-08-29 22:37:43.602 [49/end/241 (pid 15724)] Task is starting.
2024-08-29 22:37:46.470 [49/end/241 (pid 15724)] [job t-0ed12c7d-9dwb2] Task is starting (Job status is unknown)...

This will never make any progress against GKE, nor will it report a failure mode.

The unsatisfiable constraint is specified in the portion of the Spec for the step that is communicated to Kubernetes, shown below (showing the default value for KUBERNETES_GPU_VENDOR):

        resources:
          limits:
            cloud.google.com/pod-slots: "1"
            nvidia.com/gpu: "0"
          requests:
            cloud.google.com/pod-slots: "1"
            cpu: "2"
            ephemeral-storage: 10240M
            memory: 4096M
            nvidia.com/gpu: "0"

The root cause turns out to be a combination of a default value of 0 in the ResourcesDecorator, and the fact that the KubernetesDecorator only filters for None, which is not the default value.

There is a work-around of explicitly setting gpu=None in the @resources decorator, but forgetting to do this makes for an unpleasant footgun that gives no hints as to the underlying problem. Because the behavior of the KubernetesDecorator can be corrected without considering what the impact on the BatchDecorator would be (which would be the case if the default for gpu in the ResourcesDecorator were changed), this seems like an easy win for usability.

The text was updated successfully, but these errors were encountered:

saikonen · 2024-10-15T17:11:11Z

The issue is relevant and the PR solving this is going in the right direction. I'm currently looking into why the default gpu value for resources is "0" to see if that could be changed to None instead.

One thing that is bothering me with this is that when I tested on EKS, and the 0 gpu resource requests do not make the pod unschedulable. Do you happen to know if resources are treated differently on GKE for whatever reason?

saikonen · 2024-10-15T17:13:57Z

The other thing speaking for None being a default is that we do not actually want to add non-requests into the resources spec, as in the case of requesting gpu's (even zero) this can lead to the pods being scheduled on gpu instances

discussion on that here
https://outerbounds-community.slack.com/archives/C02116BBNTU/p1728569753648779

dmcguire81 · 2024-10-15T18:07:28Z

The issue is relevant and the PR solving this is going in the right direction. I'm currently looking into why the default gpu value for resources is "0" to see if that could be changed to None instead.

One thing that is bothering me with this is that when I tested on EKS, and the 0 gpu resource requests do not make the pod unschedulable. Do you happen to know if resources are treated differently on GKE for whatever reason?

My guess would be that it's not part of whatever (de facto) spec there is for Kubernetes, because it's a non-sensical request, and AWS and GCP happen to have implemented different behaviors.

dmcguire81 · 2024-10-15T18:08:24Z

The other thing speaking for None being a default is that we do not actually want to add non-requests into the resources spec, as in the case of requesting gpu's (even zero) this can lead to the pods being scheduled on gpu instances

discussion on that here https://outerbounds-community.slack.com/archives/C02116BBNTU/p1728569753648779

Concur that's the better fix, but there are more moving parts, and I was less confident I could execute on it safely.

dmcguire81 mentioned this issue Aug 30, 2024

Only use a GPU value for Kubernetes that is non-null and non-zero #2006

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU constraint of 0 added to Kubernetes step spec, making it unsatisfiable #2005

GPU constraint of 0 added to Kubernetes step spec, making it unsatisfiable #2005

dmcguire81 commented Aug 30, 2024 •

edited

Loading

saikonen commented Oct 15, 2024

saikonen commented Oct 15, 2024

dmcguire81 commented Oct 15, 2024

dmcguire81 commented Oct 15, 2024

GPU constraint of 0 added to Kubernetes step spec, making it unsatisfiable #2005

GPU constraint of 0 added to Kubernetes step spec, making it unsatisfiable #2005

Comments

dmcguire81 commented Aug 30, 2024 • edited Loading

saikonen commented Oct 15, 2024

saikonen commented Oct 15, 2024

dmcguire81 commented Oct 15, 2024

dmcguire81 commented Oct 15, 2024

dmcguire81 commented Aug 30, 2024 •

edited

Loading