"on-demand" and "spot" EC2 instances in a single stack/GH-app for multi-runner? #4138

Open
cisco-sbg-mgiassa-ai opened this issue Sep 18, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

cisco-sbg-mgiassa-ai commented Sep 18, 2024

Good day,

Would it be realistically possible to set up an instance of multi-runner and have the spot-versus-on-demand EC2 setting be per runner type, rather than a global setting for the entire GitHub App (i.e. stack/module instance)? About 95% of the time spot works great, but there are some CI/CD jobs where getting hit with node eviction can be quite painful (especially if it happens multiple times per day in a busy region during peak usage hours). It would be extremely helpful to be able to just set an additional runs-on flag and call it a day.

npalm (Member) commented Oct 2, 2024

Currently you can set instance_target_capacity_type per runner type to use either spot or on-demand, which is not flexible. The module can also create an on-demand instance if the spot request fails. But indeed, there is nothing in place for the case where the job fails.

Options that could be investigated:

  • Move automatically to on-demand if spot failures hit a threshold
  • Allow a dynamic label in runs-on to indicate that a job must always run on on-demand
  • ...?
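To make the two options above concrete, here is a minimal sketch of how a scale-up decision could combine them. Nothing here is the module's code: the `on-demand` label name, the `SpotFailureWindow` class, and `choose_capacity_type` are hypothetical, and the sketch assumes failures are recorded in chronological order. Only the returned values `spot` and `on-demand` mirror the two capacity types discussed in the thread.

```python
from collections import deque
from typing import Deque, Iterable

# Hypothetical label a workflow would add to runs-on; not defined by the module.
ON_DEMAND_LABEL = "on-demand"


class SpotFailureWindow:
    """Counts spot failures seen within the last `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> None:
        # Assumes timestamps arrive in increasing order.
        self._events.append(timestamp)

    def tripped(self, now: float) -> bool:
        # Drop failures that fell out of the window, then compare to the threshold.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold


def choose_capacity_type(
    labels: Iterable[str], window: SpotFailureWindow, now: float
) -> str:
    """Return "on-demand" if the job asks for it or spot looks unhealthy, else "spot"."""
    if ON_DEMAND_LABEL in {label.lower() for label in labels}:
        return "on-demand"
    if window.tripped(now):
        return "on-demand"
    return "spot"
```

A job would opt in with something like `runs-on: [self-hosted, linux, on-demand]`, while all other jobs stay on spot until the failure window trips. Whether the window counts failed spot requests or evictions is left open, as in the discussion.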

@stuartp44 stuartp44 added the enhancement New feature or request label Oct 3, 2024
crohr commented Oct 11, 2024

@cisco-sbg-mgiassa-ai you probably want something like what RunsOn provides, with labels that allow dynamic runner configuration at runtime: https://runs-on.com/configuration/job-labels/#spot

crohr commented Oct 11, 2024

Move automatically to on-demand if spot failures hit a threshold

@npalm do you mean the failures to start spot instances (due to e.g. quota issues), or listening to spot eviction events and avoiding launching in spot mode if too many of them occur?

In the latter case, I've had trouble finding a proper way to get those events in close to real-time. In CloudTrail they are usually delayed by up to 15 minutes, which might be too late.

Another option would be to catch the event from the VM, and ping the control plane when this happens.
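As a rough sketch of the "catch the event from the VM" idea: EC2 exposes a pending spot interruption through the instance metadata service at `spot/instance-action`, which returns HTTP 404 until an interruption is scheduled. The code below uses IMDSv2 and only the standard library. The `notify_control_plane` function and its `url` argument are hypothetical; the thread names no such endpoint.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body: str) -> Optional[dict]:
    """Return {'action': ..., 'time': ...} if the body is an instance-action document."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return None
    if isinstance(doc, dict) and "action" in doc and "time" in doc:
        return {"action": doc["action"], "time": doc["time"]}
    return None


def fetch_instance_action(timeout: float = 2.0) -> Optional[dict]:
    """Query IMDSv2 once; None means no interruption is scheduled (HTTP 404)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def notify_control_plane(url: str, action: dict, timeout: float = 2.0) -> None:
    # `url` is a hypothetical control-plane endpoint, assumed for illustration only.
    req = urllib.request.Request(
        url,
        data=json.dumps(action).encode(),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=timeout).close()
```

On the runner, a small loop would call `fetch_instance_action()` every few seconds and call `notify_control_plane` once it returns a document. This only works on an EC2 instance, since the metadata address is not reachable elsewhere.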

npalm (Member) commented Oct 17, 2024

In one of the latest releases we added a lambda that can log and emit metrics for spot terminations, in addition to the termination warning. The lambda acting on the warning should be near real time.
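For readers unfamiliar with the warning path, here is a minimal sketch of a handler for the EventBridge event that EC2 emits ahead of a spot interruption (source `aws.ec2`, detail type `EC2 Spot Instance Interruption Warning`). This is not the module's lambda; the function names and return shape are assumptions made for illustration.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

WARNING_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"


def interrupted_instance_id(event: Dict[str, Any]) -> Optional[str]:
    """Return the instance id if the event is a spot interruption warning, else None."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != WARNING_DETAIL_TYPE:
        return None
    return event.get("detail", {}).get("instance-id")


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    # Illustrative Lambda entry point: log the warning and report what was seen.
    instance_id = interrupted_instance_id(event)
    if instance_id is None:
        return {"handled": False}
    logger.warning("Spot interruption warning for %s", instance_id)
    return {"handled": True, "instance_id": instance_id}
```

The logged instance id is the natural input for a failure counter or metric, which is where a threshold-based fallback to on-demand could hook in.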
