
Use GitHub throttling plugin for @octokit/rest #3983

Open
andrewdibiasio6 opened this issue Jul 11, 2024 · 7 comments
Labels
enhancement, question

Comments


andrewdibiasio6 commented Jul 11, 2024

GitHub limits the number of REST API requests that you can make within a specific amount of time.

We authorize a GitHub App or OAuth app, which can then make API requests on our behalf. All of these requests count towards a personal rate limit of 5,000 requests per hour.

In addition to primary rate limits, GitHub enforces secondary rate limits in order to prevent abuse and keep the API available for all users.

We may encounter a secondary rate limit if we:

- Make too many concurrent requests. No more than 100 concurrent requests are allowed.
- Make too many requests to a single endpoint per minute. No more than 900 [points](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#calculating-points-for-the-secondary-rate-limit) per minute are allowed for REST API endpoints.
- Use too much compute. No more than 90 seconds of CPU time per 60 seconds of real time is allowed.
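
For reference, the remaining primary quota can be inspected at any time via the rate limit endpoint (which does not itself count against the limit). A minimal sketch, assuming @octokit/rest and a token in an environment variable; the token source is illustrative, not how the lambdas actually authenticate:

```ts
import { Octokit } from "@octokit/rest";

// Illustrative token source; the lambdas resolve their own app/installation tokens.
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function logRateLimit(): Promise<void> {
  // GET /rate_limit does not count against the primary rate limit.
  const { data } = await octokit.rest.rateLimit.get();
  const core = data.resources.core;
  const resetsAt = new Date(core.reset * 1000).toISOString();
  console.log(`core: ${core.remaining}/${core.limit} remaining, resets at ${resetsAt}`);
}

logRateLimit().catch(console.error);
```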

We are seeing many errors like:

{"level":"ERROR","message":"Request failed with status code 403","service":"runners-pool","timestamp":"2024-07-11T14:10:21.576Z",
{"level":"WARN","message":"Ignoring error: Request failed with status code 503","service":"runners-scale-up","timestamp":"2024-07-22T18:32:03.979Z","xray_trace_id":"1-669ea597-fcd409efe9e843d7a70dc3d6" ... }

I suggest we add the throttling plugin recommended by the Octokit docs to help with this issue, or adopt some other suggestion here.
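
A minimal sketch of what wiring up [@octokit/plugin-throttling](https://github.com/octokit/plugin-throttling.js) could look like; the handler names come from the plugin's documentation, while the token source and the retry-once policy are just assumptions for illustration:

```ts
import { Octokit } from "@octokit/rest";
import { throttling } from "@octokit/plugin-throttling";

const ThrottledOctokit = Octokit.plugin(throttling);

const octokit = new ThrottledOctokit({
  auth: process.env.GITHUB_TOKEN, // illustrative; not how the lambdas actually authenticate
  throttle: {
    onRateLimit: (retryAfter, options, client, retryCount) => {
      client.log.warn(`Primary rate limit hit for ${options.method} ${options.url}`);
      // Retry once after the delay suggested by GitHub, then give up.
      return retryCount < 1;
    },
    onSecondaryRateLimit: (retryAfter, options, client) => {
      client.log.warn(
        `Secondary rate limit hit for ${options.method} ${options.url}, backing off ${retryAfter}s`
      );
      // Returning true makes the plugin wait `retryAfter` seconds and retry.
      return true;
    },
  },
});
```

With something like this in place, the 403s above would be retried after the server-suggested delay instead of failing immediately.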

npalm (Member) commented Jul 12, 2024

Would be a good addition, but it will not solve the rate limit problem. May I ask what size of org / deployments you have?

andrewdibiasio6 (Author) commented Jul 23, 2024

> But it will not solve the rate limit problem.

@npalm I was able to solve most of the rate limiting problems, which were occurring almost every hour, by varying the scheduled lambda event. The pool docs suggest the following: schedule_expression = "cron(* * * * ? *)". In my opinion, this is not a great suggestion: it will almost certainly result in rate limiting if you have more than a couple of runner configurations, because GitHub will throttle you due to concurrent requests. To resolve this, I first staggered the schedule expressions across runners like so:

schedule_expression: cron(0/2 * * * ? *)
schedule_expression: cron(1/2 * * * ? *)

This reduces the overall number of concurrent requests to GitHub and resolves most of the throttling issues. I think this should be added to the docs.

> May I ask what size of org / deployments you have?

We have one deployment with 21 runners.

Because of the resolution above, we decided to remove pools altogether as they are expensive. Once we removed pools, we noticed that every so often a job is never allocated a runner. When looking into the logs, I see an error around the same time:

{"level":"WARN","message":"Ignoring error: Request failed with status code 503","service":"runners-scale-up","timestamp":"2024-07-22T18:32:03.979Z","xray_trace_id":"1-669ea597-fcd409efe9e843d7a70dc3d6" ... }

The job then hangs forever. I believe this happens because of the size of some of our workflows' matrix jobs; the workflow launches around 25 jobs in parallel. So far, we have only noticed this error for this specific workflow.

The workaround for us is to ensure there is at least one runner available at all times, so we have to add a pool of size 1 to all our runners. Obviously this isn't ideal. I am not sure I have any more control over how philips will process my requests; turning down scale_up_reserved_concurrent_executions: 5 is very slow. In my opinion, the GitHub client should wait and retry these errors a few times before giving up. Thoughts?
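
For what it's worth, Octokit also ships [@octokit/plugin-retry](https://github.com/octokit/plugin-retry.js), which does roughly this for transient failures like the 503 above. A minimal sketch, with the token source again just an assumption:

```ts
import { Octokit } from "@octokit/rest";
import { retry } from "@octokit/plugin-retry";

const RetryingOctokit = Octokit.plugin(retry);

// With the defaults, failed requests (e.g. a 503 from scale-up) are retried a few
// times with a delay, while 4xx statuses that are not worth retrying are skipped.
const octokit = new RetryingOctokit({
  auth: process.env.GITHUB_TOKEN, // illustrative token source
});
```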

Also, see the updated overview of this issue. You can see the original 403 error was from the pool lambda, since resolved, but the new 503 is from the scale-up lambda, which makes sense since that is the one getting all the parallel requests from my job matrix.

stuartp44 added the enhancement and question labels Aug 6, 2024
andrewdibiasio6 (Author) commented

@npalm According to the GitHub rate limiting docs linked in the error message, if we keep retrying requests we will be limited further, or our app will be banned. We are still seeing this issue, and when it happens we are limited for multiple hours and can't request any runners.

kgoralski commented Sep 20, 2024

Can maxReceiveCount: 100 contribute to a higher number of requests to GitHub?
If yes, then having many fleet types and a high maxReceiveCount could probably exceed the limit easily?

npalm (Member) commented Oct 3, 2024

@andrewdibiasio6 the module now supports a job retry mechanism, which will solve the problem for some hanging jobs.

andrewdibiasio6 (Author) commented

@npalm Yes, this would solve the issue for some hanging jobs, but the 900-second upper bound for retries isn't going to help. When throttled by GitHub, you're usually throttled for 1 hour, which means no amount of retries will help. If anything, retrying more will likely get you throttled more, as GitHub's suggestion is to back off for a suggested amount of time before retrying, hence the suggestion to use the Octokit client.
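
To illustrate the 'back off for the suggested amount of time' part: the rate limit responses carry the hint in their headers, which is what the throttling plugin reads for you. A hand-rolled sketch of the same idea, assuming the failure surfaces as an Octokit RequestError:

```ts
import { RequestError } from "@octokit/request-error";

// How many seconds GitHub suggests waiting before retrying a rate-limited request,
// derived from the response headers; undefined if the error is not a rate limit.
function suggestedBackoffSeconds(error: unknown): number | undefined {
  if (!(error instanceof RequestError) || !error.response) return undefined;
  const headers = error.response.headers;

  // Secondary rate limits usually send retry-after (in seconds).
  if (headers["retry-after"] !== undefined) return Number(headers["retry-after"]);

  // Primary rate limits expose the reset time as a unix timestamp.
  if (String(headers["x-ratelimit-remaining"]) === "0" && headers["x-ratelimit-reset"]) {
    return Math.max(0, Number(headers["x-ratelimit-reset"]) - Math.floor(Date.now() / 1000));
  }
  return undefined;
}
```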

npalm (Member) commented Oct 9, 2024

The intent of the retry is mostly to cover messages that are missed or crossed and runners not scaling properly. Indeed, 900 is the max for SQS. Ideas or help are very welcome to make the runners more resilient. But the tough part is that querying GitHub to find jobs will only add to the rate limit. Also, GitHub does not have an API to ask for the depth of the queues.
