Timeouts

Many load balancers have 60s configured as the default timeout. Our API timeouts are designed to work within these bounds.

Preface: How traffic reaches Rivet

Client -> Cloudflare -> NLB -> Traefik -> api-monolith

Infra timeouts

These are the timeouts that our API servers are restricted to (a client-configuration sketch that stays within them follows this list):

  • Cloudflare: 100s (source)
    • Behavior: Returns a 524
    • Cannot be configured unless paying for Cloudflare Enterprise
  • AWS NAT Gateway: 350s idle (without TCP keepalive) (source)
    • Behavior: Connection drop
  • AWS NLB: 350s (source)
    • Behavior: Connection drop
  • Traefik: 60s, 120s (source)
    • Behavior: Unknown
    • Unlike the other timeouts, this is configurable by us
    • 60s timeout for active requests to finish before Traefik stops
    • 120s timeout for reading the request body and writing the response
  • ATS (through Traefik): 15s (source)
    • Behavior: Unknown
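
For illustration only, a client tuned to stay inside these limits might look like the following sketch. It uses the reqwest crate purely as an example; the crate choice and the specific values are assumptions, not what Rivet actually ships:

```rust
use std::time::Duration;

// Hypothetical outbound HTTP client tuned to the limits listed above.
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Total request time stays under Traefik's 60s and Cloudflare's 100s.
        .timeout(Duration::from_secs(55))
        // Keepalive probes fire well under the 350s NAT Gateway / NLB idle drop.
        .tcp_keepalive(Duration::from_secs(60))
        // Pooled connections are recycled before any hop can silently drop them.
        .pool_idle_timeout(Duration::from_secs(4 * 60))
        .build()
}
```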

Rivet API Timeouts

We use long polling (i.e. watch_index) to implement real-time functionality. This means we need to be careful to stay within the timeouts above.

Current timeouts:

  • api-helper: 50s (source)
    • Behavior: Returns API_REQUEST_TIMEOUT
    • Motivation: This leaves a 10s budget under any other 60s timeout
  • select_with_timeout!: 40s (source)
    • Behavior: Timeout handled by the API endpoint, usually returns a 200
    • Motivation: This leaves a 10s budget for any requests before/after the select statement (see the sketch after this list)
  • tail! and tail_all!: 40s (depending on TailAllConfig) (source)
    • Behavior: Timeout handled by the API endpoint, usually returns a 200
    • Motivation: This leaves a 10s budget for any requests before/after the select statement
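
Roughly, these budgets nest as follows. This is a simplified sketch of the long-polling pattern using tokio and a watch channel; watch_handler and the channel wiring are hypothetical stand-ins, not Rivet's actual watch_index or select_with_timeout! implementation:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical long-poll handler: waits up to 40s for a change notification,
// then returns the current state. Hitting the 40s limit is not an error; the
// endpoint still responds 200 and the client simply polls again.
async fn watch_handler(
    mut change_rx: tokio::sync::watch::Receiver<u64>,
) -> Result<u64, &'static str> {
    match timeout(Duration::from_secs(40), change_rx.changed()).await {
        Ok(Ok(())) => {}    // the watched value changed
        Ok(Err(_)) => return Err("watch channel closed"),
        Err(_elapsed) => {} // 40s passed with no change; still a normal response
    }
    Ok(*change_rx.borrow())
}

#[tokio::main]
async fn main() {
    let (tx, rx) = tokio::sync::watch::channel(0u64);

    // Simulate a state change arriving while the client is long polling.
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(1)).await;
        let _ = tx.send(42);
    });

    // Outer guard mirroring the 50s api-helper timeout: it fires before any of
    // the 60s+ infrastructure timeouts and maps to API_REQUEST_TIMEOUT.
    match timeout(Duration::from_secs(50), watch_handler(rx)).await {
        Ok(Ok(value)) => println!("200 OK, value = {value}"),
        Ok(Err(err)) => println!("500, {err}"),
        Err(_) => println!("API_REQUEST_TIMEOUT"),
    }
}
```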

Database connections

CockroachDB

  • idle_timeout is set to 3 minutes, which is less than the NAT Gateway timeout
  • test_before_acquire is left as true in order to ensure we don't run into timeouts, even though this adds significant overhead (see the sketch below)
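
A minimal sketch of this pool configuration, assuming sqlx's Postgres driver (CockroachDB speaks the Postgres wire protocol); the connection string is a placeholder:

```rust
use std::time::Duration;
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        // Close connections idle for 3 minutes, under the 350s NAT Gateway / NLB idle drop.
        .idle_timeout(Duration::from_secs(3 * 60))
        // Check each connection before handing it out, trading extra round trips
        // for never acquiring a connection the network has already dropped.
        .test_before_acquire(true)
        .connect("postgres://root@localhost:26257/defaultdb")
        .await?;

    let (one,): (i64,) = sqlx::query_as("SELECT 1").fetch_one(&pool).await?;
    assert_eq!(one, 1);
    Ok(())
}
```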

Redis

  • We ping the database manually every 15 seconds (sketched below)
  • Backoff retries are set to infinity in order to ensure that ConnectionManager always returns to a valid state, no matter the connection issues
    • Otherwise, the default internal logic causes the Redis connection to fail after 6 automatic disconnects, which would cause the cluster to fail if idle for too long
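
A rough sketch of the 15-second manual ping, assuming the redis crate's async ConnectionManager (the address is a placeholder, and the infinite-retry backoff configuration is omitted because its exact API differs between crate versions):

```rust
use std::time::Duration;
use redis::aio::ConnectionManager;

#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1:6379/")?;
    // ConnectionManager transparently reconnects after failures.
    let manager = ConnectionManager::new(client).await?;

    // Manual keepalive: PING every 15 seconds so the connection never sits
    // idle long enough for an intermediate hop to silently drop it.
    let mut ping_conn = manager.clone();
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(15));
        loop {
            interval.tick().await;
            let pong: redis::RedisResult<String> =
                redis::cmd("PING").query_async(&mut ping_conn).await;
            if let Err(err) = pong {
                eprintln!("redis ping failed: {err}");
            }
        }
    });

    // ...application traffic uses clones of `manager`...
    Ok(())
}
```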

Misc Resources