Timeouts

Many load balancers have 60s configured as the default timeout. Our API timeouts are designed to work within these bounds.

Preface: How traffic reaches Rivet

Client -> Cloudflare -> NLB -> Traefik -> api-monolith

Infra timeouts

These are the timeouts that our API servers are restricted to (a client-configuration sketch that stays within them follows this list):

  • Cloudflare: 100s (source)
    • Behavior: Returns a 524
    • Cannot be configured unless paying for Cloudflare Enterprise
  • AWS NAT Gateway: 350s idle (without TCP keepalive) (source)
    • Behavior: Connection drop
  • AWS NLB: 350s (source)
    • Behavior: Connection drop
  • Traefik: 60s, 120s (source)
    • Behavior: Unknown
    • Unlike the other timeouts, this is configurable by us
    • 60s timeout for active requests to finish before Traefik stops
    • 120s timeout for reading the request body and writing the response
  • ATS (through Traefik): 15s (source)
    • Behavior: Unknown
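
For illustration only, a client tuned to stay inside these limits might look like the following sketch. It uses the reqwest crate purely as an example; the crate choice and the specific values are assumptions, not what Rivet actually ships:

```rust
use std::time::Duration;

// Hypothetical outbound HTTP client tuned to the limits listed above.
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Total request time stays under Traefik's 60s and Cloudflare's 100s.
        .timeout(Duration::from_secs(55))
        // Keepalive probes fire well under the 350s NAT Gateway / NLB idle drop.
        .tcp_keepalive(Duration::from_secs(60))
        // Pooled connections are recycled before any hop can silently drop them.
        .pool_idle_timeout(Duration::from_secs(4 * 60))
        .build()
}
```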

Rivet API Timeouts

We use long polling (i.e. watch_index) to implement real-time functionality. This means we need to be careful to stay within the timeouts above.

Current timeouts:

  • api-helper: 50s (source)
    • Behavior: Returns API_REQUEST_TIMEOUT
    • Motivation: This leaves a 10s budget under any other 60s timeout
  • select_with_timeout!: 40s (source)
    • Behavior: Timeout handled by the API endpoint, usually returns a 200
    • Motivation: This leaves a 10s budget for any requests before/after the select statement (see the sketch after this list)
  • tail! and tail_all!: 40s (depending on TailAllConfig) (source)
    • Behavior: Timeout handled by the API endpoint, usually returns a 200
    • Motivation: This leaves a 10s budget for any requests before/after the select statement
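
Roughly, these budgets nest as follows. This is a simplified sketch of the long-polling pattern using tokio and a watch channel; watch_handler and the channel wiring are hypothetical stand-ins, not Rivet's actual watch_index or select_with_timeout! implementation:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical long-poll handler: waits up to 40s for a change notification,
// then returns the current state. Hitting the 40s limit is not an error; the
// endpoint still responds 200 and the client simply polls again.
async fn watch_handler(
    mut change_rx: tokio::sync::watch::Receiver<u64>,
) -> Result<u64, &'static str> {
    match timeout(Duration::from_secs(40), change_rx.changed()).await {
        Ok(Ok(())) => {}    // the watched value changed
        Ok(Err(_)) => return Err("watch channel closed"),
        Err(_elapsed) => {} // 40s passed with no change; still a normal response
    }
    Ok(*change_rx.borrow())
}

#[tokio::main]
async fn main() {
    let (tx, rx) = tokio::sync::watch::channel(0u64);

    // Simulate a state change arriving while the client is long polling.
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(1)).await;
        let _ = tx.send(42);
    });

    // Outer guard mirroring the 50s api-helper timeout: it fires before any of
    // the 60s+ infrastructure timeouts and maps to API_REQUEST_TIMEOUT.
    match timeout(Duration::from_secs(50), watch_handler(rx)).await {
        Ok(Ok(value)) => println!("200 OK, value = {value}"),
        Ok(Err(err)) => println!("500, {err}"),
        Err(_) => println!("API_REQUEST_TIMEOUT"),
    }
}
```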

Database connections

CockroachDB

  • idle_timeout is set to 3 minutes, which is less than the NAT Gateway timeout
  • test_before_acquire is left as true in order to ensure we don't run into timeouts, even though this adds significant overhead (see the sketch below)
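
A minimal sketch of this pool configuration, assuming sqlx's Postgres driver (CockroachDB speaks the Postgres wire protocol); the connection string is a placeholder:

```rust
use std::time::Duration;
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        // Close connections idle for 3 minutes, under the 350s NAT Gateway / NLB idle drop.
        .idle_timeout(Duration::from_secs(3 * 60))
        // Check each connection before handing it out, trading extra round trips
        // for never acquiring a connection the network has already dropped.
        .test_before_acquire(true)
        .connect("postgres://root@localhost:26257/defaultdb")
        .await?;

    let (one,): (i64,) = sqlx::query_as("SELECT 1").fetch_one(&pool).await?;
    assert_eq!(one, 1);
    Ok(())
}
```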

Redis

  • We ping the database manually every 15 seconds (sketched below)
  • Backoff retries are set to infinity in order to ensure that ConnectionManager always returns to a valid state, no matter the connection issues
    • Otherwise, the default internal logic causes the Redis connection to fail after 6 automatic disconnects, which would cause the cluster to fail if idle for too long
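
A rough sketch of the 15-second manual ping, assuming the redis crate's async ConnectionManager (the address is a placeholder, and the infinite-retry backoff configuration is omitted because its exact API differs between crate versions):

```rust
use std::time::Duration;
use redis::aio::ConnectionManager;

#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1:6379/")?;
    // ConnectionManager transparently reconnects after failures.
    let manager = ConnectionManager::new(client).await?;

    // Manual keepalive: PING every 15 seconds so the connection never sits
    // idle long enough for an intermediate hop to silently drop it.
    let mut ping_conn = manager.clone();
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(15));
        loop {
            interval.tick().await;
            let pong: redis::RedisResult<String> =
                redis::cmd("PING").query_async(&mut ping_conn).await;
            if let Err(err) = pong {
                eprintln!("redis ping failed: {err}");
            }
        }
    });

    // ...application traffic uses clones of `manager`...
    Ok(())
}
```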

Misc Resources