
Detect offline clusters #2933

Draft
weyfonk wants to merge 13 commits into main from 594-detect-offline-cluster
Conversation

Contributor

@weyfonk weyfonk commented Oct 7, 2024

This adds a cluster status monitor to the Fleet controller, which checks when each cluster's agent was last seen online. If more than the expected interval has elapsed, the cluster is considered offline, and the monitor updates the statuses of its bundle deployments to reflect that. This in turn triggers status updates to bundles, GitRepos, clusters and cluster groups.

Refers to #594.

Open points:

  • how far, and how fine-grained, do we want bundle deployment status updates for offline clusters to be? The current approach is fairly basic: it updates both the Ready and Monitored conditions while clearing modified and non-ready statuses, to prevent outdated messages from appearing in a bundle deployment's display status and further up the chain of status updates (to bundles, then upwards to GitRepos, etc.)
  • should we make the frequency of monitoring configurable?
  • do we have a way to exclude the local/management cluster from agent-last-seen checks?

@weyfonk weyfonk requested a review from a team as a code owner October 7, 2024 10:15
Contributor

@0xavi0 0xavi0 left a comment


I don't know whether you want to merge this PR while it still contains comments with open questions; it may be better to clarify those first.

Just a couple of observations, but overall LGTM

Review threads on internal/cmd/controller/clustermonitor/monitor.go and internal/cmd/controller/clustermonitor/monitor_test.go (outdated, resolved)
@weyfonk weyfonk force-pushed the 594-detect-offline-cluster branch 2 times, most recently from 38a697b to 452e56e on October 8, 2024 09:51
@kkaempf kkaempf modified the milestones: v2.10.0, v2.9.4 Oct 8, 2024
@weyfonk weyfonk force-pushed the 594-detect-offline-cluster branch 2 times, most recently from fd85993 to a1c1b94 on October 18, 2024 11:26
This allows the Fleet controller to detect offline clusters and update
statuses of bundle deployments targeting offline clusters.

Next to do:
* understand how bundle deployment status updates should be propagated
  to bundles (which currently simply appear as `Modified`) and further
  up
* write tests (e.g. integration tests, updating cluster status by hand?)
* set sensible defaults (e.g. a higher monitoring interval, and a threshold
  higher than the agent's cluster status reporting interval)
This enables that state to be reflected upwards in bundle, GitRepo,
cluster and cluster group statuses.
This also provides unit tests for detecting offline clusters.
The frequency of cluster status checks is currently hard-coded to 1
minute, but could be made configurable.

The threshold for considering a cluster offline now explicitly depends
on how often agents report their statuses to the management cluster.
Changes to that configured interval should impact the cluster status
monitor, which would take the new value into account from its next run
onwards.
This fixes an error and ignores a few others to make the linter happy.
This adds checks ensuring that for offline clusters, where calls to update bundle deployment statuses are expected, those statuses contain `Ready` and `Monitored` conditions with status `False` and reasons reflecting the cluster's offline status.
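The conditions described in this commit can be sketched with a simplified stand-in for Fleet's condition type; the type, field names, reason, and message below are illustrative assumptions, not the project's actual values:

```go
package main

import "fmt"

// Condition is a simplified stand-in for a status condition as used on
// bundle deployments; Fleet's real type differs.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// offlineConditions returns the Ready and Monitored conditions the monitor
// would set on a bundle deployment targeting an offline cluster.
// Reason and message strings here are hypothetical.
func offlineConditions(clusterName string) []Condition {
	msg := fmt.Sprintf("cluster %s is offline", clusterName)
	return []Condition{
		{Type: "Ready", Status: "False", Reason: "Offline", Message: msg},
		{Type: "Monitored", Status: "False", Reason: "Offline", Message: msg},
	}
}

func main() {
	for _, c := range offlineConditions("downstream-1") {
		fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```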
This ensures that creating a new agent bundle fails with an agent
check-in interval set to 0.
This adds a check on the agent check-in interval to cluster import, for
consistency with agent bundle updates.
This enables users to configure how often the Fleet controller checks for offline clusters, and the threshold beyond which a cluster is considered offline. If the configured threshold is below three times the check-in interval, that tripled value is used instead.
This optimises updates to bundle deployments, running them only against clusters whose bundle deployments are not yet marked as offline.
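The filtering step this commit describes can be sketched as below; the slice-and-map representation is an illustrative stand-in for inspecting existing bundle deployment conditions:

```go
package main

import "fmt"

// clustersToUpdate filters out clusters whose bundle deployments are already
// marked offline, so status updates only run where something would change.
// alreadyOffline is a hypothetical stand-in for checking existing conditions.
func clustersToUpdate(all []string, alreadyOffline map[string]bool) []string {
	var out []string
	for _, c := range all {
		if !alreadyOffline[c] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	all := []string{"cluster-a", "cluster-b", "cluster-c"}
	offline := map[string]bool{"cluster-b": true}
	fmt.Println(clustersToUpdate(all, offline))
}
```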
This better reflects what is known at that point about workloads running in such clusters than `False` would.
This should fix Fleet controller deployments complaining about the
interval being 0 when it should never be.
@weyfonk weyfonk marked this pull request as draft October 18, 2024 16:06
Running one cluster status monitor per Fleet controller pod is not
necessary and may cause conflicts in sharded setups.