fix: Ignore pod patching failures during startup #397

kppullin · 2024-08-20T16:16:53Z

fix: Ignore pod patching failures during startup

Note: this is a copy of #392, which auto-closed when I renamed my branch in an attempt to follow conventionalcommits.org

Summary

Patch failures in pod_webhook.Start() abort the controller process and may result in crash looping.

We've experienced this on a test cluster, where we have 8 pods out of ~1100 that fail return an error on the "no op" patch attempt in Start().

As I believe the code within Start() is an optimization to eagerly create CachedImages, but otherwise not required to function properly, this patch will instead log the err and continue on to the next pod.

Patch Failure Details

Here's a sample of the patch failure:

The Pod "xxx-yyy-zzz" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`,`spec.initContainers[*].image`,`spec.activeDeadlineSeconds`,`spec.tolerations` (only additions to existing tolerations),`spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
  	... // 15 identical fields
  	Subdomain:         "",
  	SetHostnameAsFQDN: nil,
  	Affinity: &core.Affinity{
  		NodeAffinity: &core.NodeAffinity{
  			RequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{
  				NodeSelectorTerms: []core.NodeSelectorTerm{
  					{
  						MatchExpressions: []core.NodeSelectorRequirement{
  							{Key: "xxx/availability", Operator: "In", Values: {"on-demand"}},
  							{Key: "xxx/arch", Operator: "In", Values: {"amd64"}},
+ 							{Key: "xxx/availability", Operator: "In", Values: []string{"on-demand"}},
  						},

I've no idea why the patch logic occasionally is converting from Values: {"on-demand"} to Values: []string{"on-demand"} only on a subset of pods. A first thought was the match expressions objects being reordered & sorted based on the Key, however I was unable to replicate the error. This style of Affinity config exists in many other pods in the cluster.

monkeynator · 2024-10-09T13:41:39Z

🎉 This PR is included in version 1.10.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

fix: ignore pod patching failures during startup

a85319d

kppullin force-pushed the fix/ignore-pod-patching-failures-during-startup branch from 66fbac1 to a85319d Compare August 20, 2024 16:20

kppullin mentioned this pull request Aug 20, 2024

Controller crashes as soon as it starts #371

Open

Nicolasgouze merged commit d3bc9f1 into enix:main Aug 27, 2024
11 checks passed

kppullin deleted the fix/ignore-pod-patching-failures-during-startup branch September 16, 2024 18:06

kppullin mentioned this pull request Oct 3, 2024

Controller crashes on pods with Linkerd sidecar proxy containers #419

Open

monkeynator added the released label Oct 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Ignore pod patching failures during startup #397

fix: Ignore pod patching failures during startup #397

kppullin commented Aug 20, 2024

monkeynator commented Oct 9, 2024

fix: Ignore pod patching failures during startup #397

fix: Ignore pod patching failures during startup #397

Conversation

kppullin commented Aug 20, 2024

Summary

Patch Failure Details

monkeynator commented Oct 9, 2024