Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Ignore pod patching failures during startup #397

Conversation

kppullin
Copy link
Contributor

fix: Ignore pod patching failures during startup

Note: this is a copy of #392, which auto-closed when I renamed my branch in an attempt to follow conventionalcommits.org

Summary

Patch failures in pod_webhook.Start() abort the controller process and may result in crash looping.

We've experienced this on a test cluster, where we have 8 pods out of ~1100 that fail return an error on the "no op" patch attempt in Start().

As I believe the code within Start() is an optimization to eagerly create CachedImages, but otherwise not required to function properly, this patch will instead log the err and continue on to the next pod.

Patch Failure Details

Here's a sample of the patch failure:

The Pod "xxx-yyy-zzz" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`,`spec.initContainers[*].image`,`spec.activeDeadlineSeconds`,`spec.tolerations` (only additions to existing tolerations),`spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
  	... // 15 identical fields
  	Subdomain:         "",
  	SetHostnameAsFQDN: nil,
  	Affinity: &core.Affinity{
  		NodeAffinity: &core.NodeAffinity{
  			RequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{
  				NodeSelectorTerms: []core.NodeSelectorTerm{
  					{
  						MatchExpressions: []core.NodeSelectorRequirement{
  							{Key: "xxx/availability", Operator: "In", Values: {"on-demand"}},
  							{Key: "xxx/arch", Operator: "In", Values: {"amd64"}},
+ 							{Key: "xxx/availability", Operator: "In", Values: []string{"on-demand"}},
  						},

I've no idea why the patch logic occasionally is converting from Values: {"on-demand"} to Values: []string{"on-demand"} only on a subset of pods. A first thought was the match expressions objects being reordered & sorted based on the Key, however I was unable to replicate the error. This style of Affinity config exists in many other pods in the cluster.

@kppullin kppullin force-pushed the fix/ignore-pod-patching-failures-during-startup branch from 66fbac1 to a85319d Compare August 20, 2024 16:20
@Nicolasgouze Nicolasgouze merged commit d3bc9f1 into enix:main Aug 27, 2024
11 checks passed
@kppullin kppullin deleted the fix/ignore-pod-patching-failures-during-startup branch September 16, 2024 18:06
@monkeynator
Copy link
Member

🎉 This PR is included in version 1.10.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants