Introduce addons cache #1239
Conversation
We see too many random failures on the lab running the e2e job. The most common error is a git fetch error when applying the ocm-controller kustomization. Searching shows that there is no way to avoid this error. This change introduces a cache for kustomization directories, keeping the output of `kustomize build kustomization_dir` locally, and applying the cached resource when starting the addon.

Changes:

- The start hook can now fetch the cache and apply the cached resource. See ocm-controller for an example.
- Add a new "fetch" hook, called using the `drenv fetch` command. This can be used to create the cache for all addons implementing this hook periodically (e.g. in a daily cron job). See the ocm-controller fetch hook for an example.
- Add a new "clear" command to clear the cache. There is no automatic invalidation based on cache age.

The cache builds itself automatically, but it has to be invalidated manually. To keep the cache fresh, you can run `drenv clear` and `drenv fetch` periodically from a cron job.
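For example, a crontab entry refreshing the cache every night might look like this (the schedule and env file here are illustrative, not part of the change):

```
# Rebuild the drenv addons cache daily at 03:00 (illustrative schedule and paths)
0 3 * * * drenv clear envs/regional-dr.yaml && drenv fetch envs/regional-dr.yaml
```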
The following examples demonstrate how the cache works.

Clearing the cache:

```console
$ drenv clear envs/regional-dr.yaml
2024-03-12 22:51:37,001 INFO [rdr] Clearing cache
2024-03-12 22:51:37,002 INFO [rdr] Fetching finishied in 0.00 seconds
```

Pre-fetching all addon resources implementing the fetch hook:

```console
$ drenv fetch envs/regional-dr.yaml -v
2024-03-12 22:51:48,884 INFO [rdr] Fetching
2024-03-12 22:51:48,892 INFO [rdr] Running addons/ocm-controller/fetch
2024-03-12 22:51:49,041 DEBUG [rdr] Fetching /home/github/.cache/drenv/addons/ocm-controller.yaml
2024-03-12 22:52:22,202 INFO [rdr] addons/ocm-controller/fetch completed in 33.31 seconds
2024-03-12 22:52:22,203 INFO [rdr] Fetching finishied in 33.32 seconds
```

Fetching again does nothing since the resource already exists:

```console
$ drenv fetch envs/regional-dr.yaml -v
2024-03-12 22:52:27,423 INFO [rdr] Fetching
2024-03-12 22:52:27,427 INFO [rdr] Running addons/ocm-controller/fetch
2024-03-12 22:52:27,591 INFO [rdr] addons/ocm-controller/fetch completed in 0.16 seconds
2024-03-12 22:52:27,592 INFO [rdr] Fetching finishied in 0.17 seconds
```

Starting an addon uses the cached resources without accessing the network:

```console
$ addons/ocm-controller/start hub
Deploying ocm controller
customresourcedefinition.apiextensions.k8s.io/clusterclaims.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterdeployments.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterpools.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusteractions.action.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterimageregistries.imageregistry.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterinfos.internal.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterviews.view.open-cluster-management.io configured
serviceaccount/ocm-foundation-sa unchanged
clusterrole.rbac.authorization.k8s.io/managed-cluster-workmgr unchanged
clusterrole.rbac.authorization.k8s.io/open-cluster-management:ocm:foundation unchanged
clusterrolebinding.rbac.authorization.k8s.io/open-cluster-management:ocm:foundation unchanged
deployment.apps/ocm-controller unchanged
clustermanagementaddon.addon.open-cluster-management.io/work-manager unchanged
Waiting for ocm controller rollout
deployment "ocm-controller" successfully rolled out
```

Drop the cache:

```console
$ drenv clear envs/regional-dr.yaml -v
2024-03-12 22:52:50,418 INFO [rdr] Clearing cache
2024-03-12 22:52:50,419 INFO [rdr] Fetching finishied in 0.00 seconds
```

Starting an addon fetches the resources again transparently:

```console
$ addons/ocm-controller/start hub
Deploying ocm controller
Fetching /home/github/.cache/drenv/addons/ocm-controller.yaml
customresourcedefinition.apiextensions.k8s.io/clusterclaims.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterdeployments.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterpools.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusteractions.action.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterimageregistries.imageregistry.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterinfos.internal.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterviews.view.open-cluster-management.io configured
serviceaccount/ocm-foundation-sa unchanged
clusterrole.rbac.authorization.k8s.io/managed-cluster-workmgr unchanged
clusterrole.rbac.authorization.k8s.io/open-cluster-management:ocm:foundation unchanged
clusterrolebinding.rbac.authorization.k8s.io/open-cluster-management:ocm:foundation unchanged
deployment.apps/ocm-controller unchanged
clustermanagementaddon.addon.open-cluster-management.io/work-manager unchanged
Waiting for ocm controller rollout
deployment "ocm-controller" successfully rolled out
```
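Putting the pieces together, the core of the cache module might look roughly like this. This is a sketch, not the actual implementation: the `cache_home` default, the atomic rename, and the function bodies are assumptions informed by the snippets quoted in the review.

```python
import os
import subprocess


def path(key, cache_home=".cache"):
    # Map a cache key to a file under the user's cache directory
    # (sketch; the real default for cache_home may differ).
    return os.path.expanduser(f"~/{cache_home}/drenv/{key}")


def fetch(kustomization_dir, dest, log=print):
    # Build the kustomization and keep the output locally.
    # If dest already exists, the cached copy is used as-is.
    if os.path.exists(dest):
        return
    log(f"Fetching {dest}")
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    data = subprocess.check_output(["kustomize", "build", kustomization_dir])
    tmp = dest + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
    os.rename(tmp, dest)  # publish atomically so readers never see partial output
```

With this shape, a start hook can call `fetch()` unconditionally: a cache hit costs one `os.path.exists()` check, while a cache miss transparently rebuilds the resource.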
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Code under review:

```python
return os.path.expanduser(f"~/{cache_home}/drenv/{key}")
```

```python
def fetch(kustomization_dir, dest, log=print):
```
I would prefer if fetch took a key and returned a path. If you have a strong reason to have it this way, I am willing to merge it as it is.
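For reference, the suggested shape (a hypothetical sketch, not code from the PR) could look like:

```python
import os


def path(key, cache_home=".cache"):
    # Cache location for a key (assumed helper name).
    return os.path.expanduser(f"~/{cache_home}/drenv/{key}")


def fetch(key, kustomization_dir, log=print):
    # Suggested shape: take a cache key, return the cached path.
    dest = path(key)
    if not os.path.exists(dest):
        log(f"Fetching {dest}")
        # ... build kustomization_dir with kustomize and write the output to dest ...
    return dest
```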
We can do this, but we still need to expose cache.path(). I was not sure about this change, so I kept it as 2 simple operations.
Let's add a cache for a few more addons and evaluate again.
Ack
Example run shows ocm-controller finishing in 31 seconds (the cache already exists):
And we failed later waiting for the managed clusters; we need to debug this on the host.
For the failure of ocm-cluster: my simple solution, which is working well for me, is to change this section of the regional-dr.yaml

to

Running the ocm-cluster addon in the second worker of the profile means it starts as soon as the cluster starts, and the hub doesn't get time to finish the ocm hub setup.
It should work because ocm-cluster waits for the hub before starting. I guess we don't wait correctly for the hub. This change (using only 2 workers instead of 3) will make start slower on machines with a good network, so we should understand this failure better before changing the envs.
Next run, ocm-controller finished in 18 seconds:
@raghavendra-talur also, with the current code, ocm-cluster starts a long time after the hub completed everything. The hub starts here:

The hub completed the last addon:

ocm-cluster started more than 30 seconds later:

But this is a successful run; we need to compare with a failed run.
Code under review:

```python
try:
    shutil.rmtree(cache_dir)
except FileNotFoundError:
    pass
```
The clear implementation should move to `drenv.cache.clear()`.
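Following that suggestion, `drenv.cache.clear()` could look like this (a sketch; the signature is an assumption):

```python
import shutil


def clear(cache_dir):
    # Remove the whole cache tree. A missing directory means the
    # cache is already clear, so that case is ignored.
    try:
        shutil.rmtree(cache_dir)
    except FileNotFoundError:
        pass
```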
Code under review:

```python
for worker in env["workers"]:
    for addon in worker["addons"]:
        found[addon["name"]] = addon
return found.values()
```
Should move to envfile.
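Moved into envfile, the deduplication could become a small helper (a sketch; the function name is an assumption):

```python
def unique_addons(env):
    # Collect addons across all workers, keeping one entry per addon
    # name; a later definition overrides an earlier one while keeping
    # the original insertion order.
    found = {}
    for worker in env["workers"]:
        for addon in worker["addons"]:
            found[addon["name"]] = addon
    return list(found.values())
```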
When we time out waiting for the managedcluster, we see this:

Looks like the auto-accept feature is flaky. And the managed cluster does not have a status; it looks like it is not reconciled:
Oh, good debugging. So it may be a bug in ocm then. As a workaround, you can probably delete the pod and see if it reconciles when it starts again.
Sure, I need to spend more time debugging this. The last detail I found is that when this happens, the klusterlet-work-agent deployment on the affected managed cluster is not ready (0/1). So the broken flow is this:
So `clusteradm join` is broken; it returns with a zero exit code when the managed cluster is not accepted. Never seen in a local environment, so this is probably related to the networking issues. I'll try to debug this more next week and open an ocm issue. Related ocm issue: open-cluster-management-io/clusteradm#334
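Until the clusteradm issue is fixed, the join flow could avoid trusting the exit code and poll for acceptance instead. A generic polling helper (a hypothetical sketch, not drenv code) might be:

```python
import time


def wait_until(check, timeout=300, interval=5,
               clock=time.monotonic, sleep=time.sleep):
    # Poll check() until it returns a truthy value or timeout expires.
    # clock and sleep are injectable to make the helper testable.
    deadline = clock() + timeout
    while clock() < deadline:
        if check():
            return True
        sleep(interval)
    return False
```

In the start hook this could wrap something like `kubectl get managedcluster <name> -o jsonpath='{.spec.hubAcceptsClient}'`, retrying until the hub accepts the cluster.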