
Introduce addons cache #1239

Merged (1 commit) on Mar 14, 2024
Conversation

@nirs (Member) commented Mar 12, 2024

We see too many random failures on the lab running the e2e job. The most common error is a git fetch error when applying ocm-controller kustomization. Search shows that there is no way to avoid this error.

This change introduces a cache for kustomization directories, keeping the output of kustomize build kustomization_dir locally, and applying the cached resource when starting the addon.

Changes:

  • The start hook can now fetch the cache and apply the cached resource. See ocm-controller for an example.

  • Add new "fetch" hook, called using the drenv fetch command. This can be used to create the cache periodically for all addons implementing this hook (e.g. in a daily cron job). See the ocm-controller fetch hook for an example.

  • Add new "clear" command to clear the cache. There is no automatic invalidation based on cache age.

The cache builds itself automatically, but it has to be invalidated manually. To keep the cache fresh, you can run drenv clear and drenv fetch periodically from a cron job.
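The scheme described above can be sketched roughly as follows. This is a hypothetical outline, not the code from this PR; names and the cache layout are illustrative and the real drenv.cache module may differ:

```python
import os
import shutil
import subprocess


def path(key):
    # Illustrative layout: cache keys map to files under the user cache
    # directory, e.g. ~/.cache/drenv/addons/ocm-controller.yaml.
    return os.path.expanduser(f"~/.cache/drenv/{key}")


def fetch(kustomization_dir, dest, log=print):
    # Build the kustomization once and keep the output locally. If the
    # cached file already exists, do nothing; there is no automatic
    # invalidation based on age.
    if os.path.exists(dest):
        return
    log(f"Fetching {dest}")
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    tmp = dest + ".tmp"
    with open(tmp, "w") as f:
        subprocess.run(
            ["kustomize", "build", kustomization_dir],
            stdout=f,
            check=True,
        )
    # Rename only after a successful build, so a failed fetch never
    # leaves a partial cached resource behind.
    os.rename(tmp, dest)


def clear():
    # Drop the whole cache directory; the next start fetches again.
    try:
        shutil.rmtree(os.path.expanduser("~/.cache/drenv"))
    except FileNotFoundError:
        pass
```

A start hook would then call fetch() and apply the cached file with kubectl, falling back to a fresh kustomize build transparently when the cache is empty.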

The following examples demonstrate how the cache works.

Clearing the cache:

$ drenv clear envs/regional-dr.yaml
2024-03-12 22:51:37,001 INFO    [rdr] Clearing cache
2024-03-12 22:51:37,002 INFO    [rdr] Fetching finishied in 0.00 seconds

Pre-fetching resources for all addons implementing the fetch hook:

$ drenv fetch envs/regional-dr.yaml -v
2024-03-12 22:51:48,884 INFO    [rdr] Fetching
2024-03-12 22:51:48,892 INFO    [rdr] Running addons/ocm-controller/fetch
2024-03-12 22:51:49,041 DEBUG   [rdr] Fetching /home/github/.cache/drenv/addons/ocm-controller.yaml
2024-03-12 22:52:22,202 INFO    [rdr] addons/ocm-controller/fetch completed in 33.31 seconds
2024-03-12 22:52:22,203 INFO    [rdr] Fetching finishied in 33.32 seconds

Fetching again does nothing since the resource already exists:

$ drenv fetch envs/regional-dr.yaml -v
2024-03-12 22:52:27,423 INFO    [rdr] Fetching
2024-03-12 22:52:27,427 INFO    [rdr] Running addons/ocm-controller/fetch
2024-03-12 22:52:27,591 INFO    [rdr] addons/ocm-controller/fetch completed in 0.16 seconds
2024-03-12 22:52:27,592 INFO    [rdr] Fetching finishied in 0.17 seconds

Starting an addon uses the cached resources without accessing the network:

$ addons/ocm-controller/start hub
Deploying ocm controller
customresourcedefinition.apiextensions.k8s.io/clusterclaims.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterdeployments.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterpools.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusteractions.action.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterimageregistries.imageregistry.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterinfos.internal.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterviews.view.open-cluster-management.io configured
serviceaccount/ocm-foundation-sa unchanged
clusterrole.rbac.authorization.k8s.io/managed-cluster-workmgr unchanged
clusterrole.rbac.authorization.k8s.io/open-cluster-management:ocm:foundation unchanged
clusterrolebinding.rbac.authorization.k8s.io/open-cluster-management:ocm:foundation unchanged
deployment.apps/ocm-controller unchanged
clustermanagementaddon.addon.open-cluster-management.io/work-manager unchanged
Waiting for ocm controller rollout
deployment "ocm-controller" successfully rolled out

Drop the cache:

$ drenv clear envs/regional-dr.yaml -v
2024-03-12 22:52:50,418 INFO    [rdr] Clearing cache
2024-03-12 22:52:50,419 INFO    [rdr] Fetching finishied in 0.00 seconds

Starting an addon fetches the resources again transparently:

$ addons/ocm-controller/start hub
Deploying ocm controller
Fetching /home/github/.cache/drenv/addons/ocm-controller.yaml
customresourcedefinition.apiextensions.k8s.io/clusterclaims.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterdeployments.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterpools.hive.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusteractions.action.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterimageregistries.imageregistry.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterinfos.internal.open-cluster-management.io configured
customresourcedefinition.apiextensions.k8s.io/managedclusterviews.view.open-cluster-management.io configured
serviceaccount/ocm-foundation-sa unchanged
clusterrole.rbac.authorization.k8s.io/managed-cluster-workmgr unchanged
clusterrole.rbac.authorization.k8s.io/open-cluster-management:ocm:foundation unchanged
clusterrolebinding.rbac.authorization.k8s.io/open-cluster-management:ocm:foundation unchanged
deployment.apps/ocm-controller unchanged
clustermanagementaddon.addon.open-cluster-management.io/work-manager unchanged
Waiting for ocm controller rollout
deployment "ocm-controller" successfully rolled out

    return os.path.expanduser(f"~/{cache_home}/drenv/{key}")


    def fetch(kustomization_dir, dest, log=print):
Member:
I would prefer if fetch took a key and returned a path. If you have a strong reason to have it this way, I am willing to merge it as it is.

nirs (Member Author):
We can do this, but we still need to expose cache.path(). I was not sure about this change, so I kept it as 2 simple operations.

Let's add the cache for a few more addons and evaluate again.
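The API suggested in the review could look roughly like this. A hypothetical sketch only; the build callback parameter is an illustration added here for testability, not part of the real module:

```python
import os
import subprocess


def fetch(kustomization_dir, key, build=None):
    # Alternative shape: take a cache key, return the cached path, and
    # hide path construction inside the cache module.
    dest = os.path.expanduser(f"~/.cache/drenv/{key}")
    if not os.path.exists(dest):
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        if build is None:
            # Default builder: run kustomize and capture its output.
            def build(src, out):
                with open(out, "w") as f:
                    subprocess.run(
                        ["kustomize", "build", src], stdout=f, check=True
                    )
        build(kustomization_dir, dest)
    return dest
```

With this shape a start hook gets the path back from fetch() and never needs a separate cache.path() call.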

Member:
Ack

@nirs (Member Author) commented Mar 12, 2024

An example run shows ocm-controller finishing in 31 seconds (the cache already exists):

2024-03-13 00:01:36,865 INFO    [rdr-rdr] Starting environment
2024-03-13 00:01:36,991 INFO    [rdr-dr1] Starting minikube cluster
2024-03-13 00:01:38,126 INFO    [rdr-dr2] Starting minikube cluster
2024-03-13 00:01:39,061 INFO    [rdr-hub] Starting minikube cluster
2024-03-13 00:03:39,595 INFO    [rdr-dr1] Cluster started in 122.60 seconds
2024-03-13 00:03:39,595 INFO    [rdr-dr1/0] Running addons/cert-manager/start
2024-03-13 00:03:44,415 INFO    [rdr-hub] Cluster started in 125.35 seconds
2024-03-13 00:03:44,416 INFO    [rdr-hub/0] Running addons/ocm-hub/start
2024-03-13 00:03:59,625 INFO    [rdr-dr1/0] addons/cert-manager/start completed in 20.03 seconds
2024-03-13 00:03:59,625 INFO    [rdr-dr1/0] Running addons/rook-operator/start
2024-03-13 00:04:13,381 INFO    [rdr-dr2] Cluster started in 155.26 seconds
2024-03-13 00:04:13,382 INFO    [rdr-dr2/0] Running addons/cert-manager/start
2024-03-13 00:04:34,003 INFO    [rdr-dr2/0] addons/cert-manager/start completed in 20.62 seconds
2024-03-13 00:04:34,003 INFO    [rdr-dr2/0] Running addons/rook-operator/start
2024-03-13 00:04:54,316 INFO    [rdr-dr1/0] addons/rook-operator/start completed in 54.69 seconds
2024-03-13 00:04:54,316 INFO    [rdr-dr1/0] Running addons/rook-cluster/start
2024-03-13 00:04:56,992 INFO    [rdr-hub/0] addons/ocm-hub/start completed in 72.58 seconds
2024-03-13 00:04:56,992 INFO    [rdr-hub/0] Running addons/ocm-controller/start
2024-03-13 00:05:22,782 INFO    [rdr-dr2/0] addons/rook-operator/start completed in 48.78 seconds
2024-03-13 00:05:22,783 INFO    [rdr-dr2/0] Running addons/rook-cluster/start
2024-03-13 00:05:28,243 INFO    [rdr-hub/0] addons/ocm-controller/start completed in 31.25 seconds

And the run failed later waiting for the managed clusters; we need to debug this on the host.

@raghavendra-talur

For the failure of ocm-cluster

      drenv.commands.Error: Command failed:
         command: ('kubectl', 'wait', '--context', 'rdr-hub', 'managedcluster/rdr-dr1', '--for=condition=ManagedClusterConditionAvailable', '--timeout=600s')

My simple solution, which is working well for me, is to change this section of regional-dr.yaml

      - addons:
          - name: cert-manager
          - name: rook-operator
          - name: rook-cluster
          - name: rook-pool
          - name: rook-toolbox
      - addons:
          - name: ocm-cluster
            args: ["$name", "hub"]
          - name: recipe
      - addons:
          - name: csi-addons
          - name: olm
          - name: minio
          - name: velero

to

      - addons:
          - name: cert-manager
          - name: rook-operator
          - name: rook-cluster
          - name: rook-pool
          - name: rook-toolbox
      - addons:
          - name: csi-addons
          - name: olm
          - name: minio
          - name: velero
          - name: ocm-cluster
            args: ["$name", "hub"]
          - name: recipe

Running the ocm-cluster addon in the second worker of the profile means it starts as soon as the cluster starts, and the hub doesn't get time to finish the ocm hub setup.

@nirs (Member Author) commented Mar 12, 2024

For the failure of ocm-cluster
...
Running ocm-cluster addon in the second worker of the profile means it will start as soon as the cluster starts and the hub doesn't get time to finish ocm hub setup.

It should work because ocm-cluster waits for the hub before starting. I guess we don't wait correctly for the hub.

This change (using only 2 workers instead of 3) will make startup slower on machines with a good network, so we should understand this failure better before changing the envs.

@nirs (Member Author) commented Mar 12, 2024

Next run, ocm-controller finished in 18 seconds:

2024-03-13 00:56:41,431 INFO    [rdr-hub/0] addons/ocm-controller/start completed in 18.26 seconds

@nirs (Member Author) commented Mar 12, 2024

@raghavendra-talur also, with the current code, ocm-cluster starts a long time after the hub has completed everything:

hub starts here:

2024-03-13 01:21:33,946 INFO    [rdr-hub] Cluster started in 127.29 seconds
2024-03-13 01:21:33,947 INFO    [rdr-hub/0] Running addons/ocm-hub/start

hub completed last addon:

2024-03-13 01:26:58,159 INFO    [rdr-hub/1] addons/submariner/test completed in 72.09 seconds

ocm-cluster started more than 30 seconds later:

2024-03-13 01:27:24,177 INFO    [rdr-dr2/1] Running addons/ocm-cluster/start
2024-03-13 01:27:34,729 INFO    [rdr-dr1/1] Running addons/ocm-cluster/start

But this is a successful run; we need to compare it to a failed run.

    try:
        shutil.rmtree(cache_dir)
    except FileNotFoundError:
        pass
nirs (Member Author):
Clear implementation should move to drenv.cache.clear().

    for worker in env["workers"]:
        for addon in worker["addons"]:
            found[addon["name"]] = addon
    return found.values()
nirs (Member Author):
Should move to envfile.

@nirs (Member Author) commented Mar 13, 2024

When we time out waiting for the managedcluster, we see this:

$ kubectl get managedcluster -A --context rdr-hub
NAME      HUB ACCEPTED   MANAGED CLUSTER URLS                           JOINED   AVAILABLE   AGE
rdr-dr1   true           https://control-plane.minikube.internal:8443   True     True        3m6s
rdr-dr2   false          https://control-plane.minikube.internal:8443                        4m2s

Looks like the auto-accept feature is flaky.

And the managed cluster does not have a status; it looks like it is not reconciled:

$ kubectl get managedcluster rdr-dr2 -A --context rdr-hub -o yaml
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  creationTimestamp: "2024-03-13T01:16:43Z"
  finalizers:
  - managedclusterinfo.finalizers.open-cluster-management.io
  - open-cluster-management.io/managedclusterrole
  - cluster.open-cluster-management.io/api-resource-cleanup
  generation: 2
  labels:
    cluster.open-cluster-management.io/clusterset: default
    name: rdr-dr2
  name: rdr-dr2
  resourceVersion: "2288"
  uid: b0ea5f59-2602-4ca7-99ad-7726e5915fbc
spec:
  hubAcceptsClient: false
  leaseDurationSeconds: 60
  managedClusterClientConfigs:
  - caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCakNDQWU2Z0F3SUJBZ0lCQVRBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwdGFXNXAKYTNWaVpVTkJNQjRYRFRJek1USXhOekl3TWprd01sb1hEVE16TVRJeE5USXdNamt3TWxvd0ZURVRNQkVHQTFVRQpBeE1LYldsdWFXdDFZbVZEUVRDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTXJxCkpSUXNWVFZXN3JydzJ5cENkdTlDLzZ2Y2dkOEF0SUFyeHp2dFlHOVdxRUY3V2JPa0JkczZwZUp2bVpOK0J0a2EKUjc5Znh5VXQrQjd6YTBqREFHeTdVL3NVeUpDN000am0wYnZUeG5XQU5mSlZjMGhRekRJT0k3dHA2Y0ZocHBGcgo5VnpWSEtkVWFLZExDY1h0djhYTGNTYmNDMFZiTVl2dW1ra3EyYWRxajFFYTJhK1R1OC8xT0kwZm5xVHNacjhaCkw1bkYrcTBXMzZCQ2V5YTlHbUdyeHFJSHd5SE9wMTRqZEhkUW5Pa0FsaWhPR0swM090ZUU0bFEwbGhTSzJjaXYKb3d5NVk1RWw0eDdON0ZZaE1MTGkyakE0RS80NnRqcUNxVnptZmxBYnkvYytiNUJWdDlsWFRoaHg0RGNxeVhkMQpXNytoTVpIbEpFOXA1N1hlR3BrQ0F3RUFBYU5oTUY4d0RnWURWUjBQQVFIL0JBUURBZ0trTUIwR0ExVWRKUVFXCk1CUUdDQ3NHQVFVRkJ3TUNCZ2dyQmdFRkJRY0RBVEFQQmdOVkhSTUJBZjhFQlRBREFRSC9NQjBHQTFVZERnUVcKQkJRSHZPczRPeG5DY0ZTUTZoMjZqampQSUkwci96QU5CZ2txaGtpRzl3MEJBUXNGQUFPQ0FRRUFJNzZBdWRkdworZjh6RjhwTDRyZFhkR0NzVUFnNjA1NVBwblUzakRDTDZjdGYwZ2szRWdTVkFyY1J1TVo5VXpJUlM2MmxEVjBICjU1ZEZzWmtXbk9qdCthNllLZSt5MW82Qjh0bDlkVkRYenRlY2crdEZlMFptcDRaekFuVFdJVjZLbWlKcDdMTEYKa0tnZE1WN29zbXVJU2hmUjdENEp6ZUJ0bmlORkdnZ21zU21qNjBtNDQwcXpUMUROWFZBTktOWkJ1Y1owT2dydQo4L3RweG9kSTN2dzBNVit1U1VnU3M5bkg0NTJXbGRzVEVETFY5MXhwaFBXM0pQeGtkMCtWZk5tNXRwTjZLcEtpCmV0N1BlS0lsMnQ2cUUwbW03YkdzVkdGT0NvOEVjMUtwclpCRVB4NTFZTjlYVmxPSFlDYlkxd29oSnZjdmJodG0KUXU0ZmhTUEtYTGx5V2c9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
    url: https://control-plane.minikube.internal:8443
  taints:
  - effect: NoSelect
    key: cluster.open-cluster-management.io/unreachable
    timeAdded: "2024-03-13T01:16:44Z"

@raghavendra-talur

When we timeout waiting for managedcluster, we see this:
...
Looks like the auto-accept feature is flaky.
...

Oh, good debugging. So maybe a bug in ocm then. As a workaround, you can probably delete the pod and see if it reconciles when it starts again.

@nirs (Member Author) commented Mar 13, 2024

Oh, good debugging. So maybe a bug in ocm then. As a workaround, you can probably delete the pod and see if it reconciles when it starts again.

Sure, I need to spend more time debugging this. The last detail I found is that when this happens, the klusterlet-work-agent deployment on the affected managed cluster is not ready (0/1).

So the broken flow is this:

  • we run clusteradm init on the hub - succeeds
  • we wait until all hub deployments are rolled out
  • we run clusteradm join --wait on the managed cluster - always succeeds

So clusteradm join is broken: it returns a zero exit code even when the managed cluster is not accepted. Never seen in a local environment, so this is probably related to the networking issues.
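A possible stopgap is to verify acceptance explicitly instead of trusting the join exit code. A hypothetical sketch, not part of this PR (the jsonpath form of kubectl wait requires kubectl 1.23 or later):

```python
import subprocess


def wait_command(hub_context, cluster, timeout="600s"):
    # Build a kubectl command that waits until the hub has actually
    # accepted the managed cluster, instead of trusting the exit code
    # of `clusteradm join --wait`.
    return [
        "kubectl", "wait",
        "--context", hub_context,
        f"managedcluster/{cluster}",
        "--for=jsonpath={.spec.hubAcceptsClient}=true",
        f"--timeout={timeout}",
    ]


def wait_for_acceptance(hub_context, cluster, timeout="600s"):
    # Raises CalledProcessError if acceptance does not happen in time,
    # turning the silent failure into a visible one.
    subprocess.run(wait_command(hub_context, cluster, timeout), check=True)
```

This would make the flaky auto-accept show up as an explicit timeout in the ocm-cluster start hook rather than a late failure waiting for ManagedClusterConditionAvailable.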

I'll try to debug this more next week and open ocm issue.

Related ocm issue: open-cluster-management-io/clusteradm#334

@raghavendra-talur raghavendra-talur merged commit ba9c3be into RamenDR:main Mar 14, 2024
14 of 15 checks passed
@nirs mentioned this pull request Mar 27, 2024