NSM usage on high load #1031

Closed
Bolodya1997 opened this issue Jul 20, 2021 · 13 comments

Bolodya1997 commented Jul 20, 2021

Description

This issue presents the results of an investigation into NSM behavior under high load.

Context

Environment: 8 GB RAM, 2x2.4 GHz CPU, Ubuntu 18.04.5 LTS (VM)
Client: cmd-nsc with the following setup:

- name: NSM_NETWORK_SERVICES
  value: kernel://icmp-responder/nsm-1
- name: NSM_REQUEST_TIMEOUT
  value: 1m
- name: NSM_MAX_TOKEN_LIFETIME
  value: 10m

Endpoint: cmd-icmp-responder with the following setup:

- name: NSE_CIDR_PREFIX
  value: 172.16.1.100/16

Test scenario

  1. Start NSM on a single kind node.
  2. Start the Endpoint.
  3. Start 50 Clients with a replica set (see the sketch after this list).
  4. Wait for the Clients to start and connect to the Endpoint.
  5. Kill all Client pods (old pods stop, new pods start).
  6. Wait for the new Clients to start and connect to the Endpoint.
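A minimal sketch of how step 3 could look, assuming a plain ReplicaSet; the names, labels and image reference are illustrative, and the NSM/SPIFFE socket volumes a real cmd-nsc pod needs are omitted for brevity:

# Hypothetical ReplicaSet running 50 cmd-nsc Clients with the environment shown above
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nsc-kernel
spec:
  replicas: 50
  selector:
    matchLabels:
      app: nsc-kernel
  template:
    metadata:
      labels:
        app: nsc-kernel
    spec:
      containers:
        - name: nsc
          image: ghcr.io/networkservicemesh/cmd-nsc:latest  # illustrative image reference
          env:
            - name: NSM_NETWORK_SERVICES
              value: kernel://icmp-responder/nsm-1
            - name: NSM_REQUEST_TIMEOUT
              value: 1m
            - name: NSM_MAX_TOKEN_LIFETIME
              value: 10m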

Behavior

When CPU/memory resources run short, NSM starts working slowly: passing a single chain element can take 1-2 s, and performing a gRPC request can also take 1-2 s.
This results in Request/Close timeouts for the new/old Clients and so reveals the following issues:

  1. Timeout, Expire stop timer if Close, Unregister fails #1028
  2. Close resources on expired Request context #1026
  3. Resources leak until timeout if response fails to return to the Client #1020

So currently NSM is in the following state across the different periods of time:

| | before high load | during high load | after high load, before timeout | after high load, after timeout |
| --- | --- | --- | --- | --- |
| Requests, Closes are failing | - | + | - | - |
| Requests repeated during the cmd-nsc pod restart are failing | - | - | - | - |
| resources are leaking | - | + | - | - |
| some leaked resources are NOT eventually getting closed | - | + | + | + |

So in general NSM keeps working and Client pods eventually connect to it after some retries even under high load, but we have problems with leaked resources even after the high load ends and the timeout happens.

Fixing [1, 2] would lead us to the following:

| | before high load | during high load | after high load, before timeout | after high load, after timeout |
| --- | --- | --- | --- | --- |
| Requests, Closes are failing | - | + | - | - |
| Requests repeated during the cmd-nsc pod restart are failing | - | - | - | - |
| resources are leaking | - | + | - | - |
| some leaked resources are NOT eventually getting closed | - | + | + | - |

Fixing [3] can lead us very close to the following:

| | before high load | during high load | after high load, before timeout | after high load, after timeout |
| --- | --- | --- | --- | --- |
| Requests, Closes are failing | - | + | - | - |
| Requests repeated during the cmd-nsc pod restart are failing | - | - | - | - |
| resources are leaking | - | + | - | - |
| some leaked resources are NOT eventually getting closed | - | - | - | - |

Actually, "resources are leaking" is fully caused by "Requests, Closes are failing"; with [3] fixed it would be caused only by "Closes are failing". It doesn't look like we can fully fix this issue, because in the worst case the Close event may not reach NSMgr until the context timeout happens (networkservicemesh/deployments-k8s#2085), but we can try to improve it in different ways:

  1. Add closer server to the NSMgr chain #1032
  2. Dynamically increase dial timeout in connect client #1033
  3. Add queue server chain element #1034
Bolodya1997 changed the title from "[draft] NSM usage on high load" to "NSM usage on high load" on Jul 20, 2021
Bolodya1997 (Author) commented:

@edwarnicke
Please share your thoughts :)

Do we need these improvements? Or would fixing just issues [1-3] (or perhaps only [1-2]) already be enough for us?

edwarnicke (Member) commented:

Here's the better question: why are 50 Clients exhausting enough resources to lead to this behavior?

Bolodya1997 (Author) commented:

> Here's the better question: why are 50 Clients exhausting enough resources to lead to this behavior?

Actually it is not 50 clients - it is 50 clients sending Close and 50 clients sending Request, so it is more like 100 clients.
Simply starting/stopping 50 clients doesn't cause any issue.

edwarnicke commented Jul 20, 2021

> Actually it is not 50 clients - it is 50 clients sending Close and 50 clients sending Request, so it is more like 100 clients.
> Simply starting/stopping 50 clients doesn't cause any issue.

Good to know. But even so... I'm surprised that 100 Clients is causing problems. Do we know why? What's the bottleneck in NSM? Or is NSM just getting throttled out by 100 Pods sharing the Node?

Bolodya1997 (Author) commented:

> Or is NSM just getting throttled out by 100 Pods sharing the Node?

Yes, the whole NSM just starts working incredibly slowly.

edwarnicke (Member) commented:

@Bolodya1997 OK.. do we have a sense of why?

Bolodya1997 (Author) commented:

I guess mostly because of swapping - top shows huge memory usage on the VM.

edwarnicke (Member) commented:

Ah... what is using the memory?

Bolodya1997 (Author) commented:

Just rechecked right now - it is not related to swapping: RAM consumption stays at only about 40% on the VM.
It is only related to CPU, and it actually looks like there is a linear dependency between the number of clients and the CPU consumed (mostly by kubelet, kube-apiserver, spire, nsmgr, forwarder). So increasing the count to 50 (100 at peak) clients simply needs more CPU resources than I have on the VM.

Bolodya1997 commented Jul 21, 2021

Tested this on Packet - everything is OK for 50 (100 at peak) clients.
Also tested Packet with 80 (160 at peak) clients: it looks like we possibly have a bottleneck in forwarder -> VPP communication - it takes ~18 s to configure and bring up 2 tap interfaces (for Client and Endpoint) with MTU, routes and IP addresses, but the NSM part is still not taking too much time.

edwarnicke (Member) commented:

> it takes ~18 s to configure and bring up 2 tap interfaces (for Client and Endpoint) with MTU, routes and IP addresses, but the NSM part is still not taking too much time.

That's interesting... do you have the detailed logs that show the particular things that are taking so long to program? I typically see that sort of thing taking less than 100 ms locally, so I'm curious where the bottleneck is in your Packet runs. You can get the detailed logs by setting NSM_LOG_LEVEL to 'DEBUG'.
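In the same env-list style as the Client/Endpoint setup above, that would be something like the following (assuming the variable is added to the forwarder container's environment):

- name: NSM_LOG_LEVEL
  value: DEBUG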

Bolodya1997 (Author) commented:

Here are the logs for a single client:
nsc-kernel-5gg4t.forwarder.log

Bolodya1997 (Author) commented:

Filed an issue to track the VPP problem - networkservicemesh/sdk-vpp#345.
Closing this one.
