NSM usage on high load #1031

Closed
Bolodya1997 opened this issue Jul 20, 2021 · 13 comments

Bolodya1997 commented Jul 20, 2021

Description

This issue presents the results of an investigation into NSM behavior under high load.

Context

Environment: 8 GB RAM, 2x2.4 GHz CPU, Ubuntu 18.04.5 LTS (VM)
Client: cmd-nsc with the following setup:

- name: NSM_NETWORK_SERVICES
  value: kernel://icmp-responder/nsm-1
- name: NSM_REQUEST_TIMEOUT
  value: 1m
- name: NSM_MAX_TOKEN_LIFETIME
  value: 10m

Endpoint: cmd-icmp-responder with the following setup:

- name: NSE_CIDR_PREFIX
  value: 172.16.1.100/16

Test scenario

  1. Start NSM on a single kind node.
  2. Start the Endpoint.
  3. Start 50 Clients with a replica set (see the sketch after this list).
  4. Wait for the Clients to start and connect to the Endpoint.
  5. Kill all Client pods (old pods stop, new pods start).
  6. Wait for the new Clients to start and connect to the Endpoint.
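A minimal sketch of how step 3 could look, assuming a plain ReplicaSet; the names, labels and image reference are illustrative, and the NSM/SPIFFE socket volumes a real cmd-nsc pod needs are omitted for brevity:

# Hypothetical ReplicaSet running 50 cmd-nsc Clients with the environment shown above
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nsc-kernel
spec:
  replicas: 50
  selector:
    matchLabels:
      app: nsc-kernel
  template:
    metadata:
      labels:
        app: nsc-kernel
    spec:
      containers:
        - name: nsc
          image: ghcr.io/networkservicemesh/cmd-nsc:latest  # illustrative image reference
          env:
            - name: NSM_NETWORK_SERVICES
              value: kernel://icmp-responder/nsm-1
            - name: NSM_REQUEST_TIMEOUT
              value: 1m
            - name: NSM_MAX_TOKEN_LIFETIME
              value: 10m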

Behavior

When CPU/memory resources run short, NSM starts working slowly: passing a single chain element can take 1-2 s, and performing a gRPC request can also take 1-2 s.
This results in Request/Close timeouts for the new/old Clients and so reveals the following issues:

  1. Timeout, Expire stop timer if Close, Unregister fails #1028
  2. Close resources on expired Request context #1026
  3. Resources leak until timeout if response fails to return to the Client #1020

So currently NSM is in the following state across the different periods of time:

| | before high load | during high load | after high load, before timeout | after high load, after timeout |
| --- | --- | --- | --- | --- |
| Requests, Closes are failing | - | + | - | - |
| Requests repeated during the cmd-nsc pod restart are failing | - | - | - | - |
| resources are leaking | - | + | - | - |
| some leaked resources are NOT eventually getting closed | - | + | + | + |

So in general NSM keeps working and Client pods eventually connect to it after some retries even under high load, but we have problems with leaked resources even after the high load ends and the timeout happens.

Fixing [1, 2] would lead us to the following:

| | before high load | during high load | after high load, before timeout | after high load, after timeout |
| --- | --- | --- | --- | --- |
| Requests, Closes are failing | - | + | - | - |
| Requests repeated during the cmd-nsc pod restart are failing | - | - | - | - |
| resources are leaking | - | + | - | - |
| some leaked resources are NOT eventually getting closed | - | + | + | - |

Fixing [3] can lead us very close to the following:

| | before high load | during high load | after high load, before timeout | after high load, after timeout |
| --- | --- | --- | --- | --- |
| Requests, Closes are failing | - | + | - | - |
| Requests repeated during the cmd-nsc pod restart are failing | - | - | - | - |
| resources are leaking | - | + | - | - |
| some leaked resources are NOT eventually getting closed | - | - | - | - |

Actually, "resources are leaking" is fully caused by "Requests, Closes are failing"; with [3] fixed it would be caused only by "Closes are failing". It doesn't look like we can fully fix this issue, because in the worst case the Close event may not reach NSMgr until the context timeout happens (networkservicemesh/deployments-k8s#2085), but we can try to improve it in different ways:

  1. Add closer server to the NSMgr chain #1032
  2. Dynamically increase dial timeout in connect client #1033
  3. Add queue server chain element #1034
Bolodya1997 changed the title from "[draft] NSM usage on high load" to "NSM usage on high load" on Jul 20, 2021
Bolodya1997 (Author) commented:

@edwarnicke
Please share your thoughts :)

Do we need these improvements? Or would fixing just issues [1-3] (or perhaps only [1-2]) already be enough for us?

edwarnicke (Member) commented:

Here's the better question: why are 50 Clients exhausting enough resources to lead to this behavior?

Bolodya1997 (Author) commented:

> Here's the better question: why are 50 Clients exhausting enough resources to lead to this behavior?

Actually it is not 50 clients - it is 50 clients sending Close and 50 clients sending Request, so it is more like 100 clients.
Simply starting/stopping 50 clients doesn't cause any issue.

edwarnicke commented Jul 20, 2021

> Actually it is not 50 clients - it is 50 clients sending Close and 50 clients sending Request, so it is more like 100 clients.
> Simply starting/stopping 50 clients doesn't cause any issue.

Good to know. But even so... I'm surprised that 100 Clients is causing problems. Do we know why? What's the bottleneck in NSM? Or is NSM just getting throttled out by 100 Pods sharing the Node?

Bolodya1997 (Author) commented:

> Or is NSM just getting throttled out by 100 Pods sharing the Node?

Yes, the whole NSM just starts working incredibly slowly.

edwarnicke (Member) commented:

@Bolodya1997 OK.. do we have a sense of why?

Bolodya1997 (Author) commented:

I guess mostly because of swapping - top shows huge memory usage on the VM.

edwarnicke (Member) commented:

Ah... what is using the memory?

Bolodya1997 (Author) commented:

Just rechecked right now - it is not related to swapping: RAM consumption stays at only about 40% on the VM.
It is only related to CPU, and it actually looks like there is a linear dependency between the number of clients and the CPU consumed (mostly by kubelet, kube-apiserver, spire, nsmgr, forwarder). So increasing the count to 50 (100 at peak) clients simply needs more CPU resources than I have on the VM.

Bolodya1997 commented Jul 21, 2021

Tested this on Packet - everything is OK for 50 (100 at peak) clients.
Also tested Packet with 80 (160 at peak) clients: it looks like we possibly have a bottleneck in forwarder -> VPP communication - it takes ~18 s to configure and bring up 2 tap interfaces (for Client and Endpoint) with MTU, routes and IP addresses, but the NSM part is still not taking too much time.

edwarnicke (Member) commented:

> it takes ~18 s to configure and bring up 2 tap interfaces (for Client and Endpoint) with MTU, routes and IP addresses, but the NSM part is still not taking too much time.

That's interesting... do you have the detailed logs that show the particular things that are taking so long to program? I typically see that sort of thing taking less than 100 ms locally, so I'm curious where the bottleneck is in your Packet runs. You can get the detailed logs by setting NSM_LOG_LEVEL to 'DEBUG'.
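In the same env-list style as the Client/Endpoint setup above, that would be something like the following (assuming the variable is added to the forwarder container's environment):

- name: NSM_LOG_LEVEL
  value: DEBUG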

Bolodya1997 (Author) commented:

Here are the logs for a single client:
nsc-kernel-5gg4t.forwarder.log

Bolodya1997 (Author) commented:

Filed an issue to track the VPP problem - networkservicemesh/sdk-vpp#345.
Closing this one.
