Add WebAssembly SIMD128 implementation and Node.JS support #825

easyaspi314 · 2023-03-14T16:25:20Z

- Support WebAssembly SIMD128 via the NEON path
  - The Emscripten SDK includes the SIMDeverywhere polyfill for <arm_neon.h>
  - Since SIMD128 seems to be designed after a subset of NEON, the double width NEON path maps 1:1 and gets full speed with few modifications
    - Note: XXH3_NEON_LANES != 8 IS BAD since SIMDe scalarizes half vectors
  - On v8, about 2x faster than scalar on AArch64, 3-4x faster on x86_64
  - Firefox seems to be slightly faster on x86, slightly slower on aarch64
- Adds a few fixes for compiling with Node.JS (make NODE_JS=1)
  - Requires linking nodefs and noderawfs to access local files
  - Hook into node.js directly using inline JavaScript to access tty.isatty() from Node instead of the broken isatty() in libc (which always returns 1 for stdin/stdout/stderr)
- Add wasm/asmjs to the welcome message
- Remove emcc output files on make clean
- Add basic CI tests for both WASM and asm.js (the latter of which is painfully slow lol)
Also since I needed to modify XXH3_scrambleAcc_neon, I found a stupidly obvious optimization to standard NEON.
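
For readers who want to see the shape of the `isatty()` workaround listed above, here is a minimal sketch. It is not the code from this PR: `demo_is_console` and the `DEMO_NODE_JS` macro are hypothetical names, and it assumes Emscripten's `EM_JS` macro plus a Node.js runtime where `require('tty')` is available. Outside Emscripten it falls back to the libc `isatty()` so the snippet still compiles natively.

```c
#include <unistd.h>   /* isatty() for the native fallback */

#if defined(__EMSCRIPTEN__) && defined(DEMO_NODE_JS)
#  include <emscripten.h>
/* Ask Node.js directly whether the descriptor is a terminal.
 * DEMO_NODE_JS is a hypothetical macro standing in for the build switch. */
EM_JS(int, demo_is_console, (int fd), {
    return require('tty').isatty(fd) ? 1 : 0;
});
#else
/* Native build (or Emscripten without Node): use the libc answer as-is. */
static int demo_is_console(int fd)
{
    return isatty(fd) ? 1 : 0;
}
#endif
```

A caller would use `demo_is_console(0)` to decide whether stdin is interactive before trying to read data from it.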

Because WebAssembly is JIT-compiled and JavaScript timer precision is limited by Spectre mitigations, the results may vary.

NEON gets almost native speed on v8, but x86_64 does not mainly due to the instruction sets not exactly matching up. Firefox's aarch64 JIT seems to be considerably slower when it comes to WASM.

xxhsum -b on an AArch64 Google Tensor G1 (Cortex-X1 @ 2.80 GHz):

| Browser/engine | Scalar XXH3_64b | Scalar XXH128 | WASM128 XXH3_64b | WASM128 XXH128 |
| --- | --- | --- | --- | --- |
| Node 18.13 / v8 10.2.154 (termux) | 8421.7 MB/s | 8161.7 MB/s | 16485.4 MB/s | 16355.0 MB/s |
| Firefox Nightly for Android 112.0a1 | 7850.5 MB/s | 7870.8 MB/s | 6148.4 MB/s | 5861.7 MB/s |

x86_64 (with AVX2) Core i7-8650U @ 2.11 GHz (turbo 4.20 GHz):

| Browser/engine | Scalar XXH3_64b | Scalar XXH128 | WASM128 XXH3_64b | WASM128 XXH128 |
| --- | --- | --- | --- | --- |
| Node 18.14 / v8 10.2.154 (mingw64) | 3605.8 MB/s | 3473.8 MB/s | 9658.2 MB/s | 11535.5 MB/s |
| Firefox for Windows 110.0.1 | 6406.9 MB/s | 6004.3 MB/s | 9249.1 MB/s | 9288.8 MB/s |

(Not sure why XXH128 is that much faster on x86 but not ARM. Maybe code alignment memes?).

Cyan4973 · 2023-03-14T16:41:36Z

The added source code is pretty clean, this is easy to review.

It seems you want to test it a little longer ?
In which case, could you make this PR "draft", so I don't wonder if it needs to be merged now or later ?

easyaspi314 · 2023-03-14T21:25:42Z

So apparently marking it as a draft cancels CI tests...

easyaspi314 · 2023-03-14T22:04:47Z

Ok I think I figured everything out. I want to add various versions of the Emscripten SDK to the matrix but that isn't a huge priority. I mostly want to test the recent node.js versions.

Amusingly due to the clock() granularity (due to Spectre mitigations), xxhsum -bi0 has only a few possible entries.
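
The effect can be illustrated with a small sketch. It assumes a timer that only advances in fixed steps; the 1 ms step used in the usage note below is a made-up figure, not the real granularity of any engine, and the `demo_` functions are hypothetical helpers rather than xxhsum code.

```c
#include <stdint.h>

/* Round a true duration down to the last tick a coarse timer would report.
 * step_us is the (assumed) timer granularity in microseconds. */
static uint64_t demo_quantize_us(uint64_t true_us, uint64_t step_us)
{
    return (true_us / step_us) * step_us;
}

/* Throughput in MB/s (1 MB taken as 2^20 bytes here).
 * Returns 0 when the timer reports no elapsed time at all. */
static double demo_mb_per_s(uint64_t bytes_per_iter, uint64_t iterations,
                            uint64_t measured_us)
{
    double seconds;
    if (measured_us == 0) return 0.0;
    seconds = (double)measured_us / 1e6;
    return ((double)bytes_per_iter * (double)iterations) / seconds / 1048576.0;
}
```

With a 1000 µs step, true durations of 1200 µs and 1900 µs both read back as 1000 µs, so `demo_mb_per_s` returns the same figure for both runs. That is why a short benchmark can only land on a handful of distinct values.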

easyaspi314 · 2023-03-14T22:17:29Z

This seems to be ready to merge, porting the NEON code was dead simple. Sorry I had to do it the old fashioned way because CI tests were ignored.

xxhash.h

Cyan4973

This looks good to me.
It fits well in existing code base.

I would recommend publishing a set of benchmarks as part of this PR description, as it can serve as reference for future users of the WASM128 code path.

easyaspi314 · 2023-03-14T23:21:33Z

> This looks good to me. It fits well in existing code base.
>
> I would recommend publishing a set of benchmarks as part of this PR description, as it can serve as reference for future users of the WASM128 code path.

A basic xxhsum -b benchmark is up now.

Also, I managed to get a bench on Firefox. It isn't as impressive but it is still a solid improvement on x86 and just a slight decrease on aarch64.

easyaspi314 · 2023-03-15T01:38:55Z

Actually....I have an idea.

emcc emulates NEON intrinsics with SIMDe, and the intrinsics are so similar that I am considering scrapping the intrinsics and using NEON with some slight modifications.

I literally get ~~the same~~ BETTER performance if I just replace the multiply in scrambleAcc and put an asm guard on the secret pointer to keep it from turning it into literals.

Screw these intrinsics. I'm putting wasm on XXH_NEON with a couple of ifdefs.
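
For context on "replacing the multiply": the scramble step multiplies a 64-bit lane by a 32-bit prime, and neither NEON nor SIMD128 is assumed here to offer a cheap full 64-bit lane multiply. The sketch below shows, in scalar C, how that product can be built from two 32x32->64 widening multiplies. It is an illustration of the arithmetic only, not necessarily the exact change made in this PR; the `demo_` names are hypothetical.

```c
#include <stdint.h>

#define DEMO_PRIME32_1 0x9E3779B1U   /* same value as XXH_PRIME32_1 */

/* 32x32->64 widening multiply, the operation SIMD units provide cheaply. */
static uint64_t demo_mult32to64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Reference scramble of one lane using a plain 64-bit multiply. */
static uint64_t demo_scramble_scalar(uint64_t acc, uint64_t key)
{
    acc ^= acc >> 47;
    acc ^= key;
    return acc * DEMO_PRIME32_1;
}

/* Same result using only widening multiplies:
 * acc * P == lo32(acc) * P + ((hi32(acc) * P) << 32)   (mod 2^64) */
static uint64_t demo_scramble_widening(uint64_t acc, uint64_t key)
{
    uint64_t lo, hi;
    acc ^= acc >> 47;
    acc ^= key;
    lo = demo_mult32to64((uint32_t)acc, DEMO_PRIME32_1);
    hi = demo_mult32to64((uint32_t)(acc >> 32), DEMO_PRIME32_1);
    return lo + (hi << 32);
}
```

Both functions return the same value for every input, since the bits of the high product that are shifted past bit 63 are discarded in either form.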

easyaspi314 · 2023-03-15T04:05:39Z

🦀 SIMD128 intrinsics are GONE 🦀

SIMDe generates the same thing, and I was actually able to make it faster, now reaching 16.3 GB/s on my phone. (Mostly due to stopping Clang from constant folding kSecret).
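
The constant-folding issue can be sketched as follows. When the compiler can see that a pointer refers to a known constant table, it may turn every load from it into an immediate; an empty inline-asm statement that claims to modify the pointer hides that knowledge. This is a minimal sketch assuming GCC/Clang extended asm syntax; `DEMO_COMPILER_GUARD`, `demo_secret` and `demo_xor_with_secret` are hypothetical names, not the ones used in xxhash.h.

```c
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
/* Empty asm that "reads and writes" var: emits no instructions, but the
 * optimizer can no longer assume it knows the value of var afterwards. */
#  define DEMO_COMPILER_GUARD(var) __asm__("" : "+r" (var))
#else
#  define DEMO_COMPILER_GUARD(var) ((void)0)
#endif

/* Stand-in for a constant secret table (values are arbitrary). */
static const uint8_t demo_secret[8] = {
    0xb8, 0xfe, 0x6c, 0x39, 0x23, 0xa4, 0x4b, 0xbe
};

static uint64_t demo_xor_with_secret(uint64_t acc)
{
    const uint8_t *secret = demo_secret;
    uint64_t key;
    DEMO_COMPILER_GUARD(secret);        /* pointer is now opaque */
    memcpy(&key, secret, sizeof(key));  /* stays a real memory load */
    return acc ^ key;
}
```

The guard does not change the result; it only influences which instructions the compiler chooses, so whether it helps has to be confirmed by benchmarking on each target.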

I also found a stupidly obvious optimization to XXH3_scrambleAcc_neon. 🤦‍♀️

I will get some new benchmarks tomorrow and probably squash.

easyaspi314 · 2023-03-15T09:17:32Z

Apparently I can't figure out how git rebase works and my commits doubled lol

Edit: fixed. Aside from the CI test commit there is no trace of the intrinsics now 😏

The Emscripten SDK includes arm_neon.h from SIMDeverywhere to port NEON projects. Since SIMD128 is very similar to NEON without half vectors, the double NEON path maps perfectly to WASM SIMD128 and can reach full speed with little modification. (As a matter of fact I was able to make it slightly faster). Note that if XXH3_NEON_LANES is not a multiple of 4, SIMDe will scalarize the single width NEON paths, so 8 is strongly recommended. Also, I found an optimization to scrambleAcc which should have been obvious.

easyaspi314 · 2023-03-15T09:58:58Z

Ok so apparently aarch64 Firefox has slowed down even more.

However, I am not that concerned since aarch64 is likely going to be using V8 or WebKit anyways — Firefox for Android has like a 0.6% usage share lol, and it is still a reasonable speed. Looking over the source code it might be that the compiler can't hoist the shuffle masks out of the loop, but there isn't much I can do. Shuffle masks are inlined with the instruction. The only ones that would probably use it are ARM Macs and AArch64 Windows/Linux users.

~~Also I want to see how Safari (WebKit) on iOS fares since emrun can host a web server.~~ Apparently WebKit doesn't support SIMD.

easyaspi314 · 2023-03-16T05:57:46Z

The last commit is just documentation.

Cyan4973 · 2023-06-30T08:03:56Z

Coming back to this topic, trying to re-acquire the context.

Initially, this PR was providing a new vector code path for WASM.
Doing so would improve performance on WASM target, which currently can only use the scalar code path.
This code was initially reviewed, and accepted.

Quickly after, @easyaspi314 discovered that the WASM compiler emcc could actually ingest NEON intrinsics directly, resulting in similar performance, but removing the need for additional dedicated code path, which is good for maintenance.
The PR was then substantially altered to take advantage of this property, resulting in a generally shortened implementation.

Then, it was discovered that the "similar performance" (as WASM SIMD128) statement is not entirely correct.
This is true when running with node.js for aarch64 target.
But when it comes to x64, some performance is lost (compared to WASM SIMD128), because NEON intrinsics are different enough from AVX ones. The counter to this point is that using NEON intrinsics for WASM still delivers some hefty speed benefit compared to scalar, so the PR is still a net progress.
Another problematic issue is that Firefox for Android actually suffers from the new NEON intrinsic code path, even as the local target is aarch64, to the point of being even slower than scalar code path, making this PR no longer a "guaranteed positive". The counter to this point is to state that Firefox has probably a negligible presence on aarch64.
It would have been appropriate therefore to complete the picture by presenting benchmarks using another system (Chrome maybe?) that is presumed more representative of aarch64 scenario.

Anyway, this is what I recall from memory, and since this PR's content has changed substantially since it was first validated, it deserves another review.

@easyaspi314 , is that a correct assessment of the situation ?

easyaspi314 · 2023-06-30T19:50:54Z

V8 and NodeJS use Chrome's JavaScript engine.

| Browser | Render Engine | JS engine |
| --- | --- | --- |
| Chrome, Chromium | Chromium | V8 |
| Edge | Chromium | V8 |
| Opera, Brave, etc | Chromium | V8 |
| Node.JS | N/A | V8 |
| Firefox | Gecko | SpiderMonkey |
| Safari | WebKit | JavaScriptCore |

I am not that concerned about the Firefox AArch64 problem:

- Firefox AArch64 is quite uncommon:
  - Apple mandates that all iOS browsers use WebKit, so Firefox isn't using its own engine
  - Firefox on Android is niche, with only 100M downloads despite being on the Play Store since 2010
- The only thing preventing the same 16 GB/s as V8 is a missed optimization we don't have control over (see also - MSVC x86). However, browsers are constantly updated so the performance will only improve even if the binaries aren't recompiled. (Judging from the performance it looks like SIMD isn't/wasn't ready for AArch64 and it was scalarized)
- Since it is so uncommon, I would put it in the category of the XXH32 AArch64 issue, where we give up max performance on specific targets (this case Firefox instead of the A53 and X1) for better/more balanced performance on the majority

I should update and run some new benchmarks on the latest versions, as well as get a Safari benchmark as iOS 16 should have SIMD now. (Unfortunately my MBP is out of commission so Safari x64 might be tricky to bench unless I can find a WebKit impl for Windows)

As for validation, the NEON intrinsics we use are literally a 1:1 translation to SIMD128 aside from the umlal needing an additional add, so it is as optimal as wasm intrinsics unless I am missing something.
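
To make the "additional add" concrete, here is a scalar sketch of one accumulate round as I understand it: each lane adds the product of the low and high 32-bit halves of `input ^ secret` to its accumulator. NEON's umlal fuses that multiply with the add, while on SIMD128 the widening multiply and the add are two separate steps, which is what the two statements at the end of the loop mirror. The `demo_` names are hypothetical, and the inputs are taken as already-loaded 64-bit words to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ACC_NB 8   /* number of 64-bit accumulator lanes */

/* Widening multiply of the two 32-bit halves of a 64-bit word. */
static uint64_t demo_lane_product(uint64_t data_key)
{
    return (uint64_t)(uint32_t)data_key * (uint64_t)(uint32_t)(data_key >> 32);
}

/* One accumulate round over a stripe of 8 lanes. */
void demo_accumulate_stripe(uint64_t acc[DEMO_ACC_NB],
                            const uint64_t input[DEMO_ACC_NB],
                            const uint64_t secret[DEMO_ACC_NB])
{
    size_t lane;
    for (lane = 0; lane < DEMO_ACC_NB; lane++) {
        uint64_t const data_key = input[lane] ^ secret[lane];
        uint64_t const product  = demo_lane_product(data_key); /* "extmul" step */
        acc[lane ^ 1] += input[lane];  /* raw input goes to the neighbouring lane */
        acc[lane]     += product;      /* the separate add that umlal would fuse */
    }
}
```

For example, a lane whose `data_key` is `0x0000000200000003` contributes `3 * 2 = 6` to its own accumulator.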

easyaspi314 · 2023-06-30T23:08:37Z

Found that I can run WebKit via Epiphany on Linux.

WebKit x64 isn't doing too hot.

AMD Ryzen 5 3600
Ubuntu 22.04 LTS on WSL2 
Epiphany Technology Preview 45.alpha-2-g08992b26f+
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.4 Safari/605.1.15

scalar

this.program 0.8.1 by Yann Collet
compiled as 32-bit wasm/asmjs little endian with Clang 17.0.0
Sample of 100 KB...
 1#XXH32                         :     102400 ->    62087 it/s ( 6063.2 MB/s)
 3#XXH64                         :     102400 ->   117833 it/s (11507.1 MB/s)
 5#XXH3_64b                      :     102400 ->    79804 it/s ( 7793.4 MB/s)
11#XXH128                        :     102400 ->    77467 it/s ( 7565.1 MB/s)

simd128

this.program 0.8.1 by Yann Collet
compiled as 32-bit wasm/asmjs + simd128 little endian with Clang 17.0.0
Sample of 100 KB...
 1#XXH32                         :     102400 ->    62770 it/s ( 6129.9 MB/s)
 3#XXH64                         :     102400 ->   123634 it/s (12073.7 MB/s)
 5#XXH3_64b                      :     102400 ->    50899 it/s ( 4970.7 MB/s)
11#XXH128                        :     102400 ->    51623 it/s ( 5041.3 MB/s)

Apparently, this is because WebKit converts extmul to zero extend + full multiply without a special case, and then it is forced to scalarize because it will only emit an i64x2 multiply if AVX-512 is enabled. (NEON, x64). I might open up a PR/issue for this to be added because this is pathetic 💀

However, I don't quite understand what is up with Firefox's aarch64 jit. I might have to boot up my raspi so I can get the JIT debugger. (Edit: Really? No Linux aarch64 builds?! Do I seriously have to install tiny11 for this?)

easyaspi314 · 2023-07-01T06:15:09Z

Well I got a build of Firefox's jsshell for AArch64 that can dump native code, but the standard emcc wrappers don't work with it. I might have to write my own WASM glue. 💀

I am truly baffled on how the performance is worse, as the reason isn't blatant like WebKit — all instructions should be lowered to the direct NEON counterparts.

Also I tried adjusting the JIT parameters on Firefox, to no avail. And it isn't baseline jit being used because when I disable optimizing jit it is only 2.9 GB/s.

Cyan4973 · 2023-07-05T08:16:04Z

So, to summarize the situation,
this PR influences performance of XXH3 when compiled for WASM,
under Node.JS, it's much faster (2x-3x),
but under Web Browser environment, whether it is Chrome or Firefox,
it's only a bit faster for x64 cpus (+50%)
but it's slower for aarch64 cpus (-20-30%)
even though the WASM code path uses NEON intrinsics.

Correct ?

easyaspi314 · 2023-07-06T04:23:50Z

> So, to summarize the situation,
> this PR influences performance of XXH3 when compiled for WASM,
> under Node.JS, it's much faster (2x-3x),
> but under Web Browser environment, whether it is Chrome or Firefox,
> it's only a bit faster for x64 cpus (+50%)
> but it's slower for aarch64 cpus (-20-30%)
> even though the WASM code path uses NEON intrinsics.
>
> Correct ?

Basically, although a little mixed up.

For all intents and purposes, Chrome and Node.JS are the same. I have confirmed that the benchmarks are nearly identical because they use the same JS engine (V8) under the hood. Chrome pairs V8 with the Chromium HTML renderer, Node.JS pairs V8 with an OS interface.

Rough diagram of the browser situation:

| "Common name" | JS Engine | Paired HTML engine | Uses | aarch64 | x64 | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| V8 | V8 | Chromium | Node.js, Chrome, Chromium (Browser), Opera, Brave, Vivaldi, etc | much faster | faster | Most popular, most important |
| Firefox | SpiderMonkey | Gecko | Firefox, Tor Browser, Pale Moon, etc | slower | faster | AArch64 rare, simple JIT bug? |
| WebKit | JavaScriptCore | WebKit | Safari, Epiphany | slower (speculation) | slower | SIMD is new, simple JIT bug. |

Firefox and Chrome (and derivatives) are on a rapid release cycle, and the only time you see an old JS engine is an older version of Node.JS (which doesn't matter because it is already a performance improvement), an unsupported OS, or someone who refuses to update. You also rarely run into the issue where someone is stuck with an old toolchain because of how emcc is distributed.

WebKit only recently added SIMD128 so it is unsurprising that there are a few kinks. This one is very simple and could be fixed in a few hours once I figure out how to contribute.

Firefox AArch64 is probably also a simple bug that I can't tell from the source code itself, given how it should be a direct translation. I mostly need to find a minimal reproducible example for the codegen bug to report it.

However, the important thing is that the code that is generated is optimal on the WASM level, and this is a problem with the upstream browsers. I say merge and then we try to fix the Firefox and WebKit bugs.

Additionally, the most important target is V8 because most browsers are Chromium based, and Node.JS will affect server-side performance which is very significant on a large server.

Cyan4973 · 2023-07-06T09:50:59Z

Makefile

@@ -55,6 +55,12 @@ else
 EXT =
 endif

+ifeq ($(NODE_JS),1)


This environment variable NODE_JS=1 must be set at compilation time.
It should probably be documented in README.md.

Cyan4973 · 2023-07-09T22:06:23Z

I'm going to insist a bit regarding the documentation of the new NODE_JS build variable,
since I don't know if I would be able to complete it myself.
I see NODE_JS used in tests, but I don't know if this variable is meant to be used only for tests, or meant to be used only with node.js specifically, if it can be ignored at (presumably) the cost of performance, etc.

easyaspi314 · 2023-07-10T00:15:01Z

Done. I also did things a bit more correctly, added a helpful note to the make check output if it fails, and removed the RUN_ENV override.

NODE_JS adds proper detection of terminals (as the default emsdk's isatty is a no-op) and more importantly gives unrestricted access to the filesystem.

easyaspi314 mentioned this pull request Mar 14, 2023

WebAssembly SIMD128 support #824

Closed

easyaspi314 changed the title ~~Add preliminary WebAssembly SIMD128 implementation~~ [WIP] Add WebAssembly SIMD128 implementation Mar 14, 2023

easyaspi314 marked this pull request as draft March 14, 2023 21:24

easyaspi314 marked this pull request as ready for review March 14, 2023 21:25

easyaspi314 force-pushed the wasm128 branch from d3574ef to 72821c2 Compare March 14, 2023 21:32

easyaspi314 marked this pull request as draft March 14, 2023 21:37

easyaspi314 force-pushed the wasm128 branch from 72821c2 to 7929912 Compare March 14, 2023 21:37

easyaspi314 marked this pull request as ready for review March 14, 2023 21:37

easyaspi314 force-pushed the wasm128 branch 3 times, most recently from e5ceb21 to 23bca55 Compare March 14, 2023 21:55

easyaspi314 changed the title ~~[WIP] Add WebAssembly SIMD128 implementation~~ Add WebAssembly SIMD128 implementation and Node.JS support Mar 14, 2023

Cyan4973 reviewed Mar 14, 2023

Cyan4973 reviewed Mar 14, 2023

Cyan4973 approved these changes Mar 14, 2023

easyaspi314 added 2 commits March 15, 2023 01:35

Add some nodejs workarounds so make check works

207e2ab

- Link in `nodefs` and `noderawfs`
- Use Node's `tty.isatty()` via inline JS instead of the broken libc `isatty()`
- Used with `make NODE_JS=1`

Add node.js wasm/asm.js tests for Github CI

125f89f

Currently only one version of EMCC, testing node 16, 17, and 18. Cache is used because the emsdk has to cache each library which takes a bit.

easyaspi314 added 2 commits March 15, 2023 05:26

Compiler guard XXH32 on WASM128

ab50132

It JITs to the same thing that slowed down SSE4 and NEON.

easyaspi314 force-pushed the wasm128 branch from d435bf7 to ab50132 Compare March 15, 2023 09:27

Fix xxh_x86dispatch.c again

5a9427c

easyaspi314 added 2 commits March 15, 2023 09:34

Proper documentation of the xsecret asm guard, finish comment

fd63843

Document WASM SIMD128, rename an unclear macro.

082a9a8

Cyan4973 mentioned this pull request May 6, 2022

Suggestions list for future evolutions #458

Open

Cyan4973 reviewed Jul 6, 2023

Cyan4973 approved these changes Jul 6, 2023

Document NODE_JS, remove RUN_ENV hack, use proper -sNODERAWFS flag

4f092f2

Cyan4973 merged commit c234b94 into Cyan4973:dev Jul 10, 2023
57 checks passed
