Commit
Speed up the size hash for the 17-to-128-byte range by using independent accumulators.
Faster (better IPC) to make the computation slightly more independent. Results are the geometric mean of N = 6 runs.

Machine:
```
11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
```
CC:
```
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
```
Bench command:
```
# Change is only for the 17-128 range.
$> ./benchHash xxh3 --mins=1 --maxs=256 --minl=0 --maxl=0
```
Aggregated results for the [17, 128] size range, reported as the geometric mean of all speedups:
- Latency for small inputs of fixed size: 1.073
- Throughput for small inputs of fixed size (from 1 to 256 bytes): 1.173
- Latency for small inputs of random size [1-N]: 1.051
- Throughput for random-size inputs [1-N]: 1.134

So roughly a 5-17% improvement.
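The independent-accumulator idea can be sketched as follows. This is a hedged illustration, not the actual XXH3 code: `mix`, `fold_one_acc`, and `fold_two_acc` are hypothetical names, and `mix` is a trivial stand-in for the real `mix16B`. The point is only the shape of the dependency chains.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in mixer (the real mix16B reads 16 bytes and mixes with the secret). */
uint64_t mix(uint64_t x) { return x * 0x9E3779B185EBCA87ULL; }

/* One accumulator: every addition depends on the previous one,
   so the chain serializes and limits IPC. */
uint64_t fold_one_acc(const uint64_t* v, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i += 2)
        acc += mix(v[i]) + mix(v[i + 1]);
    return acc;
}

/* Two independent accumulators: the two chains can execute in parallel,
   and are combined only once at the end. */
uint64_t fold_two_acc(const uint64_t* v, size_t n) {
    uint64_t acc0 = 0, acc1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        acc0 += mix(v[i]);
        acc1 += mix(v[i + 1]);
    }
    return acc0 + acc1;
}
```

Since 64-bit addition is associative and commutative modulo 2^64, both variants produce the same result; only the scheduling freedom differs.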
---
It could be that this strategy works better on more modern CPUs, such as Tiger Lake, and with more recent compilers, such as gcc-11. On Skylake CPUs, using an older gcc-9, this strategy seems less advantageous.
Surprisingly, I would have expected a relatively modern architecture like the M1 Pro to benefit from this more parallel design, but I don't see any benefit there either.
Could it be that the compiler had already figured it out?
---
I would guess it's more CPU-related.
Looking at the disassembly, `gcc-9 -march=skylake -O3` vs `gcc-11 -march=tigerlake -O3` produce very similar results. Neither gcc-9 nor gcc-11 uses the 2x accumulators before this change, and both use 2x accumulators after this change, although both gcc-9 and gcc-11 compute the 2x `mix16B` in parallel. There is a serialization point between each branch, however, whereas the new code doesn't have that.
The biggest difference in the ASM is the BB structure. The ASM basically follows the C code, so the new code has a lot more unconditional execution / fewer jumps.
I would be curious what the following perf counters look like on SKL: `branch-misses`, `lsd.uops`, `idq.dsb_uops`, `dsb2mite_switches.count`, as it may be that the new code layout/branch structure causes some catastrophic FE behavior.
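For reference, those counters could be collected with something like the following. This is a sketch: it reuses the `benchHash` invocation from the commit message, and the event names are the Linux `perf` names for recent Intel cores, which may differ by kernel and CPU generation.

```shell
# Front-end / branch counters around the benchmark run (Intel SKL/TGL naming).
perf stat -e branch-misses,lsd.uops,idq.dsb_uops,dsb2mite_switches.count \
    ./benchHash xxh3 --mins=1 --maxs=256 --minl=0 --maxl=0
```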
Will look at that later, although I'm not really familiar enough with the M1 ISA to know for sure.
Which clang version were you using? And was it Apple's version (which has HW tuning) or Asahi Linux's?
---
Are you seeing a regression, or just no difference?
---
As I mentioned in the other issue, AArch64 is always going to be bottlenecked by the multiplies.
x64 has `mulq`, which is usually really fast (3 cycles on recent AMD/Intel P-cores, 6 on Intel E-cores) and calculates the full 128-bit result. (And the `mov` to deal with `RAX:RDX` is usually zero-cycle register renaming.)
AArch64 needs to multiply the low and high halves independently, and currently no processors I know of can fuse them. Therefore, more time is spent on latency cycles and stalls, since it has to wait for two multiplies, and compilers like to put the low and high halves together in case of future fusing.
```
mul
mul
umulh
umulh
umulh
(3 cycle stall)
umulh
(2 cycle stall)
```
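The 64x64-to-128-bit multiply under discussion can be sketched like this. It is a hedged illustration, not the actual XXH3 helper: `mul128_fold64` is a hypothetical name, and the `__uint128_t` type is a GCC/Clang extension. On x86-64 the product compiles to a single `mulq` producing `RDX:RAX`; on AArch64 it becomes a separate `mul` (low half) and `umulh` (high half).

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit product, folded back to 64 bits by XORing halves.
   x86-64: one mulq.  AArch64: a mul/umulh pair that cannot be fused today. */
uint64_t mul128_fold64(uint64_t a, uint64_t b) {
    __uint128_t product = (__uint128_t)a * b;   /* GCC/Clang extension */
    return (uint64_t)product ^ (uint64_t)(product >> 64);
}
```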
---
Just no difference.
---
I see.
Looking at the disassembly, however, I'd say the second `acc_end` variable isn't really needed and probably does more to interfere with register allocation than anything else.
Might post a patch for it soon.
---
From follow-up experimentation, it looks like the majority of the speed impact can be attributed to the changed input access pattern implemented within the nested `if() { ... }`.
Note that 2 accumulators vs. 1 can also contribute to the speed impact, but rather modestly by comparison.
In contrast, the change of input read order can result in wild swings, from sometimes no impact at all to significant ones, depending on the exact compiler, version, and flags.
Note that I also find that the new `if()` nest order, implemented in this commit, feels more "natural" to read, since it mimics the behavior of the simpler `XXH_SIZE_OPT` variant, which simply reads more data with each loop iteration. So readability is rather improved, but the impact on performance is non-trivial, unfortunately.
For the time being, it's essentially an observation; I don't have a formal explanation yet. I suspect a relation with input read pointer predictability. Maybe analyzing HW counters would help to produce a more accurate diagnosis.
It also suggests an obvious solution: just revert the input access pattern to what it was before this commit. This would restore most of the performance as it was.
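The "natural" nest order described above can be sketched roughly like this. This is hypothetical code, not the actual XXH3 implementation: `mix16_stub` and `hash_17to128_sketch` are illustrative names, and the stub reads only one byte so the control-flow shape stays visible. Each deeper `if()` reads more of the input, mirroring the `XXH_SIZE_OPT` loop.

```c
#include <stdint.h>
#include <stddef.h>

/* Trivial stand-in for mix16B (the real one mixes 16 bytes with the secret). */
uint64_t mix16_stub(const uint8_t* p) { return p[0]; }

/* Nested-if dispatch for 17..128 bytes: unconditional reads first,
   extra symmetric head/tail reads only for larger sizes. */
uint64_t hash_17to128_sketch(const uint8_t* in, size_t len) {
    uint64_t acc = (uint64_t)len;
    acc += mix16_stub(in) + mix16_stub(in + len - 16);           /* always */
    if (len > 32) {
        acc += mix16_stub(in + 16) + mix16_stub(in + len - 32);
        if (len > 64) {
            acc += mix16_stub(in + 32) + mix16_stub(in + len - 48);
            if (len > 96)
                acc += mix16_stub(in + 48) + mix16_stub(in + len - 64);
        }
    }
    return acc;
}
```

Reversing the nest (testing the largest size first, with the unconditional reads last) computes the same values but changes the order in which the input pointers are touched, which is the access-pattern difference discussed here.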
---
Proper regressions? Random, latency, or throughput benchmarks?
Unfortunately I have no SKL at hand to reproduce on.
I would bet it's not about the pointer access pattern, but the FE counters. On TGL there is no regression for things like `lsd`, `dsb`, `mite`, `branch-misses`, etc., but that may not be true for SKL. At that point, however, alignment of the caller / exact code layout starts to play a big role, so generally it ends up as: configuration A is faster for certain alignments and B is faster for others.
---
Side effects of instruction-address alignment are a bane for benchmarking high-performance loops.
If that's what happens here, I would expect some of these side effects to leave visible traces in a well-chosen list of hardware counters.
Another aspect to consider is that, if one expression of the nested `if()` is more susceptible to these random side effects and the other is not, it's still beneficial to select the least vulnerable expression.
---
So the above counters are generally signals for this sort of thing; I would expect at least one of them to be different before/after the change on SKL.
Maybe, although it's pretty arbitrary. You are deciding the size of the common-head/common-tail BBs.
If the 2x unconditional `mix16B` come first, the head BB is large (less freedom) but the tail is smaller (more freedom). If the 2x unconditional `mix16B` come last, the head BB is small (more freedom) but the tail is large (less freedom). More/less freedom here means freedom for inlining (or outlining a common prefix).
Now, I'd argue that since every BB MUST go through the function entry, making that one larger has less side effect on compiler freedom. But there is a meaningful decision about inlining/outlining the tail, so it would make slightly more sense to keep it smaller.
The most important factor in this case, however, is just the BB order and which instructions are in them, but that is important the same way a butterfly flapping its wings is important: it may have dramatic consequences, but not in any reasonably predictable manner.
If you look at the actual codegen, the 2x unconditional at the end produces an extra unconditional jmp for each of the BBs (except size <= 32), so more jumps probably implies more can go wrong in the FE. But in this case, the difference in BB structures in the IR neither enforces nor implies that. Whether two or three branches happen to hash to the same BTB entry will ultimately be what plays a big role, and that can happen either way, by "luck", given the codegen logic of either LLVM or GCC AFAIK (minus of course inline asm, which has plenty of its own de-optimizing characteristics that rarely show up in microbenchmarks, or function alignment, but we want this function to inline).
Edit:
It's also worth noting that these FE states are whole-program dependent, so where the branches in the benchmark loop end up is just as important. IMHO it makes sense to optimize these things if you control the inner AND outer loop, but not if you only control one, beyond maybe "fewer basic blocks is better".
---
Do you want a patch to drop the second accum?
---
Not necessarily; it seems to have less impact.
I will run an experiment to find out.
---
Okay. Just based on the ASM (and some benchmarks on TGL), it seems more likely to get in the way of register allocation than to do any good. So my thinking was: unless microbenchmarks (which typically don't stress register allocation much) really push for it, it makes sense to drop it. But LMK what you decide (or just post it yourself :)
---
After testing: I tried the double accumulator, but the impact was negative.
I think I will simply revert this change and keep the single-accumulator design, coupled with the original nest order.
This is the simplest way to restore performance to what it was.
It will still be possible to revisit this topic after the release.
---
CPU: i7-9700K
Compiler: Ubuntu clang version 12.0.0-3ubuntu1~20.04.5
I tried multiple compilers, but the one above felt the most "stable", and therefore I used it more.
I also used gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1) quite a bit, but was more concerned by the large variations I could measure on the produced binaries.