Speedup size hash for 17to128 by using independent accumulators.
Faster (better IPC) to make the computation slightly more independent.

Results are the geometric mean of N = 6 runs.

Machine:
```
11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
```

CC:
```
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
```

Bench Command:
```
// Change is only for 17-128 range.
$> ./benchHash xxh3 --mins=1 --maxs=256 --minl=0 --maxl=0
```

Aggregated results for the [17, 128] size range, reported as the geometric mean of all speedups:

Latency, small inputs of fixed size:
    - 1.073
Throughput, small inputs of fixed size (from 1 to 256 bytes):
    - 1.173
Latency, small inputs of random size [1-N]:
    - 1.051
Throughput, small inputs of random size [1-N]:
    - 1.134

So roughly a 5-17% improvement.
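
As a rough illustration of the "better IPC through independence" reasoning (a generic sketch of the accumulator-splitting idea, not the xxhash code itself; the real change is in the diff below): with a single accumulator every addition waits on the previous one, while two independent accumulators split the dependency chain and are merged only once at the end, as the patch does with XXH3_avalanche(acc + acc_end).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One accumulator: every add depends on the previous add, forming a single
 * long dependency chain. */
static uint64_t sum_one_acc(const uint64_t* v, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Two accumulators: the chains are independent, so an out-of-order core can
 * execute adds from both chains in the same cycle; the chains are merged
 * only once at the end. */
static uint64_t sum_two_acc(const uint64_t* v, size_t n)
{
    uint64_t acc = 0, acc_end = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        acc     += v[i];
        acc_end += v[i + 1];
    }
    if (n & 1)
        acc += v[n - 1];
    return acc + acc_end;
}

int main(void)
{
    const uint64_t v[6] = {1, 2, 3, 4, 5, 6};
    printf("%llu %llu\n",
           (unsigned long long)sum_one_acc(v, 6),
           (unsigned long long)sum_two_acc(v, 6));
    return 0;
}
```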
goldsteinn committed Nov 9, 2022
1 parent b2929c4 commit 7221be4
20 changes: 11 additions & 9 deletions xxhash.h
```diff
@@ -3960,31 +3960,33 @@ XXH3_len_17to128_64b(const xxh_u8* XXH_RESTRICT input, size_t len,
     XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); (void)secretSize;
     XXH_ASSERT(16 < len && len <= 128);
 
-    { xxh_u64 acc = len * XXH_PRIME64_1;
+    { xxh_u64 acc = len * XXH_PRIME64_1, acc_end;
 #if XXH_SIZE_OPT >= 1
         /* Smaller and cleaner, but slightly slower. */
         unsigned int i = (unsigned int)(len - 1) / 32;
         do {
             acc += XXH3_mix16B(input+16 * i, secret+32*i, seed);
             acc += XXH3_mix16B(input+len-16*(i+1), secret+32*i+16, seed);
         } while (i-- != 0);
+        acc_end = 0;
 #else
+        acc += XXH3_mix16B(input+0, secret+0, seed);
+        acc_end = XXH3_mix16B(input+len-16, secret+16, seed);
         if (len > 32) {
+            acc += XXH3_mix16B(input+16, secret+32, seed);
+            acc_end += XXH3_mix16B(input+len-32, secret+48, seed);
             if (len > 64) {
+                acc += XXH3_mix16B(input+32, secret+64, seed);
+                acc_end += XXH3_mix16B(input+len-48, secret+80, seed);
+
                 if (len > 96) {
                     acc += XXH3_mix16B(input+48, secret+96, seed);
-                    acc += XXH3_mix16B(input+len-64, secret+112, seed);
+                    acc_end += XXH3_mix16B(input+len-64, secret+112, seed);
                 }
-                acc += XXH3_mix16B(input+32, secret+64, seed);
-                acc += XXH3_mix16B(input+len-48, secret+80, seed);
             }
-            acc += XXH3_mix16B(input+16, secret+32, seed);
-            acc += XXH3_mix16B(input+len-32, secret+48, seed);
         }
-        acc += XXH3_mix16B(input+0, secret+0, seed);
-        acc += XXH3_mix16B(input+len-16, secret+16, seed);
 #endif
-        return XXH3_avalanche(acc);
+        return XXH3_avalanche(acc + acc_end);
     }
 }
```

15 comments on commit 7221be4

@Cyan4973
Owner

It could be that this strategy is better for more modern CPUs, such as Tiger Lake, and more recent compilers, such as gcc-11. On Skylake CPUs, using an older gcc-9, this strategy seems less advantageous.

Surprisingly, I would have expected a relatively modern architecture like the M1 Pro to benefit from this more parallel design, but I don't see any benefit either. Could it be that the compiler already figured it out?

@goldsteinn
Contributor Author

> It could be that this strategy is better for more modern CPUs, such as Tiger Lake, and more recent compilers, such as gcc-11. On Skylake CPUs, using an older gcc-9, this strategy seems less advantageous.

I would guess it's more CPU related. Looking at the disassembly, gcc-9 -march=skylake -O3 vs. gcc-11 -march=tigerlake -O3 produces very similar results.

Neither gcc-9 nor gcc-11 uses two accumulators before this change, and both use two accumulators after this change, although both gcc-9 and gcc-11 compute the 2x mix16B in parallel.

There is a serialization point between each branch, however, whereas the new code doesn't have that.

The biggest difference in the ASM is the BB structure. The ASM basically follows the C code, so the new code has a lot more unconditional execution and fewer jumps.

I would be curious what the following perf counters look like on SKL: branch-misses, lsd.uops, idq.dsb_uops, dsb2mite_switches.count, as it may be that the new code layout/branch structure causes some catastrophic FE behavior.
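
(A sketch of how those counters could be collected with Linux perf; event availability and exact spelling vary by CPU and kernel:)

```
$> perf stat -e branch-misses,lsd.uops,idq.dsb_uops,dsb2mite_switches.count \
       ./benchHash xxh3 --mins=1 --maxs=256 --minl=0 --maxl=0
```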

> Surprisingly, I would have expected a relatively modern architecture like the M1 Pro to benefit from this more parallel design, but I don't see any benefit either. Could it be that the compiler already figured it out?

I will look at that later, although I'm not really familiar enough with the M1 to know for sure. What clang version were you using? And was it Apple's version (which has HW tuning) or Asahi Linux?

@goldsteinn
Contributor Author
Commented on 7221be4, Jul 16, 2023

> It could be that this strategy is better for more modern CPUs, such as Tiger Lake, and more recent compilers, such as gcc-11. On Skylake CPUs, using an older gcc-9, this strategy seems less advantageous.
>
> Surprisingly, I would have expected a relatively modern architecture like the M1 Pro to benefit from this more parallel design, but I don't see any benefit either. Could it be that the compiler already figured it out?

Are you seeing a regression, or just no difference?

@easyaspi314
Contributor
Commented on 7221be4, Jul 16, 2023

> Surprisingly, I would have expected a relatively modern architecture like the M1 Pro to benefit from this more parallel design, but I don't see any benefit either.

As I mentioned in the other issue, AArch64 is always going to be bottlenecked by the multiplies.

x64 has mulq, which is usually really fast (3 cycles on recent AMD/Intel P-cores, 6 on Intel E-cores) and calculates the full 128-bit result. (And the mov to deal with RDX:RAX is usually zero-cycle register renaming.)

| Microarch | Name | Type | Latency | Throughput |
| --- | --- | --- | --- | --- |
| Prescott | Pentium 4 x64 | Garbage | 11 | ? |
| Merom/Wolfdale | Core 2 | General | 7 | 1/4 |
| Nehalem | Gen1 Core | General | 3 | 1/2 |
| Sandy Bridge and later | Gen2+ Core | General | 3 | 1 |
| Atom | Atom x64 | Efficiency | 14 | 1/14 |
| Silvermont | Atom Bay Trail | Efficiency | 7 | 1/7 |
| Goldmont and later | Lake E-cores | Efficiency | 6 | 1/2 |
| Xeon Phi | Xeon Phi AVX512 | HPC | 8 | ? |
| K8 | Opteron | General | 4-5 | 1/2 |
| Bulldozer | FX-6xxx | General | 6 | 1/4 |
| Excavator | FX-9xxx | General | 5 | 1/4 |
| Bobcat | E350 | Efficiency | 6-7 | ? |
| Jaguar | A4-5000, PS4/XB1 | Efficiency | 6 | 1/5 |
| Zen | Ryzen 1xxx | All | 3 | 1/2 |
| Zen 2 and later | Ryzen 3xxx | All | 3 | 1 |

AArch64 needs to compute the low and high halves with separate multiplies (mul and umulh), and currently no processors I know of can fuse them. Therefore, more time is spent on latency cycles and stalls, since it has to wait for two multiplies, and compilers like to put the low and high halves together in case of future fusing.

| Processor | Internal multiplier | Performance target | Latency mul | Throughput mul | Latency umulh | Throughput umulh | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cortex-A55 | 32-bit | Efficiency | 4-5 | 1/3-1/2 | 6 | 1/2 | in-order |
| Cortex-A57/A72 | 32-bit | Midrange | 5 | 1/3 | 6 | 1/4 | umulh 3 cycle stall |
| Cortex-A76 | 32-bit | Midrange | 4 | 1/3 | 5 | 1/4 | umulh 2 cycle stall |
| Cortex-X1 | 64-bit | Performance | 2 | 2 | 3 | 2 | |
| Apple M1 Fire, Ice | 64-bit | Performance, Midrange | 3 | 2 | 3 | 2 | |
| Fujitsu A64FX | 64-bit | Performance | 5 | 1? | 5 | 1? | documents pipelines used instead of throughput |
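
For reference, a minimal, generic sketch of the 64x64 -> 128-bit multiply-and-fold these tables are measuring, assuming a 64-bit GCC/Clang target with __uint128_t (an illustration, not the exact xxhash helper). On x86-64 the 128-bit product typically lowers to a single mulq with the result in RDX:RAX; on AArch64 it compiles to a mul for the low half plus a umulh for the high half.

```c
#include <stdint.h>
#include <stdio.h>

/* 64x64 -> 128-bit multiply, folded back to 64 bits by XORing the high and
 * low halves -- the same shape of operation XXH3's mixing step is built on.
 * x86-64: one mulq (plus cheap register moves); AArch64: a mul + umulh pair. */
static uint64_t mul128_fold64(uint64_t lhs, uint64_t rhs)
{
    __uint128_t product = (__uint128_t)lhs * (__uint128_t)rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
}

int main(void)
{
    /* Arbitrary demo operands. */
    printf("%016llx\n",
           (unsigned long long)mul128_fold64(0x0123456789ABCDEFULL,
                                             0xFEDCBA9876543210ULL));
    return 0;
}
```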

@Cyan4973
Owner

> Are you seeing a regression, or just no difference?

Just no difference

@goldsteinn
Contributor Author

> Are you seeing a regression, or just no difference?
>
> Just no difference

I see.

Looking at the disassembly, however, I'd say the second acc_end variable isn't really needed and probably does more to interfere with register allocation than anything else.

I might post a patch for it soon.

@Cyan4973
Owner
Commented on 7221be4, Jul 16, 2023

From follow-up experiments, it looks like the majority of the speed impact can be attributed to the changed input access pattern implemented within the nested if() { ... }.

Note that 2 accumulators vs. 1 can also contribute to the speed impact, but rather modestly by comparison.

In contrast, the change of input read order can result in wild swings, from sometimes no impact at all to significant ones, depending on the exact compiler, version, and flags.

Note that I also find that the new if() nest order, implemented in this commit, feels more "natural" to read, since it mimics the behavior of the simpler XXH_SIZE_OPT variant, which simply reads more data with each loop iteration. So readability is rather improved. But the impact on performance is non-trivial, unfortunately.

For the time being, it's essentially an observation. I don't have a formal explanation yet. I suspect a relation with input read pointer predictability. Maybe analyzing HW counters would help produce a more accurate diagnosis.

It also gives an obvious solution: just revert the input access pattern to what it was before this commit. This would restore most of the performance as it was.

@goldsteinn
Contributor Author
Commented on 7221be4, Jul 17, 2023

> From follow-up experiments, it looks like the majority of the speed impact can be attributed to the changed input access pattern implemented within the nested if() { ... }.

Proper regressions?

Random, latency, or throughput benchmarks?

Unfortunately I have no SKL at hand to reproduce on.

> Note that 2 accumulators vs. 1 can also contribute to the speed impact, but rather modestly by comparison.
>
> In contrast, the change of input read order can result in wild swings, from sometimes no impact at all to significant ones, depending on the exact compiler, version, and flags.
>
> Note that I also find that the new if() nest order, implemented in this commit, feels more "natural" to read, since it mimics the behavior of the simpler XXH_SIZE_OPT variant, which simply reads more data with each loop iteration. So readability is rather improved. But the impact on performance is non-trivial, unfortunately.
>
> For the time being, it's essentially an observation. I don't have a formal explanation yet. I suspect a relation with input read pointer predictability. Maybe analyzing HW counters would help produce a more accurate diagnosis.

I would bet it's not about the ptr access pattern, but the FE counters. On TGL there is no regression for things like lsd, dsb, mite, branch-misses, etc., but that may not be true for SKL. At that point, however, the alignment of the caller/exact code layout starts to play a big role, so generally it will be that configuration A is faster for certain alignments and B is faster for others.

> It also gives an obvious solution: just revert the input access pattern to what it was before this commit. This would restore most of the performance as it was.

@Cyan4973
Owner

> I would bet it's not about the ptr access pattern, but the FE counters. On TGL there is no regression for things like lsd, dsb, mite, branch-misses, etc., but that may not be true for SKL. At that point, however, the alignment of the caller/exact code layout starts to play a big role, so generally it will be that configuration A is faster for certain alignments and B is faster for others.

Side effects of instruction address alignment are a bane for benchmarking high-performance loops. If that's what is happening here, I would expect some of these side effects to leave visible traces in a well-chosen list of hardware counters.

Another aspect to consider is that, if one expression of the nested if() is more susceptible to these random side effects, and the other is not, it's still beneficial to select the least vulnerable expression.

@goldsteinn
Contributor Author
Commented on 7221be4, Jul 18, 2023

> I would bet it's not about the ptr access pattern, but the FE counters. On TGL there is no regression for things like lsd, dsb, mite, branch-misses, etc., but that may not be true for SKL. At that point, however, the alignment of the caller/exact code layout starts to play a big role, so generally it will be that configuration A is faster for certain alignments and B is faster for others.
>
> Side effects of instruction address alignment are a bane for benchmarking high-performance loops. If that's what is happening here, I would expect some of these side effects to leave visible traces in a well-chosen list of hardware counters.

So the counters above are generally signals for this sort of thing. I would expect at least one of them to differ before/after the change on SKL.

> Another aspect to consider is that, if one expression of the nested if() is more susceptible to these random side effects, and the other is not, it's still beneficial to select the least vulnerable expression.

Maybe, although it's pretty arbitrary. You are deciding the sizes of the common head/common tail BBs.

If the 2x unconditional mix16B calls come first, the head BB is large (less freedom) but the tail is smaller (more freedom). If the 2x unconditional mix16B calls come last, the head BB is small (more freedom) but the tail is large (less freedom). More/less freedom here means freedom for inlining (or outlining a common prefix).

Now I'd argue that since every path MUST go through the function entry, making that block larger has less side effect on compiler freedom. But there is a meaningful decision about inlining/outlining the tail, so it would make slightly more sense to keep it smaller.

The most important factor in this case, however, is just the BB order and which instructions are in them, but that is important the same way a butterfly flapping its wings is important. It may have dramatic consequences, but not in any reasonably predictable manner.

If you look at the actual codegen, putting the 2x unconditional mix16B at the end produces an extra unconditional jmp for each of the BBs (except size <= 32), so more jumps probably implies more that can go wrong in the FE. But in this case, the difference in BB structure in the IR neither enforces nor implies that. As well, whether two or three branches happen to hash to the same BTB entry will ultimately be what plays the biggest role, and that can happen either way ("luck" given the codegen logic of either LLVM or GCC AFAIK, minus of course inline asm, which has plenty of its own de-optimizing characteristics that rarely show up in microbenchmarks, or function alignment, but we want this function to inline).

Edit: It's also worth noting that these FE states are whole-program dependent, so where the branches in the benchmark loop land is ultimately just as important. IMHO it makes sense to optimize for these things if you control the inner AND the outer loop, but not if you only control one, beyond maybe "fewer basic blocks is better".

@goldsteinn
Contributor Author

Do you want a patch to drop the second accum?

@Cyan4973
Owner

> Do you want a patch to drop the second accum?

Not necessarily; it seems to have less impact. I will run an experiment to determine.

@goldsteinn
Contributor Author

> Do you want a patch to drop the second accum?
>
> Not necessarily; it seems to have less impact. I will run an experiment to determine.

Okay. Just based on the ASM (and some benchmarks on TGL), it seems more likely to get in the way of register allocation than to do any good. So my thinking was that unless microbenchmarks (which typically don't stress register allocation much) really push for it, it makes sense to drop it. But LMK what you decide (or just post it yourself :)

@Cyan4973
Owner

After testing, I tried the double accumulator, but the impact was negative. I think I will simply revert this change and keep the single-accumulator design, coupled with the original nest order. This is the simplest way to restore performance to what it was.

It will still be possible to revisit this topic after the release.

@goldsteinn
Contributor Author
Commented on 7221be4, Jul 18, 2023

> Can you link cpu/compiler details?

cpu: i7-9700K
compiler: Ubuntu clang version 12.0.0-3ubuntu1~20.04.5

I tried multiple compilers, but the one above felt the most "stable" and therefore I used it more.

I also used gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1) quite a bit, but was more concerned by the large variations I could measure on the produced binaries.
