Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a typo that cause inconsistent hash between streaming and stateless way for XXH3 128-bit variant with custom secret and seed 0 #894

Merged
merged 2 commits into from
May 7, 2024

Conversation

hltj
Copy link
Contributor

@hltj hltj commented Nov 4, 2023

I found that the streaming way returns different hash from XXH3_128bits_withSecretandSeed() with seed 0 and a custom secret (not XXH3_kSecret) when the input length not greater than XXH3_MIDSIZE_MAX, while the 64-bit variant works fine.

I read the code and felt that there may be a typo that if there is also if (state->useSeed) in XXH3_128bits_digest(), it will be fine too.
I also added the test cases for that _withSecretandSeed is the same as _withSeed() when len <= XXH3_MIDSIZE_MAX for both 64-bit and 128-bit variant.

@t-mat
Copy link
Contributor

t-mat commented Nov 5, 2023

Hi @hltj , thank you for the report and PR.

I also have confirmed your issue and this PR with the following code:
https://gist.github.com/t-mat/700503ca450a4e997e04af84b97fcb7f

  • XXH3_128bits_digest() has a bug.
  • It only happens when
    • length is less than or equal to XXH3_MIDSIZE_MAX (240).
    • XXH3_state_t::seed is 0.
  • This PR # 894 fixes it effectively.

note:


@Cyan4973 , @easyaspi314

Although this PR fixes the bug, it is a breaking change, technically.
I mean if this fix will be merged and released in the next release,
XXH3_128bits_digest() may return "different hash value" suddenly.

So, I think we should think how we can/should mitigate it.

  • Just fix it.
  • Fix it. But also change major version to indicate breaking change / incompatibility in semver way.
    • We're still in 0.x though.
  • Don't fix it directly, but set [[deprecated]]. And add new API (or API set) which works properly.
  • Add compile time option which fixes this bug.
    • We should decide the default value/behavior of it.

@hltj
Copy link
Contributor Author

hltj commented Nov 5, 2023

Hi @t-mat , thank you so much for confirming, explaining the details as well as the fix/release suggestions.

The only I want to clarify that it is not XXH3_128bits_withSecretandSeed()'s issue, but XXH3_128bits_digest's. And the change will not break the behavior of XXH3_128bits_withSecretandSeed(), but it will break the behavior of XXH3_128bits_digest() in the corner case.

@t-mat
Copy link
Contributor

t-mat commented Nov 5, 2023

@hltj thanks! You're right. I've fixed my comment to prevent further misunderstanding.

@easyaspi314
Copy link
Contributor

easyaspi314 commented Nov 5, 2023

  • Just fix it.
  • Fix it. But also change major version to indicate breaking change / incompatibility in semver way.
  • We're still in 0.x though.
  • Don't fix it directly, but set [[deprecated]]. And add new API (or API set) which works properly.
  • Add compile time option which fixes this bug.
  • We should decide the default value/behavior of it.

Idea: Since secretAndSeed is the only thing broken afaik, we only need one API to wrap.

  1. Make a new API for the secretAndSeed reset
  2. Macro define it to a new name:
// do the namespace and stuff
XXH_DEPRECATED void XXH3_reset_withSecretandSeed();
void XXH3_reset_withSecretandSeed_new();
// end of code
#define XXH3_reset_withSecretandSeed_old XXH3_reset_withSecretandSeed
#ifndef XXH3_OLD_RESET_WITH_SECRET_AND_SEED
#  define XXH3_reset_withSecretandSeed XXH3_reset_withSecretandSeed_new
#endif

We have used opt-in macros when we did the internal API rename.
3. Repurpose useSeed as stateType where we can set a tag for the old API:

enum XXH3_stateType {
    XXH3_STATE_TYPE_SECRET,
    XXH3_STATE_TYPE_SEED,
    XXH3_STATE_TYPE_OLD_SECRET_AND_SEED
};

Then in XXH3_128bits_finalize, do

if (state->stateType == XXH3_STATE_TYPE_SEED
    || XXH_unlikely(state->stateType == XXH3_STATE_TYPE_OLD_SECRET_AND_SEED && state->seed != 0)
)

And in XXH3_64bits_finalize we can do state->stateType != XXH3_STATE_TYPE_SECRET.

Make sure we make this very clear in documentation and comments.

This way, apps compiled against libxxhash.so won't change since they will use the old symbol, and defining that macro would make theoretical "old apps" that expect the broken hash values for some reason work.

@Cyan4973 Cyan4973 self-assigned this May 7, 2024
@Cyan4973
Copy link
Owner

Cyan4973 commented May 7, 2024

Coming back on this topic,
after reading the proposed code change, and the comments,

it looks to me that this is a pretty clear case of detecting and fixing a bug.

XXH3_128bits_withSecretandSeed() has a very clear contract, which is btw the same as XXH3_64bits_withSecretandSeed():

These variants generate hash values using either:

  • seed for "short" keys (< XXH3_MIDSIZE_MAX = 240 bytes)
  • secret for "large" keys (>= XXH3_MIDSIZE_MAX).

So deviating from this contract is a bug, clear and simple.

Beyond fixing the bug, what matters is to build an ability to detect such bug in the future, which this PR also provides.

So, as far as this PR is concerned, I think it's good to go.

The next question is probably: does this fix deserves a new release ? or even an urgent release ?

Well, I believe that the entry point XXH3_128bits_withSecretandSeed(), which is the only one impacted, is not employed much out there. As far as I can tell, it's fairly niche, otherwise we would have received more bug reports of this kind.
And it is likely going to remain so in the foreseeable future, since the primary motivation for this prototype was to reduce latency in the case where a seed is employed and input size is >= 240 bytes forcing the creation of a custom secret at every hash (which only matters if input is fairly small, like ~1 KB range). That's a very specific scenario.
Furthermore, since the bug is only active when seed = 0, and this case essentially defeats the point of using XXH3_128bits_withSecretandSeed() (just use regular XXH3_128bits() in this case).
edit : and now that I'm analyzing the tests, I realize there are even more conditions:

  • the secret must be different from the one generated with XXH3_generateSecret_fromSeed()
  • the bug is only active when using Streaming mode

Given this pretty huge list of pre-conditions, I don't anticipate any real-world impact.

That being said, this bug definitely deserves a mention on the release page.

@Cyan4973
Copy link
Owner

Cyan4973 commented May 7, 2024

As a side note:

Beyond fixing the bug, what matters is to build an ability to detect such bug in the future, which this PR also provides.

I was testing this capability,
and, sure it works fine for sanity_test,
but it does not for xxhsum's internal sanity test (which is quicker / less extensive).

That's because, even though these tools use almost the same code,
it's not the same code, meaning similar functions are just duplicated.

We may benefit from having a single source of truth for the common parts of the code base,
so that updates like this one benefit both targets.

Copy link
Owner

@Cyan4973 Cyan4973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hltj , that's a great investigation and great fix !

@Cyan4973 Cyan4973 merged commit a9b2f18 into Cyan4973:dev May 7, 2024
62 checks passed
Cyan4973 added a commit that referenced this pull request May 7, 2024
Cyan4973 added a commit that referenced this pull request May 8, 2024
Cyan4973 added a commit that referenced this pull request May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants