
Unicode segmenter performance #1703

Open

clipperhouse wants to merge 1 commit into master from unicode-segmenter-perf

Conversation


clipperhouse commented Jun 15, 2022

In this PR, as an experiment, I’ve swapped out the Unicode segmenter in Bleve with my uax29 package.
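
For context, here’s roughly how the replacement segmenter is consumed. This is a minimal sketch assuming the uax29/words API (NewSegmenter, Next, Bytes), not the exact Bleve integration:

package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := []byte("Hello, 世界. Nice dog! 👍🏼")

	// NewSegmenter iterates over UAX #29 word boundaries, returning
	// subslices of the input rather than allocating per token.
	segments := words.NewSegmenter(text)
	for segments.Next() {
		fmt.Printf("%q\n", segments.Bytes())
	}
	if err := segments.Err(); err != nil {
		fmt.Println("error:", err)
	}
}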

Benefits

  • ~2x throughput improvement
  • ~300x fewer allocations
Previous
BenchmarkTokenizeMultilingual-8   	     500	   2181603 ns/op	  49.95 MB/s	     16042 tokens	 1374693 B/op	     610 allocs/op

New
BenchmarkTokenizeMultilingual-8   	     932	   1087282 ns/op	 100.23 MB/s	     16042 tokens	 1310723 B/op	       2 allocs/op

(Run on my M2 MacBook. The benchmark’s shape is sketched just after this list.)
  • Updates Unicode 8 → Unicode 15
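
For reproducibility, the benchmark above has roughly the following shape. A hedged sketch: the sample-text path is an assumption, while b.SetBytes produces the MB/s column, the -benchmem flag produces the B/op and allocs/op columns, and the tokens column suggests b.ReportMetric:

package unicode_test

import (
	"os"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)

func BenchmarkTokenizeMultilingual(b *testing.B) {
	// Assumed location of the multilingual sample text.
	data, err := os.ReadFile("testdata/sample_multilingual.txt")
	if err != nil {
		b.Fatal(err)
	}
	tokenizer := unicode.NewUnicodeTokenizer()
	b.SetBytes(int64(len(data))) // enables the MB/s column
	b.ResetTimer()
	var tokens int
	for i := 0; i < b.N; i++ {
		tokens = len(tokenizer.Tokenize(data))
	}
	b.ReportMetric(float64(tokens), "tokens") // the custom tokens column
}

Invoked with something like: go test -bench=TokenizeMultilingual -benchmem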

Testing & compatibility

  • Both segmenters pass the official Unicode test suites
  • Bleve tests pass
  • Wrote a test to demonstrate identical results from both segmenters on multilingual sample text
  • Added adaptors to the underlying UAX29 package to determine Bleve token types (sketched after this list)
  • Added fuzzing to the UAX29 package (also sketched after this list)
  • Known differences:
    • The original segmenter splits runs of spaces into separate tokens; UAX29 concatenates a run of spaces into a single token. This should be irrelevant, since Bleve filters out whitespace in any case.
    • The original segmenter doesn’t handle emoji skin tone modifiers; the new one does. I attribute this to the newer Unicode version, not a bug.
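
To make the adaptor and fuzzing items above concrete, here’s a hedged sketch. The words.BleveIdeographic and words.BleveNumeric helper names and the Segmenter API are assumptions in the style of the uax29 package, not necessarily the code in this PR:

package uax29_test

import (
	"bytes"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/clipperhouse/uax29/words"
)

// tokenType maps a uax29 word segment to a Bleve token type.
// The words.Bleve* helpers are assumed adaptor names.
func tokenType(token []byte) analysis.TokenType {
	switch {
	case words.BleveIdeographic(token):
		return analysis.Ideographic
	case words.BleveNumeric(token):
		return analysis.Numeric
	default:
		return analysis.AlphaNumeric
	}
}

// FuzzSegmenter checks a cheap invariant: concatenating all segments
// must reconstruct the original input byte-for-byte.
func FuzzSegmenter(f *testing.F) {
	f.Add("Hello, 世界! 👍🏽 Nice dog.")
	f.Fuzz(func(t *testing.T, s string) {
		input := []byte(s)
		var rebuilt []byte
		seg := words.NewSegmenter(input)
		for seg.Next() {
			rebuilt = append(rebuilt, seg.Bytes()...)
		}
		if err := seg.Err(); err != nil {
			t.Fatal(err)
		}
		if !bytes.Equal(input, rebuilt) {
			t.Fatalf("segments did not reconstruct input: %q", s)
		}
	})
}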

@abhinavdangeti
Member

Thank you for submitting this, @clipperhouse; we'll review it soon.

@clipperhouse
Author

clipperhouse commented Jun 15, 2022

Thanks @abhinavdangeti. I did some more testing of differences between segmenters here: https://github.com/clipperhouse/segmenter-repro

@clipperhouse
Author

Updated benchmark, using new multilingual sample text (~110 KB). Note the allocation counts.

Previous
BenchmarkTokenizeEnglishText-4   	     285	   4167394 ns/op	  26.15 MB/s	     16042 tokens	 1348515 B/op	     610 allocs/op

New
BenchmarkTokenizeEnglishText-4   	     614	   1986500 ns/op	  54.86 MB/s	     16042 tokens	 1310730 B/op	       2 allocs/op

@clipperhouse
Author

@abhinavdangeti I think this is in good shape for your review. Happy to discuss.

@clipperhouse
Author

I see that the go:build syntax isn’t compatible with older Go versions; fix incoming.
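
For context on the fix: Go 1.17 introduced the //go:build line, while older toolchains only recognize // +build, so the usual remedy is to carry both forms in the file header. A generic sketch (the constraint and package name here are placeholders, not the exact ones in this PR):

//go:build go1.18
// +build go1.18

// Go 1.17+ reads the //go:build line; older releases fall back to
// the // +build line. gofmt keeps the two in sync, and the blank
// line before the package clause is required.

package words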

@clipperhouse
Author

@abhinavdangeti OK, try those workflows again? Thanks.

@clipperhouse
Author

I rebased a bit for a cleaner merge.

@clipperhouse
Author

(Rebased)

@clipperhouse
Author

@abhinavdangeti Friendly ping, let me know if you'd like to pursue this.

clipperhouse force-pushed the unicode-segmenter-perf branch 2 times, most recently from a8db884 to c8e26eb on July 23, 2022
clipperhouse force-pushed the unicode-segmenter-perf branch 2 times, most recently from d4de8e6 to 2782f9f on August 18, 2022
clipperhouse changed the title from “Unicode segmenter perf experiment” to “Unicode segmenter performance” on Sep 10, 2022
@clipperhouse
Author

Hiya @abhinavdangeti, the above pushes are just rebases; no updates here in a while. Would you like to re-run the checks?

@abhinavdangeti
Member

Thanks @clipperhouse, re-running the checks.

@clipperhouse
Author

Looks good, thanks. Ready for review at your convenience.

Replacing blevesearch/segment with uax29. ~2x throughput improvement. Allocations refactored, now ~O(1).

Add tests & multilingual sample text to ensure identical behavior. Known differences from previous segmenter:
- The original segmenter splits runs of spaces into separate tokens; uax29 concatenates runs into a single token.
- The original segmenter doesn’t handle emoji skin tone modifiers; the new one does, attributable to the Unicode version update.