
Unicode segmenter performance #1703

Open

clipperhouse wants to merge 1 commit into master from unicode-segmenter-perf

Conversation


clipperhouse commented Jun 15, 2022

In this PR, as an experiment, I’ve swapped out the Unicode segmenter in Bleve with my uax29 package.
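
For context, here’s roughly how the replacement segmenter is consumed. This is a minimal sketch assuming the uax29/words API (NewSegmenter, Next, Bytes), not the exact Bleve integration:

package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := []byte("Hello, 世界. Nice dog! 👍🏼")

	// NewSegmenter iterates over UAX #29 word boundaries, returning
	// subslices of the input rather than allocating per token.
	segments := words.NewSegmenter(text)
	for segments.Next() {
		fmt.Printf("%q\n", segments.Bytes())
	}
	if err := segments.Err(); err != nil {
		fmt.Println("error:", err)
	}
}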

Benefits

  • ~2x throughput improvement
  • ~300x fewer allocations
Previous
BenchmarkTokenizeMultilingual-8   	     500	   2181603 ns/op	  49.95 MB/s	     16042 tokens	 1374693 B/op	     610 allocs/op

New
BenchmarkTokenizeMultilingual-8   	     932	   1087282 ns/op	 100.23 MB/s	     16042 tokens	 1310723 B/op	       2 allocs/op

(Run on my M2 MacBook. The benchmark’s shape is sketched just after this list.)
  • Updates Unicode 8 → Unicode 15
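
For reproducibility, the benchmark above has roughly the following shape. A hedged sketch: the sample-text path is an assumption, while b.SetBytes produces the MB/s column, the -benchmem flag produces the B/op and allocs/op columns, and the tokens column suggests b.ReportMetric:

package unicode_test

import (
	"os"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
)

func BenchmarkTokenizeMultilingual(b *testing.B) {
	// Assumed location of the multilingual sample text.
	data, err := os.ReadFile("testdata/sample_multilingual.txt")
	if err != nil {
		b.Fatal(err)
	}
	tokenizer := unicode.NewUnicodeTokenizer()
	b.SetBytes(int64(len(data))) // enables the MB/s column
	b.ResetTimer()
	var tokens int
	for i := 0; i < b.N; i++ {
		tokens = len(tokenizer.Tokenize(data))
	}
	b.ReportMetric(float64(tokens), "tokens") // the custom tokens column
}

Invoked with something like: go test -bench=TokenizeMultilingual -benchmem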

Testing & compatibility

  • Both segmenters pass the official Unicode test suites
  • Bleve tests pass
  • Wrote a test to demonstrate identical results from both segmenters on multilingual sample text
  • Added adaptors to the underlying UAX29 package to determine Bleve token types (sketched after this list)
  • Added fuzzing to the UAX29 package (also sketched after this list)
  • Known differences:
    • The original segmenter splits runs of spaces into separate tokens; UAX29 concatenates a run of spaces into a single token. This should be irrelevant, since Bleve filters out whitespace in any case.
    • The original segmenter doesn’t handle emoji skin tone modifiers; the new one does. I attribute this to the newer Unicode version, not a bug.
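
To make the adaptor and fuzzing items above concrete, here’s a hedged sketch. The words.BleveIdeographic and words.BleveNumeric helper names and the Segmenter API are assumptions in the style of the uax29 package, not necessarily the code in this PR:

package uax29_test

import (
	"bytes"
	"testing"

	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/clipperhouse/uax29/words"
)

// tokenType maps a uax29 word segment to a Bleve token type.
// The words.Bleve* helpers are assumed adaptor names.
func tokenType(token []byte) analysis.TokenType {
	switch {
	case words.BleveIdeographic(token):
		return analysis.Ideographic
	case words.BleveNumeric(token):
		return analysis.Numeric
	default:
		return analysis.AlphaNumeric
	}
}

// FuzzSegmenter checks a cheap invariant: concatenating all segments
// must reconstruct the original input byte-for-byte.
func FuzzSegmenter(f *testing.F) {
	f.Add("Hello, 世界! 👍🏽 Nice dog.")
	f.Fuzz(func(t *testing.T, s string) {
		input := []byte(s)
		var rebuilt []byte
		seg := words.NewSegmenter(input)
		for seg.Next() {
			rebuilt = append(rebuilt, seg.Bytes()...)
		}
		if err := seg.Err(); err != nil {
			t.Fatal(err)
		}
		if !bytes.Equal(input, rebuilt) {
			t.Fatalf("segments did not reconstruct input: %q", s)
		}
	})
}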

@abhinavdangeti
Member

Thank you for submitting this, @clipperhouse; we'll review it soon.

@clipperhouse
Author

clipperhouse commented Jun 15, 2022

Thanks @abhinavdangeti. I did some more testing of differences between segmenters here: https://github.com/clipperhouse/segmenter-repro

@clipperhouse
Author

Updated benchmark, using new multilingual sample text (~110 KB). Note the allocation counts.

Previous
BenchmarkTokenizeEnglishText-4   	     285	   4167394 ns/op	  26.15 MB/s	     16042 tokens	 1348515 B/op	     610 allocs/op

New
BenchmarkTokenizeEnglishText-4   	     614	   1986500 ns/op	  54.86 MB/s	     16042 tokens	 1310730 B/op	       2 allocs/op

@clipperhouse
Author

@abhinavdangeti I think this is in good shape for your review. Happy to discuss.

@clipperhouse
Author

I see that the go:build syntax isn’t compatible with older Go versions; fix incoming.
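
For context on the fix: Go 1.17 introduced the //go:build line, while older toolchains only recognize // +build, so the usual remedy is to carry both forms in the file header. A generic sketch (the constraint and package name here are placeholders, not the exact ones in this PR):

//go:build go1.18
// +build go1.18

// Go 1.17+ reads the //go:build line; older releases fall back to
// the // +build line. gofmt keeps the two in sync, and the blank
// line before the package clause is required.

package words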

@clipperhouse
Author

@abhinavdangeti OK, try those workflows again? Thanks.

@clipperhouse
Author

I rebased a bit for a cleaner merge.

@clipperhouse
Author

(Rebased)

@clipperhouse
Author

@abhinavdangeti Friendly ping, let me know if you'd like to pursue this.

clipperhouse force-pushed the unicode-segmenter-perf branch 2 times, most recently from a8db884 to c8e26eb on July 23, 2022
clipperhouse force-pushed the unicode-segmenter-perf branch 2 times, most recently from d4de8e6 to 2782f9f on August 18, 2022
clipperhouse changed the title from “Unicode segmenter perf experiment” to “Unicode segmenter performance” on Sep 10, 2022
@clipperhouse
Author

Hiya @abhinavdangeti, the above pushes are just rebases; no updates here in a while. Would you like to re-run the checks?

@abhinavdangeti
Member

Thanks @clipperhouse, re-running the checks.

@clipperhouse
Author

Looks good, thanks. Ready for review at your convenience.

Replacing blevesearch/segment with uax29. ~2x throughput improvement. Allocations refactored, now ~O(1).

Add tests & multilingual sample text to ensure identical behavior. Known differences from previous segmenter:
- The original segmenter splits runs of spaces into separate tokens; uax29 concatenates runs into a single token.
- The original segmenter doesn’t handle emoji skin tone modifiers; the new one does, attributable to the Unicode version update.