Releases: finalfusion/finalfusion-rust

0.18.0

10 Oct 10:18

This release updates all dependencies, upgrades to Rust 2021, and modernizes the code base up to current Clippy standards.

Thanks to @djc for doing all the work on this release!

0.17.2

12 Dec 19:26
  • Add WriteEmbeddings::write_embeddings_len. This method returns the serialized length of embeddings in finalfusion format, without performing any serialization.
  • Add WriteChunk::chunk_len. This method returns the serialized length of a finalfusion chunk, without performing any serialization.
  • Switch the license to Apache License 2.0 or MIT.

Add support for Floret embeddings

04 Dec 11:41
  • Add support for reading, writing, and using Floret embeddings.
  • Add a finalfusion chunk type for Floret-like vocabularies.
  • Add support for batched embedding lookups (embedding_batch and embedding_batch_into).
  • Improve error handling:
    • Mark wrapped errors using #[source] to get better chains of error messages.
    • Split Error::Io into Error::Read and Error::Write.
    • Rename some Error variants.
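
The error-handling changes above can be illustrated with a minimal, standard-library-only sketch. The enum below is hypothetical and is not the crate's actual definition; it only shows how separate Read and Write variants that expose the wrapped I/O error through `source` (roughly what thiserror's #[source] attribute derives) produce a chain of messages. The `error_chain` helper and the `desc` and `error` field names are assumptions made for illustration.

```rust
use std::error::Error as StdError;
use std::fmt;
use std::io;

/// Hypothetical error type for illustration; not the crate's definition.
#[derive(Debug)]
pub enum Error {
    /// Reading failed; `desc` names what was being read.
    Read { desc: String, error: io::Error },
    /// Writing failed; `desc` names what was being written.
    Write { desc: String, error: io::Error },
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Read { desc, .. } => write!(f, "cannot read {}", desc),
            Error::Write { desc, .. } => write!(f, "cannot write {}", desc),
        }
    }
}

impl StdError for Error {
    // Hand-written equivalent of marking the `error` field as the source.
    fn source(&self) -> Option<&(dyn StdError + 'static)> {
        match self {
            Error::Read { error, .. } | Error::Write { error, .. } => Some(error),
        }
    }
}

/// Hypothetical helper: collect every message in the chain, outermost first.
pub fn error_chain(err: &dyn StdError) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}
```

Because the outer message no longer has to repeat the inner one, a caller can print the chain one cause per line, and can tell read failures from write failures by matching on the variant.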

Subword vocabulary conversion

15 Aug 06:46
  • Add conversion from bucketed subword to explicit subword embeddings.
  • Hide WordSimilarityResult fields. Use the cosine_similarity and word methods instead.

Faster lookup of OPQ-quantized embeddings

09 Jun 08:06
  • Make lookups of unknown words in OPQ-quantized embedding matrices 2.6x faster (making lookups roughly 1.6x faster overall).
  • Add the Reconstruct trait as a counterpart to Quantize. This trait can be used to reconstruct quantized embedding matrices. Reconstructing a whole matrix through this trait is also much faster than reconstructing individual embeddings.
  • Add more I/O checks to ensure that the embedding matrix can actually be represented in the native usize.
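
As a rough illustration of the kind of check described in the last item, the sketch below validates matrix dimensions read from a file before they are used for allocation or indexing. The function name `matrix_len` and the choice of a `u64` row count and `u32` column count are assumptions for this example; the crate's actual checks may differ.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: the number of elements in a `rows` x `cols` matrix,
/// or `None` if a dimension or the product does not fit in the native `usize`.
pub fn matrix_len(rows: u64, cols: u32) -> Option<usize> {
    // Each conversion fails on platforms where `usize` is too narrow.
    let rows = usize::try_from(rows).ok()?;
    let cols = usize::try_from(cols).ok()?;
    // The product can still overflow even when both factors fit.
    rows.checked_mul(cols)
}
```

A reader built this way can return an error instead of silently truncating a dimension, which matters most on 32-bit targets.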

Improved error handling

09 Jun 08:01

Modernize and improve error handling

  • Merge the Error and ErrorKind enums.
  • Move the Error enum to the error module.
  • Derive trait implementations using the thiserror crate.
  • Make the Error enum non-exhaustive.
  • Replace the ChunkIdentifier::try_from method by an implementation of the TryFrom trait.
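
The move from an inherent try_from method to the standard trait can be sketched as follows. `ChunkKind`, its variants, the numeric identifiers, and `UnknownChunk` are all hypothetical stand-ins chosen for this example; they are not the crate's ChunkIdentifier or its on-disk values.

```rust
use std::convert::TryFrom;

/// Hypothetical chunk kinds; names and numeric values are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ChunkKind {
    Header,
    Vocab,
    Storage,
}

/// Hypothetical error carrying the identifier that was not recognized.
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownChunk(pub u32);

impl TryFrom<u32> for ChunkKind {
    type Error = UnknownChunk;

    fn try_from(id: u32) -> Result<Self, Self::Error> {
        match id {
            0 => Ok(ChunkKind::Header),
            1 => Ok(ChunkKind::Vocab),
            2 => Ok(ChunkKind::Storage),
            other => Err(UnknownChunk(other)),
        }
    }
}
```

Implementing the trait rather than an inherent method also provides `u32::try_into` for free and lets the type be used with generic code bounded on `TryFrom`.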

This release also feature-gates the memmap dependency (the memmap feature is enabled by default).

Explicit n-gram vocabularies and first API-stable release

26 Oct 07:14
  • Add ExplicitVocab, a subword vocabulary that stores n-grams explicitly.
  • Add the Embedding::into method. This method realizes an embedding into a user-provided array.
  • Support big-endian architectures.
  • Add WordIndex::word and WordIndex::subword methods. These will return an Option with the word index or subword indices, as applicable.
  • Expose the quantizer in (Mmap)QuantizedArray through the quantizer method.
  • Add benchmarks for array and quantized embeddings.
  • Split WordSimilarity into WordSimilarity and WordSimilarityBy; EmbeddingSimilarity into EmbeddingSimilarity and EmbeddingSimilarityBy.
  • Rename FinalfusionSubwordVocab to BucketSubwordVocab.
  • Expose fewer types through the prelude.
  • Hide the chunks module. E.g. chunks::storage becomes storage.
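
The WordIndex accessors mentioned above might look roughly like the sketch below. It assumes an enum with one variant for a known word and one for a list of subword indices; this mirrors the description in the notes but is written from scratch here, so the exact variant layout and signatures in the crate may differ.

```rust
/// Illustrative index type: either a single word index or a set of
/// subword indices for a word that is not in the vocabulary.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum WordIndex {
    Word(usize),
    Subword(Vec<usize>),
}

impl WordIndex {
    /// The word index, if this is a known word.
    pub fn word(&self) -> Option<usize> {
        match self {
            WordIndex::Word(idx) => Some(*idx),
            WordIndex::Subword(_) => None,
        }
    }

    /// The subword indices, if the word was resolved through subwords.
    pub fn subword(&self) -> Option<&[usize]> {
        match self {
            WordIndex::Word(_) => None,
            WordIndex::Subword(indices) => Some(indices.as_slice()),
        }
    }
}
```

Returning an Option from each accessor lets callers handle just the case they care about with `if let` instead of matching on both variants.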

Reductive 0.3

16 Sep 07:37

This small release updates the reductive dependency to 0.3, which contains a crucial bug fix for training product quantizers in multiple attempts. However, reductive 0.3 also requires rand 0.7, which changes the API. We therefore bump the leading version number from 0.9 to 0.10.

Memory-mapped quantized arrays

04 Sep 09:13
0.9.0
  • Add the MmapQuantizedArray storage type.
  • Rename Vocab::len to Vocab::words_len.
  • Add Vocab::vocab_len to get the vocabulary size including subword indices.

Token robustness

13 Aug 08:44
  • Improve reading of embeddings that contain unicode whitespace in tokens.
  • Add lossy variants of the text/word2vec/fasttext reading methods. The lossy variants read tokens with invalid UTF-8 byte sequences.
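
The difference between strict and lossy token reading can be sketched with the standard library alone. The two function names below are hypothetical, and the sketch assumes that lossy reading replaces invalid byte sequences with U+FFFD, as `String::from_utf8_lossy` does; the notes do not spell out the crate's exact replacement behavior.

```rust
/// Hypothetical strict reader: fails on any invalid UTF-8 sequence.
pub fn parse_token(bytes: &[u8]) -> Result<String, std::str::Utf8Error> {
    std::str::from_utf8(bytes).map(str::to_owned)
}

/// Hypothetical lossy reader: keeps the token, replacing each invalid
/// sequence with the Unicode replacement character U+FFFD.
pub fn parse_token_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

With the strict variant a single malformed token aborts reading the whole file, whereas the lossy variant lets the remaining embeddings load at the cost of a slightly altered token.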