
Releases: ngxson/wllama

1.16.3

07 Oct 09:49
ffcd98a

What's Changed

  • sync to latest upstream source code by @ngxson in #125

Thanks to a small refactoring in llama.cpp, the binary size is now reduced from 1.78 MB to 1.52 MB

Full Changelog: 1.16.2...1.16.3

1.16.2

23 Sep 16:15
d9b849e

What's Changed

  • decode/encode : do not fail on empty batch by @ngxson in #118
  • Update to latest llama.cpp source code by @ngxson in #119

Full Changelog: 1.16.1...1.16.2

1.16.1

06 Sep 14:29
7beefeb

What's Changed

Full Changelog: 1.16.0...1.16.1

1.16.0

19 Aug 10:04

SmolLM-360m has been added as a model in the main example. Try it now --> https://huggingface.co/spaces/ngxson/wllama

Special thanks to the @huggingface team for providing such a powerful model at such a small size!


What's Changed

  • ability to use custom cacheManager by @ngxson in #109

Full Changelog: 1.15.0...1.16.0

1.15.0

03 Aug 20:34
667dd91

New features

downloadModel()

Downloads a model into the cache without loading it. The use case is to let an application provide a "model manager" screen that can:

  • Download model via downloadModel()
  • List all downloaded models using CacheManager.list()
  • Delete a downloaded model using CacheManager.delete()
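A minimal sketch of that model-manager flow. The `WllamaLike` interface below is a local stand-in mirroring the API names mentioned in these notes (the real library's types and the exact `cacheManager` accessor may differ), and the model URL in the test is illustrative:

```typescript
// Local stand-ins for the API surface described above, not the library's types.
interface CacheEntryLike {
  name: string;
  size: number;
}

interface WllamaLike {
  downloadModel(url: string): Promise<void>;
  cacheManager: {
    list(): Promise<CacheEntryLike[]>;
    delete(name: string): Promise<void>;
  };
}

// Download a model into the cache (without loading it), then return the
// names of all cached models — the data a "model manager" screen would show.
async function manageModels(wllama: WllamaLike, url: string): Promise<string[]> {
  await wllama.downloadModel(url); // step 1: download to cache only
  const entries = await wllama.cacheManager.list(); // step 2: list cache
  return entries.map((e) => e.name);
}
```

Deleting an entry would then be a matter of calling `cacheManager.delete()` with one of the listed names.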

KV cache reuse in createCompletion

When calling createCompletion, you can pass useCache: true as an option. This reuses the KV cache from the previous createCompletion call; it is equivalent to the cache_prompt option on the llama.cpp server.

wllama.createCompletion(input, {
  useCache: true,
  ...
});

For example:

  • On the first call, you have 2 messages: user: hello, assistant: hi
  • On the second call, you add one message: user: hello, assistant: hi, user: who are you?

Then, only the added message user: who are you? will need to be evaluated.
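The saving can be illustrated with a small standalone helper (not part of the wllama API) that computes which part of the new prompt actually needs evaluation, namely everything after the longest common prefix with the previous prompt:

```typescript
// Standalone illustration of KV cache reuse: only the tokens after the
// longest common prefix of the previous and next prompt must be evaluated;
// the prefix is already present in the KV cache.
function tokensToEvaluate(prevPrompt: string[], nextPrompt: string[]): string[] {
  let shared = 0;
  while (
    shared < prevPrompt.length &&
    shared < nextPrompt.length &&
    prevPrompt[shared] === nextPrompt[shared]
  ) {
    shared++;
  }
  return nextPrompt.slice(shared); // work the model still has to do
}
```

With the two-call example above, the shared prefix covers the first two messages, so only `user: who are you?` remains to be processed.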

What's Changed

Full Changelog: 1.14.2...1.15.0

1.14.2

28 Jul 11:39
d15748b

Update to latest upstream llama.cpp source code:

  • Fix support for Llama 3.1, Phi-3 and SmolLM

Full Changelog: 1.14.0...1.14.2

1.14.0

10 Jul 11:51
94ebb81

What's Changed

  • save ETag metadata, add allowOffline option in #90
  • Added experimental support for encoder-decoder architecture #91
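The ETag + allowOffline combination can be sketched as a standalone cache-revalidation routine (a conceptual illustration, not wllama's internal code; the `fetchFn` callback stands in for a real HTTP request):

```typescript
// Conceptual sketch of ETag-based revalidation with an offline fallback:
// send the stored ETag with each request; on 304 reuse the cached bytes,
// and when the network is unreachable, fall back to the cache if allowed.
interface CachedModel {
  etag: string;
  data: Uint8Array;
}

async function fetchWithEtag(
  url: string,
  cache: Map<string, CachedModel>,
  allowOffline: boolean,
  fetchFn: (url: string, etag?: string) => Promise<{ status: number; etag: string; data: Uint8Array }>,
): Promise<Uint8Array> {
  const cached = cache.get(url);
  try {
    const res = await fetchFn(url, cached?.etag);
    if (res.status === 304 && cached) return cached.data; // unchanged upstream
    cache.set(url, { etag: res.etag, data: res.data }); // store new version
    return res.data;
  } catch (err) {
    if (allowOffline && cached) return cached.data; // offline → serve cache
    throw err; // no cache to fall back to
  }
}
```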

Full Changelog: 1.13.0...1.14.0

1.13.0

03 Jul 15:13
44a4de5

What's Changed

New Contributors

Full Changelog: 1.12.1...1.13.0

1.12.1

27 Jun 20:49
b847495

What's Changed

  • Sync with latest upstream source code + adapt to project structure change by @ngxson in #77

Full Changelog: 1.12.0...1.12.1

1.12.0

24 Jun 15:29
896c160

Important

In prior versions, if you initialized wllama with embeddings: true, you were still able to generate completions.

From v1.12.0, if you start wllama with embeddings: true, createCompletion throws an error. You must call wllama.setOptions({ embeddings: false }) to turn off embeddings first.

More details: this behavior comes from ggerganov/llama.cpp#7477, which allows models like GritLM to be used for both embeddings and text generation.
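A minimal sketch of switching between the two modes. `WllamaLike` is a local stand-in for the API names used in these notes (the real method signatures, and option names such as `createEmbedding`, may differ):

```typescript
// Local stand-in for the API surface described above, not the library's types.
interface WllamaLike {
  setOptions(opts: { embeddings: boolean }): void;
  createEmbedding(text: string): Promise<number[]>;
  createCompletion(prompt: string, opts?: object): Promise<string>;
}

// Assumes the instance was started with embeddings: true, so embeddings work
// but completions throw until embeddings mode is switched off.
async function embedThenComplete(wllama: WllamaLike) {
  const vec = await wllama.createEmbedding('hello'); // allowed in embeddings mode
  wllama.setOptions({ embeddings: false }); // required since v1.12.0
  const out = await wllama.createCompletion('hello'); // now allowed
  return { vec, out };
}
```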

What's Changed

Full Changelog: 1.11.0...1.12.0