https://github.com/ggerganov/llama.cpp
![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)
Roadmap / Project status / Manifesto / ggml
Inference of Meta's LLaMA model (and others) in pure C/C++
### Recent API changes

- `llama_token_to_piece` can now optionally render special tokens #6807 (sketch below)
- State and session file functions reorganized under `llama_state_*` #6341 (sketch below)
- Add `llama_synchronize()` + `llama_context_params.n_ubatch` #6017 (sketch below)
- `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) #5328 (sketch below)
- `struct llama_context_params` #5849
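
A minimal sketch of the special-token rendering flag from #6807. It assumes the five-argument signature `llama_token_to_piece(model, token, buf, length, special)` of that period; later revisions of `llama.h` may differ.

```c
#include <stdio.h>

#include "llama.h"

// Print the text of a single token, rendering special tokens (BOS/EOS, chat
// markers, ...) instead of skipping them. Assumes the 5-argument signature
// introduced around #6807.
static void print_token(const struct llama_model * model, llama_token token) {
    char buf[256];
    const int32_t n = llama_token_to_piece(model, token, buf, sizeof(buf), /*special =*/ true);
    if (n >= 0) {
        printf("%.*s", (int) n, buf);
    }
}
```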
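A sketch of capturing and restoring context state under the reorganized `llama_state_*` names from #6341; it assumes `llama_state_get_size`, `llama_state_get_data`, and `llama_state_set_data` as the renamed equivalents of the earlier state-copy calls, with error handling kept minimal.

```c
#include <stdint.h>
#include <stdlib.h>

#include "llama.h"

// Capture the full context state (KV cache, RNG, etc.) into a heap buffer.
// Assumes the post-#6341 names llama_state_get_size / llama_state_get_data.
static uint8_t * state_snapshot(struct llama_context * ctx, size_t * size_out) {
    const size_t size = llama_state_get_size(ctx);
    uint8_t * buf = malloc(size);
    if (buf != NULL) {
        llama_state_get_data(ctx, buf);
        *size_out = size;
    }
    return buf;
}

// Restore a previously captured state into the same (or a compatible) context.
static void state_restore(struct llama_context * ctx, const uint8_t * buf) {
    llama_state_set_data(ctx, buf);
}
```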
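The batching-related items (#6017 and #5328) can be combined into one sketch: it sets the new `n_ubatch` field, queries `llama_n_seq_max()`, and checks the `bool` result of `llama_kv_cache_seq_rm()`. The model path is a placeholder, the decode loop is elided, and the setup calls (`llama_backend_init`, `llama_load_model_from_file`, `llama_new_context_with_model`) reflect the API of that period, so treat this as an assumption-laden outline rather than a definitive example.

```c
#include <stdio.h>

#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path
    if (model == NULL) {
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_batch  = 512; // logical batch size submitted to llama_decode()
    cparams.n_ubatch = 256; // physical micro-batch size, added in #6017

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // seq_id values used in llama_batch must stay below this limit (#5328)
    printf("max sequences: %u\n", (unsigned) llama_n_seq_max(ctx));

    // ... submit batches with llama_decode() here ...

    // wait for any in-flight computation before reading results (#6017)
    llama_synchronize(ctx);

    // removing a sequence from the KV cache now reports success/failure (#5328)
    if (!llama_kv_cache_seq_rm(ctx, 0, -1, -1)) {
        fprintf(stderr, "failed to clear sequence 0\n");
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```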