llama.cpp

GGUF
Filename extension	.gguf
Magic number	0x47 0x47 0x55 0x46
Developed by	Georgi Gerganov and community
Initial release	August 22, 2023
Latest release	v3
Type of format	Machine-learning tensors

llama.cpp
Original author(s)	Georgi Gerganov
Developer(s)	Georgi Gerganov and community
Initial release	March 10, 2023
Repository	github.com/ggerganov/llama.cpp
Written in	C++, C
Type	Library for large language models
License	MIT License

llama.cpp is an open-source software library that performs inference on various large language models such as Llama.^[3] It is co-developed alongside the GGML project, a general-purpose tensor library.^[4]

Command-line tools are included with the library,^[5] alongside a server with a simple web interface.^[6]^[7]

Background

Towards the end of September 2022, Georgi Gerganov started work on the GGML library, a C library implementing tensor algebra. Gerganov developed the library with strict memory management and multi-threading in mind. The creation of GGML was inspired by Fabrice Bellard's work on LibNC.^[8]

Before llama.cpp, Gerganov worked on a similar library called whisper.cpp, which implemented Whisper, a speech-to-text model by OpenAI.^[9]

Gerganov has a background in medical physics and was part of the Faculty of Physics at Sofia University.^[10] In 2006, he won a silver medal in the International Physics Olympiad.^[11]^[12] In 2008, he won a programming competition organized by the Bulgarian Association of Software Companies, PC Magazine and Musala Soft, a Bulgarian software services company.^[13]

Development

Georgi Gerganov began developing llama.cpp in March 2023 as an implementation of the Llama inference code in pure C/C++ with no dependencies. This improved performance on computers without a GPU or other dedicated hardware, which was a goal of the project.^[3]^[14]^[15] llama.cpp gained traction with users who lacked specialized hardware, as it could run on just a CPU, including on Android devices.^[14]^[16]^[17] Although the project was initially designed for CPUs, GPU inference support was later added.^[18] As of November 2024, it had more than 67,000 stars on GitHub.^[19]

In March 2024, Justine Tunney introduced new optimized matrix multiplication kernels for x86 and ARM CPUs, improving prompt evaluation performance for FP16 and 8-bit quantized data types.^[20] These improvements were committed upstream to llama.cpp.^[20] Tunney also created llamafile, a tool that bundles a model and llama.cpp into a single file that runs on multiple operating systems. It relies on Cosmopolitan Libc, a library also created by Tunney that makes C/C++ programs more portable across operating systems.^[20]

Architecture

llama.cpp supports multiple hardware targets, including x86, ARM, CUDA, Metal, Vulkan and SYCL.^[21]^[22]^[23]^[24] These back-ends make up the GGML tensor library, which is used by llama.cpp's model-specific front-end code.^[25] llama.cpp supports ahead-of-time model quantization, as opposed to on-the-fly quantization.^[26] It makes use of several CPU extensions for optimization: AVX, AVX2 and AVX-512 on x86-64, and Neon on ARM. Apple silicon is an important target for the project.^[19]^[27] llama.cpp also supports grammar-based output formatting, such as JSON,^[15] and speculative decoding.^[7]

GGUF file format

The GGUF (GGML Universal File)^[30] file format is a binary format that stores both tensors and metadata in a single file, and is designed for fast saving and loading of model data.^[31] It was introduced in August 2023 by the llama.cpp project to better maintain backwards compatibility as support was added for other model architectures.^[18]^[32] It superseded previous formats used by the project, such as GGML.

GGUF files are typically created by converting models developed with a different machine learning library such as PyTorch.^[31]
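
As a rough illustration of the single-file binary layout, the following Python sketch checks the magic number given in the infobox (0x47 0x47 0x55 0x46, the ASCII bytes "GGUF") and reads the counters that follow it. The field layout after the magic (a 32-bit version, then 64-bit tensor and metadata counts, all little-endian) is an assumption based on the version 3 layout and is not stated in this article; the function name `read_gguf_header` is hypothetical. It should be checked against the format documentation before being relied on.

```python
import struct

# Magic number from the infobox: the ASCII bytes "GGUF".
GGUF_MAGIC = bytes([0x47, 0x47, 0x55, 0x46])


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size prefix of a GGUF file (hypothetical helper)."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic number")
    # Assumed layout after the magic: uint32 version, uint64 tensor count,
    # uint64 metadata key/value count, all little-endian.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Usage with a synthetic 24-byte header rather than a real model file.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(sample))
```

With a real file, the first 24 bytes could be read with `open(path, "rb").read(24)` and passed to the same function.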

Design

The format focuses on quantization, the act of reducing the precision of the model weights. This can reduce memory usage and increase speed at the expense of model accuracy.^[33]^[32]
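
The trade-off between precision and size can be illustrated with a minimal sketch of symmetric block quantization. This is a generic textbook scheme, not the actual quantization types used by GGML or GGUF; the function names and the sample weights are hypothetical.

```python
def quantize_block(values, bits=8):
    """Map a block of floats to signed integers sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0  # avoid dividing by zero
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers and scale."""
    return [scale * q for q in quants]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights
for bits in (8, 4):
    scale, quants = quantize_block(weights, bits)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: quants={quants} worst error={worst:.4f}")
```

Fewer bits give a coarser grid of representable values, so the worst-case rounding error (at most half of the scale factor) grows as the bit width shrinks.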

GGUF supports 2-bit to 8-bit quantized integer types;^[34] common floating-point data formats such as float32, float16, and bfloat16; and 1.56-bit quantization.^[5]

A GGUF file contains the information needed to run a GPT-like language model, such as the tokenizer vocabulary, context length, tensor info and other attributes.^[35]
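
To give a sense of what these bit widths mean for storage, the sketch below estimates the size of the weights alone as parameters × bits ÷ 8. The 7-billion-parameter count is a hypothetical example, and the estimate ignores per-block scale factors and metadata, which make real quantized files somewhat larger.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


n_params = 7_000_000_000  # hypothetical model size
for bits in (32, 16, 8, 4, 2):
    gigabytes = weight_bytes(n_params, bits) / 1e9
    print(f"{bits:>2}-bit: {gigabytes:.2f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the estimate from 14 GB to 3.5 GB.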

References

[1] "Initial release · ggerganov/llama.cpp@26c0846". GitHub. Retrieved 15 May 2024.
[2] "llama.cpp/LICENSE at master · ggerganov/llama.cpp". GitHub.
[3] Connatser, Matthew. "How this open source LLM chatbot runner hit the gas on x86, Arm CPUs". theregister.com. Retrieved 15 April 2024.
[4] Gerganov, Georgi (17 May 2024). "ggerganov/ggml". GitHub.
[5] Mann, Tobias (14 Jul 2024). "Honey, I shrunk the LLM! A beginner's guide to quantization – and testing it". theregister.com.
[6] Alden, Daroc. "Portable LLMs with llamafile [LWN.net]". lwn.net. Retrieved 30 July 2024.
[7] Mann, Tobias (15 December 2024). "Intro to speculative decoding: Cheat codes for faster LLMs". theregister.com.
[8] "Bringing Whisper and LLaMA to the masses with Georgi Gerganov (Changelog Interviews #532)". Changelog. 22 March 2023. Retrieved 28 July 2024.
[9] "ggerganov/whisper.cpp". GitHub.
[10] Mitev, Krasimir; Gerganov, Georgi; Kirov, Assen S.; Schmidtlein, C. Ross; Madzhunkov, Yordan; Kawrakow, Iwan (21 June 2012). "Influence of photon energy cuts on PET Monte Carlo simulation results: Influence of photon energy cuts". Medical Physics. 39 (7Part1): 4175–4186. doi:10.1118/1.4725168.
[11] Tichy-Rács, Ádám (2015). LIST OF WINNERS IN 1ST – 45TH INTERNATIONAL PHYSICS OLYMPIADS. BME OMIKK. p. 246. ISBN 978-963-593-500-0.
[12] Захариев, Боян (July 21, 2006). "България с 11 медала от международни олимпиади" [Bulgaria with 11 medals from international olympiads]. sega bg.
[13] Станева, Ирина (January 30, 2008). "Студенти от СУ спечелиха конкурса по програмиране на БАСКОМ" [Students from Sofia University won the BASSCOM programming competition]. karieri.
[14] Edwards, Benj (13 March 2023). "You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi". arstechnica.com. Retrieved 15 April 2024.
[15] Wiest, Isabella Catharina; Ferber, Dyke; Zhu, Jiefu; van Treeck, Marko; Meyer, Sonja K.; Juglan, Radhika; Carrero, Zunamys I.; Paech, Daniel; Kleesiek, Jens; Ebert, Matthias P.; Truhn, Daniel; Kather, Jakob Nikolas (2024). "Privacy-preserving large language models for structured medical information retrieval". npj Digital Medicine. 7 (257). doi:10.1038/s41746-024-01233-2. PMC 11415382.
[16] Hood, Stephen. "llamafile: bringing LLMs to the people, and to your own computer". Mozilla Innovations. Retrieved 28 July 2024.
[17] "Democratizing AI with open-source language models". lwn.net. Retrieved 28 July 2024.
[18] Rajput, Saurabhsingh; Sharma, Tushar (4 June 2024). "Benchmarking Emerging Deep Learning Quantization Methods for Energy Efficiency". 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C). pp. 238–242. doi:10.1109/ICSA-C63560.2024.00049. ISBN 979-8-3503-6625-9.
[19] "ggerganov/llama.cpp". GitHub.
[20] Connatser, Matthew. "Llamafile LLM driver project boosts performance on CPU cores". theregister.com. Retrieved 10 May 2024.
[21] Gerganov, Georgi; Nguyen, Xuan Son; Slaren (August 13, 2024). "Introduction to ggml". Hugging Face.
[22] Kluska, Piotr; Castelló, Adrián; Scheidegger, Florian; I. Malossi, A. Cristiano; Quintana-Ortí, Enrique (June 2024). "QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers" (PDF). Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[23] Fritts, Harold (30 October 2024). "AMD and LM Studio: Making AI Accessible and Fast on x86 Laptops". StorageReview.com.
[24] Jianyu, Zhang; Hengyu, Meng; Ying, Hu; Yu, Luo; Xiaoping, Duan; Corporation, Majumder Abhilash Intel (July 2024). "Run LLMs on Intel GPUs Using llama.cpp". The Parallel Universe. No. 57. Intel. pp. 34–37.
[25] Pounder, Les (25 March 2023). "How To Create Your Own AI Chatbot Server With Raspberry Pi 4". tomshardware.com. Retrieved 16 April 2024.
[26] Walkowiak, Bartosz; Walkowiak, Tomasz (2024). "Implementation of language models within an infrastructure designed for Natural Language Processing" (PDF). International Journal of Electronics and Telecommunications. 70 (1): 153–159. doi:10.24425/ijet.2024.149525. Retrieved 8 May 2024.
[27] Larabel, Michael. "Llamafile 0.7 Brings AVX-512 Support: 10x Faster Prompt Eval Times For AMD Zen 4". phoronix.com.
[28] "GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp". GitHub.
[29] "ggml/docs/gguf.md at master · ggerganov/ggml". GitHub.
[30] "ggerganov/llama.cpp/gguf-py/README.md". GitHub. Retrieved 10 November 2024.
[31] "GGUF". huggingface.co. Retrieved 9 May 2024.
[32] Mucci, Tim (3 July 2024). "GGUF versus GGML". ibm.com. Retrieved 26 July 2024.
[33] Labonne, Maxime (29 November 2023). "Quantize Llama models with GGUF and llama.cpp". Medium. Towards Data Science. Retrieved 9 May 2024.
[34] Cabezas, Darío; Fonseca-Delgado, Rigoberto; Reyes-Chacón, Iván; Vizcaino-Imacaña, Paulina; Morocho-Cayamcela, Manuel (2024). "Integrating a LLaMa-based Chatbot with Augmented Retrieval Generation as a Complementary Educational Tool for High School and College Students". Proceedings of the 19th International Conference on Software Technologies. pp. 395–402. doi:10.5220/0012763000003753. ISBN 978-989-758-706-1.
[35] Dong, Bo; Lin, Jun; Yu, Zhentao; Xu, Zhenzhong; Luo, Yu; Chang, Hanwen; Shen, Haihao (July 2024). "Accelerating GGUF Models with Transformers". The Parallel Universe. No. 57. Intel. pp. 28–33.
