A Common Lisp implementation for Llama inference operations
- About the Project
- Objectives
- Built With
- Getting Started
- Prerequisites
- Installation
- Usage
- Performance
- Roadmap
- Contributing
- License
- Contact
LLAMA.CL is a Common Lisp implementation of Llama inference operations, designed for rapid experimentation, research, and as a reference implementation for the Common Lisp community. This project enables researchers and developers to explore LLM techniques within the Common Lisp ecosystem, leveraging the language's capabilities for interactive development and integration with symbolic AI systems.
- Research-oriented interface: Provide a platform for experimenting with LLM inference techniques in an interactive development environment.
- Reference implementation: Serve as a canonical example of implementing modern neural network inference in Common Lisp.
- Integration capabilities: Enable seamless combination with other AI paradigms available in Common Lisp, including expert systems, graph algorithms, and constraint-based programming.
- Simplicity and clarity: Maintain readable, idiomatic Common Lisp code that prioritizes understanding over premature optimization.
LLAMA.CL requires:
- A Common Lisp implementation (currently SBCL-only as of version 0.0.5; pull requests for other implementations are welcome)
- Quicklisp or another ASDF-compatible system loader
- Pre-trained model weights in binary format
All dependencies are available through Quicklisp.
- Clone the repository to a location accessible to ASDF:

  cd ~/common-lisp
  git clone https://github.com/snunez1/llama.cl.git

- Clear the ASDF source registry so that it recognizes the new system:

  (asdf:clear-source-registry)
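If the repository is cloned somewhere outside ASDF's default search paths, one way to make it visible is sketched below; the path is a placeholder, not a value taken from this project:

;; Only needed when llama.cl lives outside ASDF's default search paths.
;; Replace the placeholder path with the actual clone location.
(pushnew #P"/path/to/llama.cl/" asdf:*central-registry* :test #'equal)
;; Confirm ASDF can now locate the system; returns NIL if it cannot.
(asdf:find-system :llama nil)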
Download pre-trained models from Karpathy's llama2.c repository. For initial experimentation, the TinyStories models are recommended:
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
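The initialization step below also expects a tokenizer file. tokenizer.bin is distributed with Karpathy's llama2.c repository and can typically be fetched from there (the URL assumes the current repository layout):

wget https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin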
Use Quicklisp to obtain required dependencies:

(ql:quickload :llama)

Initialize and generate text using the following workflow:
;; Load the system
(ql:quickload :llama)
;; Switch to the LLAMA package
(in-package :llama)
;; Initialize with model and tokenizer
(init #P"stories15M.bin" #P"tokenizer.bin" 32000)
;; Generate text
(generate *model* *tokenizer*)

The system supports various generation parameters, including temperature control, custom prompts, and different sampling strategies. Consult the source code for detailed parameter specifications.
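As a rough sketch, a call with explicit generation settings might look like the following; the keyword names are illustrative assumptions, so check the lambda list of generate in the source for the actual parameters:

;; Illustrative only: keyword names (:prompt, :temperature, :steps) are assumed,
;; not taken from the source. Check the definition of GENERATE before use.
(generate *model* *tokenizer*
          :prompt "Once upon a time"  ; seed text for generation
          :temperature 0.8            ; higher values produce more varied output
          :steps 256)                 ; maximum number of tokens to generate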
The implementation has been validated with models up to llama-2-7B. Larger models may require additional optimization or hardware acceleration.
On a reference system (Intel(R) Core(TM) Ultra 7 155H, 16 cores/22 threads, 32GB DDR4 RAM), the stories110M model achieves approximately 3 tokens/second with SBCL and pure Common Lisp, and 22 tokens/second with SBCL+LLA, using 9 threads for lparallel and 3 for MKL BLAS.
Performance characteristics vary with model size and hardware configuration. For the stories15M model, parallelization overhead may exceed the benefit on some systems. See benchmarks.md for benchmarking instructions. You will want to tune the number of lparallel and BLAS threads to find the sweet spot for your machine and model.
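As a sketch of how those thread counts might be set from the REPL (the counts shown, and the use of the MKL_NUM_THREADS environment variable, are assumptions to adjust for your setup):

;; Create an lparallel kernel with the desired number of worker threads.
(setf lparallel:*kernel* (lparallel:make-kernel 9))

;; Cap the MKL BLAS thread count; MKL reads this environment variable, so it
;; may need to be set before the BLAS library is first loaded.
(setf (uiop:getenv "MKL_NUM_THREADS") "3")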
- Extend compatibility to additional Common Lisp implementations
- Add support for quantized models
Contributions are welcome. Please submit pull requests for bug fixes, performance improvements, or additional Common Lisp implementation support. See the project's issue tracker for current priorities.
Distributed under the MIT License. See LICENSE for more information.
Project Link: https://github.com/snunez1/llama.cl