llama.cr

Crystal bindings for llama.cpp, a C/C++ implementation of LLaMA, Falcon, GPT-2, and other large language models.

The version in shard.yml corresponds to the compatible llama.cpp build number.

This project is under active development and may change rapidly.

Versioning Policy

This library version tracks the upstream llama.cpp build number.
The version in shard.yml uses the numeric build value (for example 8119).
Git tags use the v<build> format (for example v8119).
Each release targets exactly one upstream build.
Consumers should pin an exact shard version (for example 8119), not a version range.
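The naming scheme above can be sketched in shell. The helper names `shard_git_tag` and `upstream_release_tag` are hypothetical and not part of llama.cr; the `b<build>` form for upstream releases is the one used by the `LLAMA_BUILD` variable in the installation commands.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the naming scheme; not part of llama.cr.

# Git tag of this library for a given build number, e.g. 8119 -> v8119.
shard_git_tag() {
  printf 'v%s\n' "$1"
}

# Matching llama.cpp release name, e.g. 8119 -> b8119.
upstream_release_tag() {
  printf 'b%s\n' "$1"
}

build=8119
echo "shard.yml version: ${build}"
echo "llama.cr git tag:  $(shard_git_tag "$build")"
echo "llama.cpp release: $(upstream_release_tag "$build")"
```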

Features

Low-level bindings to the llama.cpp C API
High-level Crystal wrapper classes for easy usage
Memory management for C resources
Simple text generation interface
Advanced sampling methods (Min-P, Typical, Mirostat, etc.)
Batch processing for efficient token handling
KV cache management for optimized inference
State saving and loading

Installation

Prerequisites

You need the llama.cpp shared library (libllama) available on your system.

Download Prebuilt Binary (Recommended)

LLAMA_BUILD="b$(shards version)"
curl -L "https://github.com/ggml-org/llama.cpp/releases/download/${LLAMA_BUILD}/llama-${LLAMA_BUILD}-bin-ubuntu-x64.tar.gz" -o llama.tar.gz
tar -xzf llama.tar.gz
sudo cp llama-${LLAMA_BUILD}/*.so* /usr/local/lib/
sudo ldconfig

On macOS, replace ubuntu-x64 with macos-arm64 and *.so* with *.dylib, and skip sudo ldconfig, which does not exist there.
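The platform-specific asset names can be captured in a small helper. This is a minimal sketch: the functions `llama_asset_platform` and `llama_asset_name` are hypothetical, and the mapping assumes only the two platforms named above (x64 Linux and arm64 macOS).

```shell
#!/bin/sh
# Hypothetical helpers, not part of llama.cr.

# Map a kernel name (as printed by `uname -s`) to the release-asset suffix.
# Assumes x64 Linux and arm64 macOS, the only platforms named in this README.
llama_asset_platform() {
  case "$1" in
    Linux)  echo "ubuntu-x64" ;;
    Darwin) echo "macos-arm64" ;;
    *)      echo "unsupported platform: $1" >&2; return 1 ;;
  esac
}

# Build the asset file name for a release such as b8119.
llama_asset_name() {
  platform=$(llama_asset_platform "$2") || return 1
  echo "llama-$1-bin-${platform}.tar.gz"
}

# Print the asset name for the current machine (no download is performed).
llama_asset_name "b8119" "$(uname -s)" || true
```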

Alternative: Use local libraries with standard linker flags

If you prefer not to install system-wide, point Crystal and the runtime loader to your local llama.cpp library directory:

export LLAMA_LIB_DIR=/path/to/llama.cpp
LIBRARY_PATH="$LLAMA_LIB_DIR" crystal build examples/simple.cr --link-flags "-L$LLAMA_LIB_DIR -Wl,-rpath,$LLAMA_LIB_DIR -lllama -lggml"
LD_LIBRARY_PATH="$LLAMA_LIB_DIR" ./simple --model models/tiny_model.gguf

On macOS, replace LD_LIBRARY_PATH with DYLD_LIBRARY_PATH.

If backend auto-detection fails in newer llama.cpp builds, also set GGML_BACKEND_PATH to a backend shared library file (not a directory), for example:

export GGML_BACKEND_PATH="$LLAMA_LIB_DIR/libggml-cpu-haswell.so"
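Because the variable must name a file rather than a directory, a quick check before running can catch the most common mistake. The following is a minimal sketch; `check_backend_path` is a hypothetical helper, not something provided by llama.cr or llama.cpp.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the argument names an existing
# regular file (for example a libggml-cpu-*.so backend library).
check_backend_path() {
  if [ -z "$1" ]; then
    echo "GGML_BACKEND_PATH is not set" >&2
    return 1
  elif [ -d "$1" ]; then
    echo "GGML_BACKEND_PATH points to a directory, expected a file: $1" >&2
    return 1
  elif [ ! -f "$1" ]; then
    echo "GGML_BACKEND_PATH does not exist: $1" >&2
    return 1
  fi
}

# Warn, but do not abort, when the current value looks wrong.
check_backend_path "${GGML_BACKEND_PATH:-}" ||
  echo "Fix GGML_BACKEND_PATH before running." >&2
```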

For local development/tests, a full example is:

MODEL_PATH=/path/to/model.gguf \
LIBRARY_PATH="$LLAMA_LIB_DIR" \
LD_LIBRARY_PATH="$LLAMA_LIB_DIR" \
GGML_BACKEND_PATH="$LLAMA_LIB_DIR/libggml-cpu-haswell.so" \
crystal spec

Minimal examples:

# Linux
LIBRARY_PATH="$LLAMA_LIB_DIR" crystal build examples/simple.cr --link-flags "-L$LLAMA_LIB_DIR -Wl,-rpath,$LLAMA_LIB_DIR -lllama -lggml"
LD_LIBRARY_PATH="$LLAMA_LIB_DIR" ./simple --model models/tiny_model.gguf

# macOS
LIBRARY_PATH="$LLAMA_LIB_DIR" crystal build examples/simple.cr --link-flags "-L$LLAMA_LIB_DIR -Wl,-rpath,$LLAMA_LIB_DIR -lllama -lggml"
DYLD_LIBRARY_PATH="$LLAMA_LIB_DIR" ./simple --model models/tiny_model.gguf

Build from source (advanced users)

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
LLAMA_BUILD="b$(shards version ..)"
git checkout "${LLAMA_BUILD}"
mkdir build && cd build
cmake .. && cmake --build . --config Release
sudo cmake --install . && sudo ldconfig

Obtaining GGUF Model Files

You'll need a model file in GGUF format. For testing, smaller quantized models (1-3B parameters) with Q4_K_M quantization are recommended.
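A GGUF file begins with the four ASCII bytes `GGUF`, so a downloaded model can be sanity-checked before loading it. This is a minimal sketch: `is_gguf` is a hypothetical helper, and it inspects only the magic bytes, not the rest of the file.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file exists and starts with the
# 4-byte GGUF magic. NUL bytes are stripped so the shell comparison is clean.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1" | tr -d '\0')" = "GGUF" ]
}

# MODEL_PATH is the same variable used when running the specs.
model="${MODEL_PATH:-/path/to/model.gguf}"
if is_gguf "$model"; then
  echo "$model looks like a GGUF file"
else
  echo "$model is missing or is not a GGUF file"
fi
```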

Popular options:

Adding to Your Project

We strongly recommend pinning an exact version, because llama.cpp can introduce breaking changes between build numbers.

Add the dependency to your shard.yml:

dependencies:
  llama:
    github: kojix2/llama.cr
    version: 8119

Then run shards install.

Usage

Basic Text Generation

require "llama"

# Load a model
model = Llama::Model.new("/path/to/model.gguf")

# Create a context
context = model.context

# Generate text
response = context.generate("Once upon a time", max_tokens: 100, temperature: 0.8)
puts response

# Or use the convenience method
response = Llama.generate("/path/to/model.gguf", "Once upon a time")
puts response

Advanced Sampling

require "llama"

model = Llama::Model.new("/path/to/model.gguf")
context = model.context

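# Create a sampler chain; samplers are applied in the order they are added
chain = Llama::SamplerChain.new
# Keep only the 40 most likely tokens
chain.add(Llama::Sampler::TopK.new(40))
# Drop tokens below 5% of the top token's probability
# (the second argument corresponds to min_keep in llama.cpp's min-p sampler)
chain.add(Llama::Sampler::MinP.new(0.05, 1))
# Apply a temperature of 0.8 to the logits
chain.add(Llama::Sampler::Temp.new(0.8))
# Sample from the remaining distribution (42 is the RNG seed)
chain.add(Llama::Sampler::Dist.new(42))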

# Generate text with the custom sampler chain
result = context.generate_with_sampler("Write a short poem about AI:", chain, 150)
puts result

Chat Conversations

require "llama"
require "llama/chat"

model = Llama::Model.new("/path/to/model.gguf")
context = model.context

# Create a chat conversation
messages = [
  Llama::ChatMessage.new("system", "You are a helpful assistant."),
  Llama::ChatMessage.new("user", "Hello, who are you?")
]

# Generate a response
response = context.chat(messages)
puts "Assistant: #{response}"

# Continue the conversation
messages << Llama::ChatMessage.new("assistant", response)
messages << Llama::ChatMessage.new("user", "Tell me a joke")
response = context.chat(messages)
puts "Assistant: #{response}"

Embeddings

require "llama"

model = Llama::Model.new("/path/to/model.gguf")

# Create a context with embeddings enabled
context = model.context(embeddings: true)

# Get embeddings for text
text = "Hello, world!"
tokens = model.vocab.tokenize(text)
batch = Llama::Batch.get_one(tokens)
context.decode(batch)
embeddings = context.get_embeddings_seq(0)

puts "Embedding dimension: #{embeddings.size}"

Utilities

System Info

puts Llama.system_info

Tokenization Utility

model = Llama::Model.new("/path/to/model.gguf")
puts Llama.tokenize_and_format(model.vocab, "Hello, world!", ids_only: true)

Examples

The examples directory contains sample code demonstrating various features:

simple.cr - Basic text generation
chat.cr - Chat conversations with models
tokenize.cr - Tokenization and vocabulary features

API Documentation

See kojix2.github.io/llama.cr for full API docs.

Core Classes

Llama::Model - Represents a loaded LLaMA model
Llama::Context - Handles inference state for a model
Llama::Vocab - Provides access to the model's vocabulary
Llama::Batch - Manages batches of tokens for efficient processing
Llama::KvCache - Controls the key-value cache for optimized inference
Llama::State - Handles saving and loading model state
Llama::SamplerChain - Combines multiple sampling methods

Samplers

Llama::Sampler::TopK - Keeps only the top K most likely tokens
Llama::Sampler::TopP - Nucleus sampling (keeps tokens until cumulative probability exceeds P)
Llama::Sampler::Temp - Applies temperature to logits
Llama::Sampler::Dist - Samples from the final probability distribution
Llama::Sampler::MinP - Keeps tokens with probability >= P * max_probability
Llama::Sampler::Typical - Selects tokens based on their "typicality" (entropy)
Llama::Sampler::Mirostat - Dynamically adjusts sampling to maintain target entropy
Llama::Sampler::Penalties - Applies penalties to reduce repetition
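For reference, the filtering rules described above can be written out as formulas. These are the usual textbook definitions, given here as a sketch. The symbols are assumptions introduced for illustration: z_i is the logit of token i, p_i its probability, T the temperature, and K and P the sampler parameters. The actual behavior is whatever the underlying llama.cpp build implements and may differ in details such as tie-breaking or minimum-keep limits.

```latex
\begin{align*}
\text{Temp:}  \quad & p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \\
\text{TopK:}  \quad & \text{keep the } K \text{ tokens with the largest } p_i \\
\text{TopP:}  \quad & \text{keep the smallest set } S \text{ of most likely tokens with } \sum_{i \in S} p_i \ge P \\
\text{MinP:}  \quad & \text{keep } \{\, i : p_i \ge P \cdot \max_j p_j \,\}
\end{align*}
```

After any filtering step, the surviving probabilities are renormalized so that they sum to 1 before the final draw.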

Development

See DEVELOPMENT.md for development guidelines.

Most of the code in this project is AI-generated.

Do you need commit rights?

If you need commit rights to this repository, or want admin rights to take over the project, feel free to contact @kojix2.
Many OSS projects are abandoned because only the founder has commit rights to the original repository.

Contributing

Fork it (https://github.com/kojix2/llama.cr/fork)
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request

License

This project is available under the MIT License. See the LICENSE file for more info.