Technical · February 8, 2026 · 7 min read

Under the Hood: How llama.cpp Powers MyLLM

A technical look at how MyLLM uses llama.cpp, JNI, and GGUF quantization to run billion-parameter models on mobile hardware.

The Engine Behind MyLLM

At the heart of MyLLM is llama.cpp — Georgi Gerganov's incredible C++ library that makes LLM inference possible on consumer hardware. Let's dive into how we use it to run AI models on your Android phone.

What is llama.cpp?

llama.cpp is a pure C/C++ implementation of LLM inference. It was originally built to run Meta's LLaMA model on MacBooks, but has since grown into a universal inference engine supporting dozens of model architectures.

Key advantages:

  • Pure C++ — No Python, no heavy frameworks
  • CPU-optimized — Uses SIMD instructions (NEON on ARM) for fast matrix operations
  • Memory-efficient — Supports quantization to dramatically reduce memory usage
  • Cross-platform — Works on Linux, macOS, Windows, Android, iOS

The JNI Bridge

Android apps are written in Kotlin/Java, but llama.cpp is C++. We bridge this gap using JNI (Java Native Interface):

  • C++ layer — llama.cpp compiled with the Android NDK for arm64-v8a and x86_64
  • JNI wrapper — C functions that translate between Java types and C++ types
  • Kotlin API — Clean Kotlin interface that the rest of the app uses

This architecture means the inference engine runs at near-native speed — no interpreter overhead, no virtual machine penalties.
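The three layers can be sketched as the plain-C boundary such a bridge typically exposes. Everything here is illustrative — the function names are hypothetical, not MyLLM's actual API, and the JNI layer proper (`JNIEnv*`, `jstring`-to-`char*` conversion) is omitted so the sketch stays self-contained:

```cpp
#include <cassert>
#include <string>

// Stand-in for llama.cpp's heavyweight context object.
struct LlamaContext {
    std::string model_path;
    bool loaded;
};

extern "C" {
// Plain-C symbols the Kotlin layer binds to. The context pointer crosses
// the boundary as a 64-bit integer handle (a Long on the Kotlin side).
long long myllm_init(const char* model_path) {
    auto* ctx = new LlamaContext{model_path, true};
    return reinterpret_cast<long long>(ctx);
}

bool myllm_is_loaded(long long handle) {
    return reinterpret_cast<LlamaContext*>(handle)->loaded;
}

void myllm_free(long long handle) {
    delete reinterpret_cast<LlamaContext*>(handle);
}
}
```

On the Kotlin side, such symbols would be bound with `external fun` declarations after a `System.loadLibrary(...)` call, with the opaque context handle held as a `Long`.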

Memory-Mapped Model Loading

One of llama.cpp's cleverest features is memory-mapped file loading (mmap). Instead of reading the entire model into RAM:

  • The model file is mapped to virtual memory
  • Only the needed portions are loaded into physical RAM
  • The OS handles paging automatically
  • This dramatically reduces startup time and memory usage

A 2.8 GB model file might occupy only 1.5–2 GB of physical RAM during inference.
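A minimal sketch of the same idea using POSIX `mmap`, which llama.cpp relies on under the hood (the helper and demo file below are illustrative, not MyLLM's actual loader):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Create a tiny demo file to stand in for a multi-gigabyte model.
void write_demo_file(const char* path, const char* data) {
    FILE* f = std::fopen(path, "wb");
    std::fputs(data, f);
    std::fclose(f);
}

// Map a file into virtual memory and read its first byte. No read() of
// the whole file happens; the kernel pages data in only when touched.
std::size_t mapped_first_byte(const char* path, unsigned char* out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st{};
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the fd is closed
    if (base == MAP_FAILED) return 0;
    *out = static_cast<unsigned char*>(base)[0];  // first page faults in here
    std::size_t size = st.st_size;
    munmap(base, st.st_size);
    return size;
}
```

Note that the file descriptor can be closed immediately after mapping; the mapping itself keeps the file accessible. Real GGUF files begin with the 4-byte magic `GGUF`, which is what a loader would check first.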

Quantization Explained

Full-precision models (FP16) are too large for phones. Quantization compresses the model by reducing numerical precision:

How It Works

  • FP16 (full precision): Each weight is a 16-bit float → Large files, best quality
  • Q8_0: Each weight is 8 bits → ~50% size reduction, near-original quality
  • Q5_K_M: Each weight is ~5.5 bits → ~65% reduction, very good quality
  • Q4_K_M: Each weight is ~4.5 bits → ~72% reduction, good quality

MyLLM defaults to Q4_K_M because it offers the best quality-to-size ratio. The quality loss is barely perceptible for most tasks.
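A back-of-envelope way to see the trade-off: file size is roughly parameters × bits-per-weight ÷ 8. The fractional bits-per-weight figures come from the table above (K-quants mix per-block scales with packed weights, so the average is not a whole number); exact GGUF sizes also include metadata and differ slightly:

```cpp
#include <cassert>

// Rough quantized-model size estimate in GiB (1 GiB = 2^30 bytes).
// Illustrative arithmetic only -- real GGUF files carry extra metadata.
double est_size_gib(double params_billions, double bits_per_weight) {
    double bytes = params_billions * 1e9 * bits_per_weight / 8.0;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For a hypothetical 4B-parameter model, FP16 works out to roughly 7.5 GiB while Q4_K_M lands near 2.1 GiB — the ~72% reduction quoted above, and the difference between "won't fit" and "runs comfortably" on a phone.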

The Inference Pipeline

When you send a message in MyLLM, here's what happens:

1. Tokenization

Your text is converted into tokens — numerical representations that the model understands. "Hello, how are you?" might become [15496, 11, 703, 527, 498, 30].

2. Prompt Formatting (ChatML)

MyLLM uses the ChatML format to structure conversations:

  • System prompt defines the AI's behavior
  • User messages are wrapped with special tokens
  • Assistant responses are properly delimited
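Concretely, ChatML wraps each turn in `<|im_start|>` / `<|im_end|>` markers. The helper below is an illustrative sketch — llama.cpp can also apply the chat template stored in GGUF metadata, so this may not be byte-for-byte what MyLLM sends:

```cpp
#include <cassert>
#include <string>

// Assemble a single-turn ChatML prompt. The <|im_start|>/<|im_end|>
// delimiters are the standard ChatML markers; the trailing assistant
// header is left open so the model continues from there.
std::string chatml_prompt(const std::string& system, const std::string& user) {
    return "<|im_start|>system\n" + system + "<|im_end|>\n"
           "<|im_start|>user\n" + user + "<|im_end|>\n"
           "<|im_start|>assistant\n";
}
```

Generation then proceeds from the trailing `<|im_start|>assistant` marker until the model emits `<|im_end|>`, which the app treats as end-of-turn.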

3. Forward Pass

The tokens are fed through the neural network's layers. On a modern phone with a 4B model:

  • ~32 transformer layers
  • Each layer: self-attention + feed-forward network
  • Billions of multiply-accumulate operations per token
  • NEON SIMD instructions process multiple values simultaneously
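A rough way to quantify "billions of multiply-accumulates": a dense forward pass performs on the order of one MAC per weight per generated token, so decode speed is bounded by the device's sustained MAC rate. The numbers below are illustrative estimates, not measurements of MyLLM:

```cpp
#include <cassert>

// ~1 multiply-accumulate per parameter per generated token.
double macs_per_token_billions(double params_billions) {
    return params_billions;
}

// Tokens/second if the CPU sustains a given GMAC/s rate (hypothetical
// figure -- real throughput also depends on memory bandwidth).
double tokens_per_second(double params_billions, double sustained_gmacs) {
    return sustained_gmacs / macs_per_token_billions(params_billions);
}
```

So a 4B model on hardware sustaining an effective 40 GMAC/s would decode around 10 tokens/second — in practice memory bandwidth, not raw compute, is often the tighter bound on phones.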

4. Sampling

The model outputs a probability distribution over its entire vocabulary for the next token. MyLLM combines several sampling strategies to pick one:

  • Temperature — Controls randomness (lower = more deterministic)
  • Top-K — Only considers the K most likely tokens
  • Top-P — Considers tokens until cumulative probability exceeds P
  • Repetition penalty — Discourages the model from repeating itself
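These knobs compose into one filtering pipeline. Here is a toy version — illustrative only, as llama.cpp's sampler chain is more elaborate, and a real sampler would finish by drawing randomly from the survivors:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Scale logits by 1/temperature, keep the top-k candidates, softmax them,
// then truncate to the smallest set whose cumulative probability exceeds
// top_p. Returns the surviving (token id, probability) pairs.
std::vector<std::pair<int, float>> filter_candidates(
        std::vector<float> logits, float temperature, int top_k, float top_p) {
    std::vector<int> ids(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    // Temperature: lower values sharpen the distribution.
    for (float& l : logits) l /= temperature;
    // Top-K: sort ids by logit, descending, and keep the first K.
    std::sort(ids.begin(), ids.end(),
              [&](int a, int b) { return logits[a] > logits[b]; });
    if (top_k < (int)ids.size()) ids.resize(top_k);
    // Softmax over the survivors (subtract max for numerical stability).
    float mx = logits[ids[0]], sum = 0.f;
    std::vector<float> probs(ids.size());
    for (size_t i = 0; i < ids.size(); ++i) {
        probs[i] = std::exp(logits[ids[i]] - mx);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    // Top-P: keep tokens until cumulative probability exceeds top_p.
    std::vector<std::pair<int, float>> out;
    float cum = 0.f;
    for (size_t i = 0; i < ids.size(); ++i) {
        out.push_back({ids[i], probs[i]});
        cum += probs[i];
        if (cum > top_p) break;
    }
    return out;
}
```

Lowering `top_p` (or `temperature`) shrinks the candidate pool toward the single most likely token, which is why low-temperature output feels deterministic.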

5. Token-by-Token Generation

Each new token is generated one at a time and fed back into the model to produce the next. This is why you see text stream into the chat word by word.
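The loop in miniature, with a toy stand-in for the full forward pass and sampler:

```cpp
#include <cassert>
#include <vector>

// Toy "model": in a real engine this would be a forward pass over the
// whole sequence followed by sampling. Here it just emits the length.
int next_token(const std::vector<int>& seq) {
    return (int)seq.size();
}

// The autoregressive loop: each step appends one token and feeds the
// extended sequence back in on the next iteration.
std::vector<int> generate(std::vector<int> seq, int n_new, int eos) {
    for (int i = 0; i < n_new; ++i) {
        int tok = next_token(seq);  // one full forward pass per token
        if (tok == eos) break;      // stop on end-of-sequence
        seq.push_back(tok);
    }
    return seq;
}
```

This per-token feedback is also why the KV cache matters so much: without it, every step would recompute attention over the entire prefix from scratch.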

Performance Optimizations

We've made several optimizations for mobile:

  • Thread pool management — Optimal thread count based on device capabilities
  • KV cache management — Efficient key-value cache for multi-turn conversations
  • Memory pressure handling — Graceful degradation when RAM is low
  • Thermal management — Reduced throughput when the device gets hot
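As an example of the thread-count decision, here is one common heuristic for big.LITTLE phones — an assumption for illustration, not MyLLM's actual policy:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Pick an inference thread count: leave one hardware thread free for the
// UI, and cap at the typical number of fast cores, since threads landing
// on slow cores can stall the matrix kernels rather than help them.
int pick_thread_count(int hardware_threads) {
    int n = std::min(hardware_threads - 1, 4);
    return std::max(n, 1);
}
```

In practice `hardware_threads` would come from `std::thread::hardware_concurrency()` on the C++ side, or `Runtime.getRuntime().availableProcessors()` on the Kotlin side.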

Open Source

Both llama.cpp and MyLLM are open source. You can explore the code, suggest improvements, or build your own AI apps on top of this foundation.

Download MyLLM AI →

MyLLM AI Team

Building the future of private, on-device AI. We believe AI should run on your phone, respect your privacy, and be free for everyone.
