Under the Hood: How llama.cpp Powers MyLLM
A technical look at how MyLLM uses llama.cpp, JNI, and GGUF quantization to run billion-parameter models on mobile hardware.
The Engine Behind MyLLM
At the heart of MyLLM is llama.cpp — Georgi Gerganov's incredible C++ library that makes LLM inference possible on consumer hardware. Let's dive into how we use it to run AI models on your Android phone.
What is llama.cpp?
llama.cpp is a pure C/C++ implementation of LLM inference. It was originally built to run Meta's LLaMA model on MacBooks, but has since grown into a universal inference engine supporting dozens of model architectures.
Key advantages:
- Pure C++ — No Python, no heavy frameworks
- CPU-optimized — Uses SIMD instructions (NEON on ARM) for fast matrix operations
- Memory-efficient — Supports quantization to dramatically reduce memory usage
- Cross-platform — Works on Linux, macOS, Windows, Android, iOS
The JNI Bridge
Android apps are written in Kotlin/Java, but llama.cpp is C++. We bridge this gap using JNI (Java Native Interface):
- C++ layer — llama.cpp compiled with the Android NDK for arm64-v8a and x86_64
- JNI wrapper — C functions that translate between Java types and C++ types
- Kotlin API — Clean Kotlin interface that the rest of the app uses
This architecture means the inference engine runs at near-native speed — no interpreter overhead, no virtual machine penalties.
Memory-Mapped Model Loading
One of llama.cpp's cleverest features is memory-mapped file loading (mmap). Instead of reading the entire model into RAM:
- The model file is mapped to virtual memory
- Only the needed portions are loaded into physical RAM
- The OS handles paging automatically
- This dramatically reduces startup time and memory usage
A 2.8 GB model file might only use 1.5-2 GB of actual RAM during inference.
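The mechanism is the standard POSIX `mmap` call, which llama.cpp applies to the whole GGUF file. A small sketch (Linux/Android-specific, not MyLLM's actual loader) shows the idea: the file is mapped into virtual memory, and physical pages are only pulled in when the code touches them.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file into virtual memory and read one byte, POSIX-style.
// No read() of the whole file happens; the kernel pages data in lazily
// as it is accessed, which is exactly how llama.cpp loads GGUF weights.
char first_byte_of(const char* path) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                              // the mapping survives close()
    char b = static_cast<char*>(base)[0];   // first touch triggers a page-in
    munmap(base, st.st_size);
    return b;
}
```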
Quantization Explained
Unquantized models (16-bit floats, FP16) are too large for phones. Quantization compresses the model by reducing numerical precision:
How It Works
- FP16 (unquantized baseline): Each weight is a 16-bit float → Large files, best quality
- Q8_0: Each weight is 8 bits → ~50% size reduction, near-original quality
- Q5_K_M: Each weight is ~5.5 bits → ~65% reduction, very good quality
- Q4_K_M: Each weight is ~4.5 bits → ~72% reduction, good quality
MyLLM defaults to Q4_K_M because it offers the best quality-to-size ratio. The quality loss is barely perceptible for most tasks.
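The arithmetic behind this is simple. A sketch of Q8_0-style quantization, patterned on llama.cpp's scheme (weights grouped into blocks of 32, one scale plus 32 int8 values per block; the scale is kept as a plain float here for simplicity):

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

// One Q8_0-style block: 32 weights compressed to 32 bytes + a scale.
struct BlockQ8 {
    float scale;
    std::array<int8_t, 32> q;
};

BlockQ8 quantize_block(const std::array<float, 32>& w) {
    float amax = 0.0f;
    for (float x : w) amax = std::max(amax, std::fabs(x));
    BlockQ8 b;
    b.scale = amax / 127.0f;   // map [-amax, amax] onto int8's [-127, 127]
    for (int i = 0; i < 32; ++i)
        b.q[i] = static_cast<int8_t>(
            std::lround(w[i] / (b.scale != 0.0f ? b.scale : 1.0f)));
    return b;
}

// Reconstruction at inference time: just q * scale per weight.
float dequantize(const BlockQ8& b, int i) {
    return b.q[i] * b.scale;
}
```

The 4-bit "K" variants add a second quantization level (scales of scales) to squeeze below 8 bits per weight, but the round-trip principle is the same.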
The Inference Pipeline
When you send a message in MyLLM, here's what happens:
1. Tokenization
Your text is converted into tokens — numerical representations that the model understands. "Hello, how are you?" might become [15496, 11, 703, 527, 498, 30].
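A toy illustration of that mapping. Real LLM tokenizers use subword algorithms (BPE or SentencePiece) rather than whole words, and the vocabulary below is invented purely to show the text → integer-IDs step:

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy word-level tokenizer: look each whitespace-separated word up in a
// vocabulary and emit its ID (0 stands in for an <unk> token). Real
// tokenizers split into subword pieces, but the output shape is the same.
std::vector<int> tokenize(const std::string& text,
                          const std::map<std::string, int>& vocab) {
    std::vector<int> ids;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        auto it = vocab.find(word);
        ids.push_back(it != vocab.end() ? it->second : 0);
    }
    return ids;
}
```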
2. Prompt Formatting (ChatML)
MyLLM uses the ChatML format to structure conversations:
- System prompt defines the AI's behavior
- User messages are wrapped with special tokens
- Assistant responses are properly delimited
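Concretely, ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers. A minimal single-turn prompt builder (a sketch, not MyLLM's actual template code):

```cpp
#include <string>

// Assemble a ChatML prompt for one user turn. The trailing
// "<|im_start|>assistant\n" is deliberately left open: it cues the
// model to generate the assistant's reply next.
std::string chatml_prompt(const std::string& system, const std::string& user) {
    return "<|im_start|>system\n" + system + "<|im_end|>\n"
           "<|im_start|>user\n"   + user   + "<|im_end|>\n"
           "<|im_start|>assistant\n";
}
```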
3. Forward Pass
The tokens are fed through the neural network's layers. On a modern phone with a 4B model:
- ~32 transformer layers
- Each layer: self-attention + feed-forward network
- Billions of multiply-accumulate operations per token
- NEON SIMD instructions process multiple values simultaneously
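Almost all of those operations are matrix-vector products: one multiply-accumulate (MAC) per weight. A naive scalar version shows the inner loop that SIMD accelerates; NEON processes 4 or more of these float lanes per instruction:

```cpp
#include <cstddef>
#include <vector>

// Naive matrix-vector product: the workhorse of a transformer forward
// pass. Each y[r] += W[r][c] * x[c] is one multiply-accumulate; a 4B
// model performs billions of these per generated token.
std::vector<float> matvec(const std::vector<std::vector<float>>& W,
                          const std::vector<float>& x) {
    std::vector<float> y(W.size(), 0.0f);
    for (size_t r = 0; r < W.size(); ++r)
        for (size_t c = 0; c < x.size(); ++c)
            y[r] += W[r][c] * x[c];
    return y;
}
```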
4. Sampling
The model outputs probabilities for the next token. MyLLM combines several standard sampling strategies:
- Temperature — Controls randomness (lower = more deterministic)
- Top-K — Only considers the K most likely tokens
- Top-P — Considers tokens until cumulative probability exceeds P
- Repetition penalty — Discourages the model from repeating itself
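The first three knobs compose into a filtering pipeline over the logits. A self-contained sketch (illustrative, not MyLLM's actual sampler) that applies temperature, then top-K, then top-P, and returns the surviving candidates the sampler would draw from:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Filter raw logits down to a candidate distribution:
// temperature -> softmax -> top-K -> top-P -> renormalize.
std::vector<std::pair<int, float>> filter_logits(std::vector<float> logits,
                                                 float temperature,
                                                 int top_k, float top_p) {
    // 1. Temperature: divide logits before softmax (lower = sharper).
    for (float& l : logits) l /= temperature;

    // Softmax (shifted by the max logit for numerical stability).
    float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<std::pair<int, float>> probs;
    float sum = 0.0f;
    for (int i = 0; i < static_cast<int>(logits.size()); ++i) {
        float p = std::exp(logits[i] - mx);
        probs.push_back({i, p});
        sum += p;
    }
    for (auto& pr : probs) pr.second /= sum;

    // 2. Top-K: keep only the K most likely tokens.
    std::sort(probs.begin(), probs.end(),
              [](auto& a, auto& b) { return a.second > b.second; });
    if (static_cast<int>(probs.size()) > top_k) probs.resize(top_k);

    // 3. Top-P: keep tokens until cumulative probability exceeds P.
    float cum = 0.0f;
    size_t keep = 0;
    while (keep < probs.size() && cum < top_p) cum += probs[keep++].second;
    probs.resize(keep);

    // Renormalize the survivors; the sampler draws the next token here.
    float s = 0.0f;
    for (auto& pr : probs) s += pr.second;
    for (auto& pr : probs) pr.second /= s;
    return probs;
}
```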
5. Token-by-Token Generation
Each new token is generated one at a time, fed back into the model to generate the next token. This is why you see text appearing word-by-word in the chat.
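The feedback loop in miniature. The "model" here is a stand-in function rather than a real network, but the structure — each step consumes the whole sequence so far and appends one token — is exactly the loop llama.cpp runs:

```cpp
#include <vector>

// Autoregressive decoding: generate n_new tokens, feeding each output
// back in as input for the next step. next_token stands in for a full
// forward pass + sampling.
std::vector<int> generate(std::vector<int> tokens, int n_new,
                          int (*next_token)(const std::vector<int>&)) {
    for (int i = 0; i < n_new; ++i)
        tokens.push_back(next_token(tokens));  // output fed back as input
    return tokens;
}
```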
Performance Optimizations
We've made several optimizations for mobile:
- Thread pool management — Optimal thread count based on device capabilities
- KV cache management — Efficient key-value cache for multi-turn conversations
- Memory pressure handling — Graceful degradation when RAM is low
- Thermal management — Reduced throughput when the device gets hot
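As one small example, thread-count selection can be derived from the hardware at runtime. The specific policy below (half the reported cores, capped at 4, to avoid scheduling inference threads onto a big.LITTLE phone's efficiency cores) is an illustrative heuristic, not MyLLM's actual tuning:

```cpp
#include <algorithm>
#include <thread>

// Pick an inference thread count from the hardware. Using every core is
// often counterproductive on phones: efficiency cores stall the fast
// ones, and full load accelerates thermal throttling.
unsigned inference_threads() {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;              // the API may return 0 if unknown
    return std::clamp(cores / 2, 1u, 4u);   // half the cores, between 1 and 4
}
```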
Open Source
Both llama.cpp and MyLLM are open source. You can explore the code, suggest improvements, or build your own AI apps on top of this foundation.
Download MyLLM AI →

MyLLM AI Team
Building the future of private, on-device AI. We believe AI should run on your phone, respect your privacy, and be free for everyone.