Technical · February 8, 2026 · 7 min read

Under the Hood: How llama.cpp Powers MyLLM

A technical look at how MyLLM uses llama.cpp, JNI, and GGUF quantization to run billion-parameter models on mobile hardware.

The Engine Behind MyLLM

At the heart of MyLLM is llama.cpp — Georgi Gerganov's incredible C++ library that makes LLM inference possible on consumer hardware. Let's dive into how we use it to run AI models on your Android phone.

What is llama.cpp?

llama.cpp is a pure C/C++ implementation of LLM inference. It was originally built to run Meta's LLaMA model on MacBooks, but has since grown into a universal inference engine supporting dozens of model architectures.

Key advantages:

  • Pure C++ — No Python, no heavy frameworks
  • CPU-optimized — Uses SIMD instructions (NEON on ARM) for fast matrix operations
  • Memory-efficient — Supports quantization to dramatically reduce memory usage
  • Cross-platform — Works on Linux, macOS, Windows, Android, iOS

The JNI Bridge

Android apps are written in Kotlin/Java, but llama.cpp is C++. We bridge this gap using JNI (Java Native Interface):

  • C++ layer — llama.cpp compiled with the Android NDK for arm64-v8a and x86_64
  • JNI wrapper — C functions that translate between Java types and C++ types
  • Kotlin API — Clean Kotlin interface that the rest of the app uses

This architecture means the inference engine runs at near-native speed — no interpreter overhead, no virtual machine penalties.
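The three layers can be sketched as the plain-C boundary such a bridge typically exposes. Everything here is illustrative — the function names are hypothetical, not MyLLM's actual API, and the JNI layer proper (`JNIEnv*`, `jstring`-to-`char*` conversion) is omitted so the sketch stays self-contained:

```cpp
#include <cassert>
#include <string>

// Stand-in for llama.cpp's heavyweight context object.
struct LlamaContext {
    std::string model_path;
    bool loaded;
};

extern "C" {
// Plain-C symbols the Kotlin layer binds to. The context pointer crosses
// the boundary as a 64-bit integer handle (a Long on the Kotlin side).
long long myllm_init(const char* model_path) {
    auto* ctx = new LlamaContext{model_path, true};
    return reinterpret_cast<long long>(ctx);
}

bool myllm_is_loaded(long long handle) {
    return reinterpret_cast<LlamaContext*>(handle)->loaded;
}

void myllm_free(long long handle) {
    delete reinterpret_cast<LlamaContext*>(handle);
}
}
```

On the Kotlin side, such symbols would be bound with `external fun` declarations after a `System.loadLibrary(...)` call, with the opaque context handle held as a `Long`.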

Memory-Mapped Model Loading

One of llama.cpp's cleverest features is memory-mapped file loading (mmap). Instead of reading the entire model into RAM:

  • The model file is mapped to virtual memory
  • Only the needed portions are loaded into physical RAM
  • The OS handles paging automatically
  • This dramatically reduces startup time and memory usage

A 2.8 GB model file might occupy only 1.5–2 GB of physical RAM during inference.
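A minimal sketch of the same idea using POSIX `mmap`, which llama.cpp relies on under the hood (the helper and demo file below are illustrative, not MyLLM's actual loader):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Create a tiny demo file to stand in for a multi-gigabyte model.
void write_demo_file(const char* path, const char* data) {
    FILE* f = std::fopen(path, "wb");
    std::fputs(data, f);
    std::fclose(f);
}

// Map a file into virtual memory and read its first byte. No read() of
// the whole file happens; the kernel pages data in only when touched.
std::size_t mapped_first_byte(const char* path, unsigned char* out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st{};
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the fd is closed
    if (base == MAP_FAILED) return 0;
    *out = static_cast<unsigned char*>(base)[0];  // first page faults in here
    std::size_t size = st.st_size;
    munmap(base, st.st_size);
    return size;
}
```

Note that the file descriptor can be closed immediately after mapping; the mapping itself keeps the file accessible. Real GGUF files begin with the 4-byte magic `GGUF`, which is what a loader would check first.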

Quantization Explained

Full-precision models (FP16) are too large for phones. Quantization compresses the model by reducing numerical precision:

How It Works

  • FP16 (full precision): Each weight is a 16-bit float → Large files, best quality
  • Q8_0: Each weight is 8 bits → ~50% size reduction, near-original quality
  • Q5_K_M: Each weight is ~5.5 bits → ~65% reduction, very good quality
  • Q4_K_M: Each weight is ~4.5 bits → ~72% reduction, good quality

MyLLM defaults to Q4_K_M because it offers the best quality-to-size ratio. The quality loss is barely perceptible for most tasks.
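A back-of-envelope way to see the trade-off: file size is roughly parameters × bits-per-weight ÷ 8. The fractional bits-per-weight figures come from the table above (K-quants mix per-block scales with packed weights, so the average is not a whole number); exact GGUF sizes also include metadata and differ slightly:

```cpp
#include <cassert>

// Rough quantized-model size estimate in GiB (1 GiB = 2^30 bytes).
// Illustrative arithmetic only -- real GGUF files carry extra metadata.
double est_size_gib(double params_billions, double bits_per_weight) {
    double bytes = params_billions * 1e9 * bits_per_weight / 8.0;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For a hypothetical 4B-parameter model, FP16 works out to roughly 7.5 GiB while Q4_K_M lands near 2.1 GiB — the ~72% reduction quoted above, and the difference between "won't fit" and "runs comfortably" on a phone.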

The Inference Pipeline

When you send a message in MyLLM, here's what happens:

1. Tokenization

Your text is converted into tokens — numerical representations that the model understands. "Hello, how are you?" might become [15496, 11, 703, 527, 498, 30].

2. Prompt Formatting (ChatML)

MyLLM uses the ChatML format to structure conversations:

  • System prompt defines the AI's behavior
  • User messages are wrapped with special tokens
  • Assistant responses are properly delimited
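Concretely, ChatML wraps each turn in `<|im_start|>` / `<|im_end|>` markers. The helper below is an illustrative sketch — llama.cpp can also apply the chat template stored in GGUF metadata, so this may not be byte-for-byte what MyLLM sends:

```cpp
#include <cassert>
#include <string>

// Assemble a single-turn ChatML prompt. The <|im_start|>/<|im_end|>
// delimiters are the standard ChatML markers; the trailing assistant
// header is left open so the model continues from there.
std::string chatml_prompt(const std::string& system, const std::string& user) {
    return "<|im_start|>system\n" + system + "<|im_end|>\n"
           "<|im_start|>user\n" + user + "<|im_end|>\n"
           "<|im_start|>assistant\n";
}
```

Generation then proceeds from the trailing `<|im_start|>assistant` marker until the model emits `<|im_end|>`, which the app treats as end-of-turn.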

3. Forward Pass

The tokens are fed through the neural network's layers. On a modern phone with a 4B model:

  • ~32 transformer layers
  • Each layer: self-attention + feed-forward network
  • Billions of multiply-accumulate operations per token
  • NEON SIMD instructions process multiple values simultaneously
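A rough way to quantify "billions of multiply-accumulates": a dense forward pass performs on the order of one MAC per weight per generated token, so decode speed is bounded by the device's sustained MAC rate. The numbers below are illustrative estimates, not measurements of MyLLM:

```cpp
#include <cassert>

// ~1 multiply-accumulate per parameter per generated token.
double macs_per_token_billions(double params_billions) {
    return params_billions;
}

// Tokens/second if the CPU sustains a given GMAC/s rate (hypothetical
// figure -- real throughput also depends on memory bandwidth).
double tokens_per_second(double params_billions, double sustained_gmacs) {
    return sustained_gmacs / macs_per_token_billions(params_billions);
}
```

So a 4B model on hardware sustaining an effective 40 GMAC/s would decode around 10 tokens/second — in practice memory bandwidth, not raw compute, is often the tighter bound on phones.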

4. Sampling

The model outputs a probability distribution over its entire vocabulary for the next token. MyLLM combines several sampling strategies to pick one:

  • Temperature — Controls randomness (lower = more deterministic)
  • Top-K — Only considers the K most likely tokens
  • Top-P — Considers tokens until cumulative probability exceeds P
  • Repetition penalty — Discourages the model from repeating itself
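These knobs compose into one filtering pipeline. Here is a toy version — illustrative only, as llama.cpp's sampler chain is more elaborate, and a real sampler would finish by drawing randomly from the survivors:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Scale logits by 1/temperature, keep the top-k candidates, softmax them,
// then truncate to the smallest set whose cumulative probability exceeds
// top_p. Returns the surviving (token id, probability) pairs.
std::vector<std::pair<int, float>> filter_candidates(
        std::vector<float> logits, float temperature, int top_k, float top_p) {
    std::vector<int> ids(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    // Temperature: lower values sharpen the distribution.
    for (float& l : logits) l /= temperature;
    // Top-K: sort ids by logit, descending, and keep the first K.
    std::sort(ids.begin(), ids.end(),
              [&](int a, int b) { return logits[a] > logits[b]; });
    if (top_k < (int)ids.size()) ids.resize(top_k);
    // Softmax over the survivors (subtract max for numerical stability).
    float mx = logits[ids[0]], sum = 0.f;
    std::vector<float> probs(ids.size());
    for (size_t i = 0; i < ids.size(); ++i) {
        probs[i] = std::exp(logits[ids[i]] - mx);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    // Top-P: keep tokens until cumulative probability exceeds top_p.
    std::vector<std::pair<int, float>> out;
    float cum = 0.f;
    for (size_t i = 0; i < ids.size(); ++i) {
        out.push_back({ids[i], probs[i]});
        cum += probs[i];
        if (cum > top_p) break;
    }
    return out;
}
```

Lowering `top_p` (or `temperature`) shrinks the candidate pool toward the single most likely token, which is why low-temperature output feels deterministic.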

5. Token-by-Token Generation

Each new token is generated one at a time and fed back into the model to produce the next. This is why you see text stream into the chat word by word.
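The loop in miniature, with a toy stand-in for the full forward pass and sampler:

```cpp
#include <cassert>
#include <vector>

// Toy "model": in a real engine this would be a forward pass over the
// whole sequence followed by sampling. Here it just emits the length.
int next_token(const std::vector<int>& seq) {
    return (int)seq.size();
}

// The autoregressive loop: each step appends one token and feeds the
// extended sequence back in on the next iteration.
std::vector<int> generate(std::vector<int> seq, int n_new, int eos) {
    for (int i = 0; i < n_new; ++i) {
        int tok = next_token(seq);  // one full forward pass per token
        if (tok == eos) break;      // stop on end-of-sequence
        seq.push_back(tok);
    }
    return seq;
}
```

This per-token feedback is also why the KV cache matters so much: without it, every step would recompute attention over the entire prefix from scratch.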

Performance Optimizations

We've made several optimizations for mobile:

  • Thread pool management — Optimal thread count based on device capabilities
  • KV cache management — Efficient key-value cache for multi-turn conversations
  • Memory pressure handling — Graceful degradation when RAM is low
  • Thermal management — Reduced throughput when the device gets hot
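As an example of the thread-count decision, here is one common heuristic for big.LITTLE phones — an assumption for illustration, not MyLLM's actual policy:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Pick an inference thread count: leave one hardware thread free for the
// UI, and cap at the typical number of fast cores, since threads landing
// on slow cores can stall the matrix kernels rather than help them.
int pick_thread_count(int hardware_threads) {
    int n = std::min(hardware_threads - 1, 4);
    return std::max(n, 1);
}
```

In practice `hardware_threads` would come from `std::thread::hardware_concurrency()` on the C++ side, or `Runtime.getRuntime().availableProcessors()` on the Kotlin side.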

Open Source

Both llama.cpp and MyLLM are open source. You can explore the code, suggest improvements, or build your own AI apps on top of this foundation.

Download MyLLM AI →

MyLLM AI Team

Building the future of private, on-device AI. We believe AI should run on your phone, respect your privacy, and be free for everyone.
