Models
The two curated Gemma 4 GGUFs Tacita ships, the Discover surface for additional Gemma-family models, and how the runtime planner picks settings for each device tier.
The curated models
Tacita ships with two pre-vetted models from the Gemma 4 instruction-tuned family. Both are quantised to Q4_K_M and distributed by Unsloth on Hugging Face. The download UI verifies the SHA-256 of the file before declaring the model ready to load.
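The hash check described above can be sketched as follows. This is a minimal illustration, not Tacita's actual download code: the function names `sha256_of_file` and `is_ready_to_load` are hypothetical, and the pinned hash is assumed to be stored as a hex string.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read 1 MiB at a time so a multi-GiB GGUF never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_ready_to_load(path: Path, pinned_sha256: str) -> bool:
    """True only when the file on disk matches the pinned hex digest."""
    # Hypothetical helper: compares case-insensitively against the pinned value.
    return sha256_of_file(path) == pinned_sha256.strip().lower()
```

Streaming the file in chunks matters here because the curated files are several GiB, which is comparable to the total RAM of the smaller supported devices.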
Light
| File | gemma-4-E2B-it-Q4_K_M.gguf |
|---|---|
| Size | ~2.89 GiB on disk |
| Family | Gemma 4 E2B (instruction-tuned) |
| Quantisation | Q4_K_M |
| Host | unsloth/gemma-4-E2B-it-GGUF |
| Chat template | gemma |
| Minimum device RAM | 4 GB |
| Tier | Free and Pro |
Pro
| File | gemma-4-E4B-it-Q4_K_M.gguf |
|---|---|
| Size | ~4.64 GiB on disk |
| Family | Gemma 4 E4B (instruction-tuned) |
| Quantisation | Q4_K_M |
| Host | unsloth/gemma-4-E4B-it-GGUF |
| Chat template | gemma |
| Minimum device RAM | 8 GB |
| Tier | Pro only |
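As an illustration, the two tables above could be encoded as a small registry and filtered by device RAM and subscription. This is a sketch under assumptions: the `CuratedModel` type, the `available_models` function, and the lowercase plan names are hypothetical, and only the file names, repos, RAM minimums, and tier rows come from the tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CuratedModel:
    name: str
    file: str
    repo: str
    min_ram_gb: int
    plans: frozenset  # subscription plans from the "Tier" row, not device tiers


# Values copied from the Light and Pro tables.
CURATED = (
    CuratedModel("Light", "gemma-4-E2B-it-Q4_K_M.gguf",
                 "unsloth/gemma-4-E2B-it-GGUF", 4, frozenset({"free", "pro"})),
    CuratedModel("Pro", "gemma-4-E4B-it-Q4_K_M.gguf",
                 "unsloth/gemma-4-E4B-it-GGUF", 8, frozenset({"pro"})),
)


def available_models(device_ram_gb: float, plan: str) -> list:
    """Names of curated models the device and subscription both allow."""
    return [
        m.name for m in CURATED
        if device_ram_gb >= m.min_ram_gb and plan.lower() in m.plans
    ]
```

For example, `available_models(8, "free")` yields only the Light model, because the Pro model is gated by subscription even when the RAM minimum is met.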
Discover (Pro only)
Pro users see a Discover surface that lists other Gemma-family GGUFs hosted on Hugging Face, across every generation: 1, 2, 3, 3n, and 4. There is no architecture pre-filter; whether the engine can actually load a model is the source of truth. The two curated repos are excluded from the Discover list so the surface always adds something new.
The user can also paste a Hugging Face repository URL directly. The download UI verifies the file's hash, stores the file under the app's documents directory, and registers it with the runtime planner.
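A rough sketch of the two behaviours described in this section: turning a pasted URL into a repository id, and dropping the curated repos from a Discover listing. The function names are hypothetical, the `example-org` repo in the usage note is made up, and the real app may normalise URLs differently.

```python
from urllib.parse import urlparse

# The two curated repos named earlier on this page.
CURATED_REPOS = {"unsloth/gemma-4-E2B-it-GGUF", "unsloth/gemma-4-E4B-it-GGUF"}

# Path prefixes that are not model repositories (assumption for this sketch).
_NON_MODEL_PREFIXES = {"datasets", "spaces"}


def repo_id_from_url(url: str) -> str:
    """Extract 'owner/name' from a pasted Hugging Face model repository URL."""
    parsed = urlparse(url.strip())
    if parsed.scheme != "https" or parsed.netloc != "huggingface.co":
        raise ValueError(f"not a Hugging Face URL: {url!r}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2 or parts[0] in _NON_MODEL_PREFIXES:
        raise ValueError(f"not a model repository URL: {url!r}")
    # Anything after owner/name (for example /tree/main) is ignored.
    return f"{parts[0]}/{parts[1]}"


def discover_list(repo_ids: list) -> list:
    """Drop the curated repos so Discover only shows additional models."""
    return [r for r in repo_ids if r not in CURATED_REPOS]
```

With these definitions, `discover_list(["unsloth/gemma-4-E2B-it-GGUF", "example-org/gemma-3-GGUF"])` keeps only the second, hypothetical entry.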
Device tier matrix
The runtime planner picks n_threads, KV-cache K and V
quantisation, and the recommended max context based on the RAM
the device actually has. Recommended max context is what a typical
chat will land at; the planner can go higher when free RAM
permits, capped at the model's native maximum.
| Device class | Tier | n_threads | KV K | KV V | Light max ctx | Pro max ctx |
|---|---|---|---|---|---|---|
| Pixel 8 Pro (Tensor G3, 12 GB) | Flagship | 6 | f16 | f16 | ~8K (native) | ~8K (native) |
| iPhone 16 Pro (8 GB) | Flagship | 6 | f16 | f16 | ~8K | ~6–8K |
| Galaxy S24 (8–12 GB) | Flagship | 6 | f16 | f16 | ~8K | ~6–8K |
| Pixel 7a (Tensor G2, 8 GB) | Mid-range | 4 | q8_0 | q8_0 | ~6–8K | ~3–4K |
| iPhone 13 (4 GB) | Mid-range | 5 (iOS) | q8_0 | q8_0 | ~3–4K | not recommended |
| Galaxy A54 (6 GB) | Mid-range | 4 | q8_0 | q8_0 | ~4–6K | ~2–3K |
| 4 GB low-end Android | Low-end | 2 | q8_0 | q8_0 | ~1–2K | not recommended |
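The per-tier settings in the matrix can be read as a simple lookup. The sketch below is illustrative only: `KvPlan` and `plan_for` are hypothetical names, the device tier is taken as an input because the matrix does not state how a device is classified, and treating every mid-range iOS device like the iPhone 13 row (5 threads) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KvPlan:
    n_threads: int
    kv_k: str  # KV-cache type for keys
    kv_v: str  # KV-cache type for values


# Values copied from the device tier matrix above.
_PLANS = {
    "flagship": KvPlan(6, "f16", "f16"),
    "mid-range": KvPlan(4, "q8_0", "q8_0"),
    "low-end": KvPlan(2, "q8_0", "q8_0"),
}


def plan_for(tier: str, is_ios: bool = False) -> KvPlan:
    """Look up thread count and KV-cache types for a device tier."""
    key = tier.lower()
    plan = _PLANS[key]  # raises KeyError for an unknown tier
    if key == "mid-range" and is_ios:
        # Assumption: generalises the single "5 (iOS)" row to all mid-range iOS.
        return KvPlan(5, plan.kv_k, plan.kv_v)
    return plan
```

The recommended context ranges are left out of the lookup because they depend on the model and on free RAM at load time, not on the tier alone.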
On flagships we keep the KV cache at f16 because RAM
headroom is ample and quality matters more than effective context.
On mid-range and low-end devices we drop to q8_0,
which roughly doubles effective context at about 3% perplexity cost on
public benchmarks.
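The context gain follows from the storage cost of each cache type. The arithmetic below assumes the standard llama.cpp layouts: f16 stores each value in 2 bytes, and q8_0 stores blocks of 32 int8 values plus one 2-byte scale. The function names are hypothetical.

```python
F16_BYTES_PER_VALUE = 2.0
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 32 + 2  # 32 int8 values plus one f16 scale per block


def bytes_per_value(kv_type: str) -> float:
    """Average storage cost of one cached value for a KV-cache type."""
    if kv_type == "f16":
        return F16_BYTES_PER_VALUE
    if kv_type == "q8_0":
        return Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES  # 1.0625 bytes
    raise ValueError(f"unsupported KV-cache type: {kv_type!r}")


def context_gain(from_type: str = "f16", to_type: str = "q8_0") -> float:
    """How many times more tokens fit in the same KV-cache memory."""
    return bytes_per_value(from_type) / bytes_per_value(to_type)


print(round(context_gain(), 2))  # 64 / 34, about 1.88
```

The gain is slightly under 2x because of the per-block scale, which is why the doubling is approximate.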
Why Gemma 4
Gemma 4 sits in a sweet spot for mobile inference: open weights,
permissive licensing for distribution as a GGUF, strong
instruction-following at small parameter counts, and a chat
template fllama already implements. The thought channel that
Tacita exposes as the reasoning trace is a Gemma 4 affordance,
parsed by Tacita's own <|channel>thought handler.
Other architectures (Llama-3, Mistral, Phi-3, Qwen) work in llama.cpp but are not currently first-class in Tacita: the runtime planner is calibrated for Gemma's KV layout and the sampling presets are tuned for Gemma's behaviour. Pro's Discover surface is intentionally Gemma-only for this reason.
Update policy
Curated model URLs and hashes are pinned in the app and verified on download. Changing a quantisation or family also requires updating the per-token KV-cache size, the engine overhead bytes, and the native max context the runtime planner uses for sizing. None of this happens silently; a model swap rides on a regular app release.
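To show why those three values have to move together with the model, here is one way a planner might turn them into a context size. This is a sketch, not Tacita's sizing code: the formula, the function name, the assumption that the model weights count against free RAM, and every number in the usage note are hypothetical.

```python
def max_context_tokens(
    free_ram_bytes: int,
    model_bytes: int,
    engine_overhead_bytes: int,
    kv_bytes_per_token: int,
    native_max_ctx: int,
) -> int:
    """Largest context that fits in RAM, capped at the model's native maximum."""
    if kv_bytes_per_token <= 0:
        raise ValueError("kv_bytes_per_token must be positive")
    # Memory left for the KV cache after weights and engine overhead.
    budget = free_ram_bytes - model_bytes - engine_overhead_bytes
    if budget <= 0:
        return 0  # the model does not fit at all
    return min(native_max_ctx, budget // kv_bytes_per_token)
```

With illustrative inputs of 4 GiB free RAM, a 3 GiB model, 512 MiB of engine overhead, 256 KiB of KV cache per token, and a native maximum of 8192, the result is 2048 tokens. A stale per-token size or overhead figure after a quantisation change would skew this budget, which is consistent with pinning the values per release.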
Related: the privacy architecture page for the storage layer the model files sit on, the glossary for definitions of GGUF, quantisation, and KV cache, and the FAQ for direct answers to common model questions.