Models

The two curated Gemma 4 GGUFs Tacita ships, the Discover surface for additional Gemma-family models, and how the runtime planner picks settings for each device tier.

The curated models

Tacita ships with two pre-vetted models from the Gemma 4 instruction-tuned family. Both are quantised to Q4_K_M and distributed by Unsloth on Hugging Face. The download UI verifies the SHA-256 of the file before declaring the model ready to load.
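The hash check described above can be sketched as follows. This is a minimal illustration, not Tacita's actual download code: the function names and the chunk size are assumptions, and the pinned hash is whatever value the app carries for the file.

```python
import hashlib
from pathlib import Path

# Read 1 MiB at a time so a multi-GiB GGUF is never held in memory at once.
# The chunk size is an arbitrary choice for this sketch.
CHUNK_BYTES = 1024 * 1024


def sha256_of_file(path: Path) -> str:
    """Return the lowercase hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(CHUNK_BYTES):
            digest.update(chunk)
    return digest.hexdigest()


def is_verified(path: Path, pinned_sha256: str) -> bool:
    """True only if the downloaded file matches the pinned hash."""
    return sha256_of_file(path) == pinned_sha256.strip().lower()
```

A caller would mark the model as ready to load only when `is_verified` returns True, and discard the file otherwise.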

Light

File: gemma-4-E2B-it-Q4_K_M.gguf
Size: ~2.89 GiB on disk
Family: Gemma 4 E2B (instruction-tuned)
Quantisation: Q4_K_M
Host: unsloth/gemma-4-E2B-it-GGUF
Chat template: gemma
Minimum device RAM: 4 GB
Tier: Free and Pro

Pro

File: gemma-4-E4B-it-Q4_K_M.gguf
Size: ~4.64 GiB on disk
Family: Gemma 4 E4B (instruction-tuned)
Quantisation: Q4_K_M
Host: unsloth/gemma-4-E4B-it-GGUF
Chat template: gemma
Minimum device RAM: 8 GB
Tier: Pro only

Discover (Pro only)

Pro users see a Discover surface that lists other Gemma-family GGUFs hosted on Hugging Face, across every generation (1, 2, 3, 3n, and 4). There is no architecture pre-filter: whether a model is usable is decided by whether the engine can load it. The two curated repos are excluded from the Discover list so the surface always adds something new.

The user can also paste a Hugging Face repository URL directly. The download UI verifies the file hash, stores the file under the app's documents directory, and registers it with the runtime planner.
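As a rough illustration of handling a pasted URL, the sketch below extracts the repository id and builds a direct file URL. The function names are hypothetical and the validation rules are assumptions made for this example; only the `https://huggingface.co/<owner>/<name>/resolve/<revision>/<file>` URL shape comes from Hugging Face itself.

```python
from urllib.parse import urlparse

# Path prefixes that name non-model repositories on the Hub.
NON_MODEL_PREFIXES = {"datasets", "spaces"}


def repo_id_from_url(url: str) -> str:
    """Extract 'owner/name' from a pasted Hugging Face model repository URL."""
    parsed = urlparse(url.strip())
    if parsed.scheme != "https" or parsed.netloc != "huggingface.co":
        raise ValueError(f"not a Hugging Face URL: {url!r}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2 or parts[0] in NON_MODEL_PREFIXES:
        raise ValueError(f"URL does not name a model repository: {url!r}")
    return f"{parts[0]}/{parts[1]}"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for one file in a repository."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"
```

Trailing path segments such as `/tree/main` are ignored, so a URL copied from the file browser still resolves to the same repository id.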

Device tier matrix

The runtime planner picks n_threads, KV-cache K and V quantisation, and the recommended max context based on the RAM the device actually has. Recommended max context is what a typical chat will land at; the planner can go higher when free RAM permits, capped at the model's native maximum.

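Device class                   | Tier      | n_threads | KV K | KV V | Light max ctx | Pro max ctx
-------------------------------|-----------|-----------|------|------|---------------|----------------
Pixel 8 Pro (Tensor G3, 12 GB) | Flagship  | 6         | f16  | f16  | ~8K (native)  | ~8K (native)
iPhone 16 Pro (8 GB)           | Flagship  | 6         | f16  | f16  | ~8K           | ~6–8K
Galaxy S24 (8–12 GB)           | Flagship  | 6         | f16  | f16  | ~8K           | ~6–8K
Pixel 7a (Tensor G2, 8 GB)     | Mid-range | 4         | q8_0 | q8_0 | ~6–8K         | ~3–4K
iPhone 13 (4 GB)               | Mid-range | 5 (iOS)   | q8_0 | q8_0 | ~3–4K         | not recommended
Galaxy A54 (6 GB)              | Mid-range | 4         | q8_0 | q8_0 | ~4–6K         | ~2–3K
4 GB low-end Android           | Low-end   | 2         | q8_0 | q8_0 | ~1–2K         | not recommended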

On flagships we keep the KV cache at f16 because RAM headroom is ample and quality matters more than effective context. On mid-range and low-end devices we drop to q8_0, which roughly doubles effective context for the same memory at roughly 3% perplexity cost on public benchmarks.
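The per-tier defaults and the q8_0 saving can be sketched as a small lookup plus one ratio. The tier values are copied from the matrix; the names `TierDefaults`, `TIER_DEFAULTS`, and `context_gain` are made up for this example and are not Tacita's planner API. The byte sizes follow the ggml block layout, where q8_0 stores 32 int8 values plus one f16 scale per block.

```python
from dataclasses import dataclass

# Bytes per cached element. f16 stores 2 bytes per value. q8_0 stores blocks
# of 32 int8 values plus one 2-byte f16 scale, i.e. 34 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


@dataclass(frozen=True)
class TierDefaults:
    n_threads: int
    kv_type: str  # applied to both K and V, as in the matrix


# Values taken from the device tier matrix. The iOS mid-range row uses
# 5 threads rather than 4; that platform override is not modelled here.
TIER_DEFAULTS = {
    "flagship": TierDefaults(n_threads=6, kv_type="f16"),
    "mid-range": TierDefaults(n_threads=4, kv_type="q8_0"),
    "low-end": TierDefaults(n_threads=2, kv_type="q8_0"),
}


def context_gain(from_type: str, to_type: str) -> float:
    """How many times more tokens fit in the same KV budget after switching type."""
    return BYTES_PER_ELEMENT[from_type] / BYTES_PER_ELEMENT[to_type]
```

Under these block sizes, `context_gain("f16", "q8_0")` comes to 2 / 1.0625, about 1.88, which is the figure the text rounds to a doubling.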

Why Gemma 4

Gemma 4 sits in a sweet spot for mobile inference: open weights, permissive licensing for distribution as a GGUF, strong instruction-following at small parameter counts, and a chat template fllama already implements. The thought channel that Tacita exposes as the reasoning trace is a Gemma 4 affordance, parsed by Tacita's own <|channel>thought handler.

Other architectures (Llama-3, Mistral, Phi-3, Qwen) work in llama.cpp but are not currently first-class in Tacita: the runtime planner is calibrated for Gemma's KV layout and the sampling presets are tuned for Gemma's behaviour. Pro's Discover surface is intentionally Gemma-only for this reason.

Update policy

Curated model URLs and hashes are pinned in the app and verified on download. Changing a quantisation or family also requires updating the per-token KV-cache size, the engine overhead bytes, and the native max context the runtime planner uses for sizing. None of this happens silently; a model swap rides on a regular app release.
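To show why those three numbers have to change together with the model, here is one plausible way a planner could combine them into a context limit. This is an assumed formula for illustration only; the function and parameter names are hypothetical and the real planner may account for memory differently.

```python
def max_context_tokens(
    free_ram_bytes: int,
    model_bytes: int,
    engine_overhead_bytes: int,
    kv_bytes_per_token: float,
    native_max_context: int,
) -> int:
    """Largest context that fits in RAM, capped at the model's native maximum.

    Assumes the model weights and the engine overhead are fully resident,
    and that whatever RAM remains is available to the KV cache.
    """
    budget = free_ram_bytes - model_bytes - engine_overhead_bytes
    if budget <= 0:
        return 0  # the model itself does not fit; no context is possible
    return min(native_max_context, int(budget // kv_bytes_per_token))
```

A stale per-token KV size or overhead figure would skew the result in either direction, which is consistent with tying model swaps to an app release rather than a silent update.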


Related: the privacy architecture page for the storage layer the model files sit on, the glossary for definitions of GGUF, quantisation, and KV cache, and the FAQ for direct answers to common model questions.