Models

This page covers the two curated Gemma 4 GGUFs that Tacita ships, the Discover surface for additional Gemma-family models, and how the runtime planner picks settings for each device tier.

The curated models

Tacita ships with two pre-vetted models from the Gemma 4 instruction-tuned family. Both are quantised to Q4_K_M and distributed by Unsloth on Hugging Face. The download UI verifies the SHA-256 of the file before declaring the model ready to load.
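The hash check described above can be sketched as follows. This is a minimal illustration, not Tacita's actual download code: the function names and the chunk size are assumptions, and the pinned digest is whatever value the app ships for the file.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in chunks so a
    multi-GiB GGUF never has to fit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_ready_to_load(path: Path, pinned_sha256: str) -> bool:
    """True only when the downloaded file matches the pinned digest.

    `pinned_sha256` is a hypothetical parameter standing in for the
    hash the app pins for each curated model.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, pinned_sha256.strip().lower())
```

Reading in fixed-size chunks keeps memory use flat regardless of model size, which matters on the 4 GB devices the Light model targets.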

Light

File	`gemma-4-E2B-it-Q4_K_M.gguf`
Size	~2.89 GiB on disk
Family	Gemma 4 E2B (instruction-tuned)
Quantisation	Q4_K_M
Host	`unsloth/gemma-4-E2B-it-GGUF`
Chat template	`gemma`
Minimum device RAM	4 GB
Tier	Free and Pro

Pro

File	`gemma-4-E4B-it-Q4_K_M.gguf`
Size	~4.64 GiB on disk
Family	Gemma 4 E4B (instruction-tuned)
Quantisation	Q4_K_M
Host	`unsloth/gemma-4-E4B-it-GGUF`
Chat template	`gemma`
Minimum device RAM	8 GB
Tier	Pro only
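As a rough sketch, the two entries above could be modelled as records and filtered by device RAM and subscription tier. The field values come from the tables; the class name, field names, and the `eligible_models` helper are illustrative assumptions, not Tacita's internal types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CuratedModel:
    name: str
    file: str
    repo: str
    size_gib: float      # approximate size on disk
    min_ram_gb: int      # minimum device RAM
    pro_only: bool       # False means available on Free and Pro


CURATED = (
    CuratedModel("Light", "gemma-4-E2B-it-Q4_K_M.gguf",
                 "unsloth/gemma-4-E2B-it-GGUF", 2.89, 4, False),
    CuratedModel("Pro", "gemma-4-E4B-it-Q4_K_M.gguf",
                 "unsloth/gemma-4-E4B-it-GGUF", 4.64, 8, True),
)


def eligible_models(device_ram_gb: float, is_pro_user: bool) -> list[str]:
    """Names of curated models this device and subscription can use."""
    return [
        m.name
        for m in CURATED
        if device_ram_gb >= m.min_ram_gb and (is_pro_user or not m.pro_only)
    ]
```

For example, `eligible_models(8, is_pro_user=False)` yields only the Light model, because the Pro model is restricted to Pro subscribers even when the RAM requirement is met.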

Discover (Pro only)

Pro users see a Discover surface that lists other Gemma-family GGUFs hosted on Hugging Face, across every generation (1, 2, 3, 3n, and 4). There is no architecture pre-filter: whether a listed model is usable is decided by whether the engine loads it successfully. The two curated repos are excluded from the Discover list so the surface always adds something new.

The user can also paste a Hugging Face repository URL directly. The download UI verifies the file hash, stores the file under the app's documents directory, and registers it with the runtime planner.
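A minimal sketch of handling a pasted URL and of excluding the curated repos from Discover is shown below. The helper names are hypothetical. The `resolve/<revision>/<file>` path is the standard Hugging Face file-download URL layout; everything else (validation rules, the default `main` revision) is an assumption for illustration.

```python
from urllib.parse import urlparse

# The two curated repos named on this page.
CURATED_REPOS = {
    "unsloth/gemma-4-E2B-it-GGUF",
    "unsloth/gemma-4-E4B-it-GGUF",
}


def repo_id_from_url(url: str) -> str:
    """Extract 'owner/name' from a pasted Hugging Face model repo URL."""
    parsed = urlparse(url.strip())
    if parsed.scheme != "https" or parsed.netloc != "huggingface.co":
        raise ValueError("expected an https://huggingface.co/ URL")
    parts = [p for p in parsed.path.split("/") if p]
    # Dataset and Space URLs carry an extra leading segment; reject them.
    if len(parts) < 2 or parts[0] in {"datasets", "spaces"}:
        raise ValueError("expected a model repository URL")
    return f"{parts[0]}/{parts[1]}"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct download URL for one file in a repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def discover_list(repo_ids: list[str]) -> list[str]:
    """Drop the curated repos so Discover only shows additional models."""
    return [r for r in repo_ids if r not in CURATED_REPOS]
```

Pasting `https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/tree/main` would yield the repo id `unsloth/gemma-4-E2B-it-GGUF`, regardless of any trailing path segments.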

Device tier matrix

The runtime planner picks n_threads, KV-cache K and V quantisation, and the recommended max context based on the RAM the device actually has. Recommended max context is what a typical chat will land at; the planner can go higher when free RAM permits, capped at the model's native maximum.

Device class	Device tier	n_threads	KV cache K type	KV cache V type	Light max context	Pro max context
Pixel 8 Pro (Tensor G3, 12 GB)	Flagship	6	f16	f16	~8K (native)	~8K (native)
iPhone 16 Pro (8 GB)	Flagship	6	f16	f16	~8K	~6–8K
Galaxy S24 (8–12 GB)	Flagship	6	f16	f16	~8K	~6–8K
Pixel 7a (Tensor G2, 8 GB)	Mid-range	4	q8_0	q8_0	~6–8K	~3–4K
iPhone 13 (4 GB)	Mid-range	5 (iOS)	q8_0	q8_0	~3–4K	not recommended
Galaxy A54 (6 GB)	Mid-range	4	q8_0	q8_0	~4–6K	~2–3K
4 GB low-end Android	Low-end	2	q8_0	q8_0	~1–2K	not recommended
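The per-tier thread count and KV-cache types in the matrix can be read as a simple lookup. The sketch below encodes only what the table states; how the planner assigns a device to a tier is not specified here, so the tier is taken as an input, and the type and function names are illustrative assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RuntimeSettings:
    n_threads: int
    kv_type_k: str
    kv_type_v: str


# Values copied from the device tier matrix.
_BY_TIER = {
    "flagship": RuntimeSettings(6, "f16", "f16"),
    "mid-range": RuntimeSettings(4, "q8_0", "q8_0"),
    "low-end": RuntimeSettings(2, "q8_0", "q8_0"),
}


def settings_for(tier: str, is_ios: bool = False) -> RuntimeSettings:
    """Look up thread count and KV-cache types for a device tier."""
    key = tier.lower()
    base = _BY_TIER[key]
    # The matrix lists 5 threads for the mid-range iOS device.
    if key == "mid-range" and is_ios:
        return replace(base, n_threads=5)
    return base
```

For instance, `settings_for("Mid-range", is_ios=True)` returns 5 threads with `q8_0` for both K and V, matching the iPhone 13 row.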

On flagships we keep the KV cache at f16 because RAM headroom is ample and quality matters more than effective context. On mid-range and low-end devices we drop to q8_0, which roughly halves the memory each cached token needs and so nearly doubles the context that fits in the same RAM, at roughly 3% perplexity cost on public benchmarks.
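The context gain from q8_0 follows from its storage layout in llama.cpp: values are stored in blocks of 32 one-byte integers plus one 2-byte f16 scale, so 34 bytes per 32 values, against 2 bytes per value for f16. The short calculation below works that out; the function names are illustrative.

```python
F16_BYTES_PER_VALUE = 2.0

# q8_0 block: 32 int8 values plus one f16 scale factor.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 32 * 1 + 2


def bytes_per_value(kv_type: str) -> float:
    """Average storage cost of one cached value for a KV-cache type."""
    if kv_type == "f16":
        return F16_BYTES_PER_VALUE
    if kv_type == "q8_0":
        return Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES
    raise ValueError(f"unsupported KV type: {kv_type}")


def context_gain(from_type: str, to_type: str) -> float:
    """How many times more tokens fit in the same KV-cache budget."""
    return bytes_per_value(from_type) / bytes_per_value(to_type)


print(round(context_gain("f16", "q8_0"), 2))  # prints 1.88
```

The factor is slightly under 2 because of the per-block scale, which is why the gain is best described as close to, rather than exactly, double.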

Why Gemma 4

Gemma 4 sits in a sweet spot for mobile inference: open weights, permissive licensing for distribution as a GGUF, strong instruction-following at small parameter counts, and a chat template fllama already implements. The thought channel that Tacita exposes as the reasoning trace is a Gemma 4 affordance, parsed by Tacita's own `<|channel>thought` handler.

Other architectures (Llama-3, Mistral, Phi-3, Qwen) work in llama.cpp but are not currently first-class in Tacita: the runtime planner is calibrated for Gemma's KV layout and the sampling presets are tuned for Gemma's behaviour. Pro's Discover surface is intentionally Gemma-only for this reason.

Update policy

Curated model URLs and hashes are pinned in the app and verified on download. Changing a quantisation or family also requires updating the per-token KV-cache size, the engine overhead bytes, and the native max context the runtime planner uses for sizing. None of this happens silently; a model swap rides on a regular app release.
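To show why those three numbers must change together with the model, here is one plausible way they could feed a context-size estimate. This is a sketch under stated assumptions, not the planner's actual formula: it assumes the model weights and engine overhead are subtracted from free RAM and the remainder is divided by the per-token KV-cache size, then capped at the native maximum. All parameter names are hypothetical.

```python
def max_context_tokens(
    free_ram_bytes: int,
    model_bytes: int,
    engine_overhead_bytes: int,
    kv_bytes_per_token: int,
    native_max_ctx: int,
) -> int:
    """Estimate how many tokens of context fit on this device.

    Assumes (for illustration) that weights and engine overhead both
    count against free RAM and that KV-cache cost is linear in tokens.
    """
    if kv_bytes_per_token <= 0:
        raise ValueError("kv_bytes_per_token must be positive")
    budget = free_ram_bytes - model_bytes - engine_overhead_bytes
    if budget <= 0:
        return 0  # not enough RAM to load the model at all
    return min(native_max_ctx, budget // kv_bytes_per_token)
```

With made-up inputs of 4 GiB free RAM, a 3 GiB model, 512 MiB of overhead, 256 KiB of KV cache per token, and a native maximum of 8192, the estimate is 2048 tokens. A stale per-token size or overhead figure would shift this result, which is why such values need to be updated alongside the model.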

Related: the privacy architecture page for the storage layer the model files sit on, the glossary for definitions of GGUF, quantisation, and KV cache, and the FAQ for direct answers to common model questions.