Models and Providers¶
Koharu uses both vision models and language models. The vision stack prepares the page; the language stack handles translation.
If you want the architecture-level view of how these pieces fit together, read Technical Deep Dive after this page.
Vision models¶
Koharu downloads required vision models automatically the first time you use them.
The current default stack includes:
- comic-text-bubble-detector for joint text-block and speech-bubble detection
- comic-text-detector for text segmentation masks
- PaddleOCR-VL-1.5 for text recognition (OCR)
- aot-inpainting for default inpainting
- YuzuMarker.FontDetection for font and color detection
Some models are pulled directly from their upstream Hugging Face repos; where Koharu needs a Rust-friendly bundle, it instead uses converted safetensors weights hosted on Hugging Face.
What each vision model is¶
| Model | Model type | Why Koharu uses it |
|---|---|---|
| comic-text-bubble-detector | object detector | finds text blocks and speech-bubble regions in one pass |
| comic-text-detector | segmentation network | produces a text mask for cleanup |
| PaddleOCR-VL-1.5 | vision-language model | reads cropped text into text tokens |
| aot-inpainting | inpainting network | reconstructs masked image regions after text removal |
| YuzuMarker.FontDetection | classifier / regressor | estimates font and style hints for rendering |
The important design choice is that Koharu does not use one model for every page task. Detection, segmentation, OCR, and inpainting all need different output shapes:
- joint detection wants text blocks and bubble regions
- segmentation wants per-pixel masks
- OCR wants text
- inpainting wants restored pixels
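The list above can be sketched as stage signatures. The types and function names below are hypothetical, for illustration only, and are not Koharu's actual API; the point is that each stage returns a fundamentally different shape of output.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A detected rectangle, labelled as a text block or a speech bubble."""
    x: int
    y: int
    w: int
    h: int
    kind: str  # "text-block" or "speech-bubble"

def detect(page) -> list[Region]:
    """Joint detection: labelled boxes, found in one pass."""
    return [Region(10, 20, 120, 60, "speech-bubble")]

def segment(page) -> list[list[int]]:
    """Segmentation: a per-pixel mask (1 = text, 0 = background)."""
    return [[0, 1], [1, 0]]

def ocr(crop) -> str:
    """OCR: recognized text from a cropped region."""
    return "こんにちは"

def inpaint(page, mask) -> list[list[int]]:
    """Inpainting: restored pixel values where the mask removed text."""
    return [[255, 255], [255, 255]]
```

Because boxes, masks, strings, and pixels are four different output types, no single model architecture covers all four stages well, which is why each stage gets its own model.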
Optional built-in alternatives¶
You can swap individual stages in Settings > Engines. Built-in alternatives include:
- PP-DocLayoutV3 as an alternative detector and layout-analysis engine
- speech-bubble-segmentation as a dedicated bubble detector
- Manga OCR and MIT 48px OCR as alternative OCR engines
- FLUX.2 Klein 4B as an optional FLUX.2-based inpainter
- lama-manga as an alternative inpainter
Local LLMs¶
Koharu supports local GGUF models through llama.cpp. These models run on your machine and are downloaded on demand when you select them in the LLM picker.
In practice, the local models are usually quantized decoder-only transformers. GGUF is the model format; llama.cpp is the inference runtime.
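As a rough rule of thumb, a quantized model's weight footprint is its parameter count times its average bits per weight. The sketch below uses approximately 5.5 bits per weight for Q5_K_M, which is an approximation rather than a GGUF spec figure, and it ignores runtime memory such as the KV cache:

```python
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate a quantized model's weight size in gigabytes.

    Ignores file metadata and runtime memory such as the KV cache,
    so treat the result as a lower bound on what must fit in memory.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Q5_K_M averages roughly 5.5 bits per weight (approximate figure;
# the 1.2B line assumes the same quant purely for comparison).
print(f"8B model at Q5_K_M:   ~{approx_weights_gb(8, 5.5):.1f} GB")
print(f"1.2B model at Q5_K_M: ~{approx_weights_gb(1.2, 5.5):.2f} GB")
```

This is why the smaller options in the lists below are recommended for low-memory systems: a 1.2B model's weights fit where an 8B Q5_K_M model needs several gigabytes.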
Translation-focused built-in local models for English output¶
- vntl-llama3-8b-v2: a Q5_K_M GGUF, best when translation quality matters most
- lfm2.5-1.2b-instruct: a smaller multilingual instruct option for low-memory systems or faster iteration
- sugoi-14b-ultra and sugoi-32b-ultra: larger translation-oriented choices when you want more headroom
Translation-focused built-in local models for Chinese output¶
- sakura-galtransl-7b-v3.7: a balanced choice for quality and speed on 8 GB class GPUs
- sakura-1.5b-qwen2.5-v1.0: a lighter option for mid-range or CPU-heavy setups
Translation-focused built-in local model for broader language coverage¶
- hunyuan-mt-7b: a multi-language option with moderate hardware requirements
Other built-in local model families¶
The local picker also includes general-purpose families that are not translation-specific:
- Gemma 4 instruct: `gemma4-e2b-it`, `gemma4-e4b-it`, `gemma4-26b-a4b-it`, `gemma4-31b-it`
- Gemma 4 uncensored: `gemma4-e2b-uncensored`, `gemma4-e4b-uncensored`
- Qwen 3.5: `qwen3.5-0.8b`, `qwen3.5-2b`, `qwen3.5-4b`, `qwen3.5-9b`, `qwen3.5-27b`, `qwen3.5-35b-a3b`
- Qwen 3.5 uncensored: `qwen3.5-2b-uncensored`, `qwen3.5-4b-uncensored`, `qwen3.5-9b-uncensored`, `qwen3.5-27b-uncensored`, `qwen3.5-35b-a3b-uncensored`
- Qwen 3.6: `qwen3.6-27b`, `qwen3.6-35b-a3b`
- Qwen 3.6 uncensored: `qwen3.6-27b-uncensored`, `qwen3.6-35b-a3b-uncensored`
Remote providers¶
Koharu can also translate through remote or self-hosted APIs instead of downloading a local model.
Supported provider families are:
- LLM-backed: OpenAI, Gemini, Claude, DeepSeek, plus any OpenAI-compatible endpoint that exposes `/v1/models` and `/v1/chat/completions` (LM Studio, OpenRouter, vLLM, etc.)
- Machine-translation: DeepL, Google Cloud Translation, Caiyun
Machine-translation providers are pure translation services rather than chat models. They take source text and a target language, and return a translation; there is no system prompt and no model picker.
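To make the contrast concrete, here is a sketch of the two request shapes. Field names follow the public DeepL and OpenAI-style chat APIs; Koharu's internal wiring is an implementation detail and may differ:

```python
# Machine translation: source text plus a target language. There is no
# system prompt and no model picker; the service decides how to translate.
mt_request = {
    "text": ["こんにちは"],   # DeepL-style: a list of source strings
    "target_lang": "EN",
}

# LLM-backed: a model name plus a chat transcript, where a system prompt
# can steer style, honorifics, glossary use, and so on.
chat_request = {
    "model": "gpt-4o",  # hypothetical pick; any model the endpoint exposes
    "messages": [
        {"role": "system", "content": "Translate the user's text into English."},
        {"role": "user", "content": "こんにちは"},
    ],
}
```

The absence of a `messages` field is the practical difference: with a machine-translation provider there is nothing to prompt, so Koharu's prompt-related settings only apply to LLM-backed providers.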
Current built-in remote LLM models¶
The built-in catalog for LLM-backed providers includes:
- OpenAI: GPT-5.5, GPT-5.4, GPT-5.x, GPT-4.1, o-series, GPT-4o, and legacy GPT chat models
- Gemini: Gemini 3.1, Gemini 3, Gemini 2.5, Gemini 2.0 text-output models, plus Gemma 4 hosted through the Gemini API
- Claude: current Claude Opus, Sonnet, and Haiku 4.x models, plus deprecated Claude 4 snapshots that remain available until their upstream retirement dates
- DeepSeek: DeepSeek V4 Flash, DeepSeek V4 Pro, and the `deepseek-chat` / `deepseek-reasoner` compatibility aliases
- OpenAI-compatible APIs: models are discovered dynamically from the configured endpoint
Machine-translation providers¶
| Provider | What you need | Notes |
|---|---|---|
| DeepL | DeepL API key | Optional custom base URL for DeepL Pro vs. Free endpoints |
| Google Cloud Translation | Google Cloud API key | Uses the v2 REST endpoint |
| Caiyun | Caiyun token | Limited target-language coverage |
Remote providers are configured in Settings > API Keys.
For a step-by-step setup guide for LM Studio, OpenRouter, and similar endpoints, see Use OpenAI-Compatible APIs.
Codex image generation¶
Koharu can also use Codex for end-to-end image-to-image generation. Instead of translating text blocks and rendering text locally as separate steps, this workflow sends the source page image and prompt to Codex and receives a generated page image.
This is a remote image-generation workflow, not a local model. It requires a ChatGPT account with Codex access and two-factor authentication enabled before device-code login can complete. See Use Codex Image Generation for usage notes and caveats.
Choosing between local and remote¶
Use local models when you want:
- the most private setup
- offline operation after downloads complete
- tighter control over hardware usage
Use remote providers when you want:
- to avoid large local model downloads
- to reduce local VRAM or RAM usage
- to connect to a hosted or self-managed model service
Note
When you use a remote provider, Koharu sends the OCR text selected for translation to the provider you configured.
Background reading¶
For background theory behind the model categories on this page, see: