Models and Providers¶
Koharu uses both vision models and language models. The vision stack prepares the page; the language stack handles translation.
If you want the architecture-level view of how these pieces fit together, read Technical Deep Dive after this page.
Vision models¶
Koharu downloads required vision models automatically the first time you use them.
The current default stack includes:
- comic-text-bubble-detector for joint text-block and speech-bubble detection
- comic-text-detector for text segmentation masks
- PaddleOCR-VL-1.5 for text recognition (OCR)
- aot-inpainting for default inpainting
- YuzuMarker.FontDetection for font and color detection
Some models are pulled directly from their upstream Hugging Face repos; where Koharu needs a Rust-friendly bundle, it instead uses converted safetensors weights hosted on Hugging Face.
What each vision model is¶
| Model | Model type | Why Koharu uses it |
|---|---|---|
| comic-text-bubble-detector | object detector | finds text blocks and speech-bubble regions in one pass |
| comic-text-detector | segmentation network | produces a text mask for cleanup |
| PaddleOCR-VL-1.5 | vision-language model | reads cropped text into text tokens |
| aot-inpainting | inpainting network | reconstructs masked image regions after text removal |
| YuzuMarker.FontDetection | classifier / regressor | estimates font and style hints for rendering |
The important design choice is that Koharu does not use one model for every page task. Detection, segmentation, OCR, and inpainting all need different output shapes:
- joint detection wants text blocks and bubble regions
- segmentation wants per-pixel masks
- OCR wants text
- inpainting wants restored pixels
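The list above can be sketched as stage signatures. The types and function names below are hypothetical, for illustration only, and are not Koharu's actual API; the point is that each stage returns a fundamentally different shape of output.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A detected rectangle, labelled as a text block or a speech bubble."""
    x: int
    y: int
    w: int
    h: int
    kind: str  # "text-block" or "speech-bubble"

def detect(page) -> list[Region]:
    """Joint detection: labelled boxes, found in one pass."""
    return [Region(10, 20, 120, 60, "speech-bubble")]

def segment(page) -> list[list[int]]:
    """Segmentation: a per-pixel mask (1 = text, 0 = background)."""
    return [[0, 1], [1, 0]]

def ocr(crop) -> str:
    """OCR: recognized text from a cropped region."""
    return "こんにちは"

def inpaint(page, mask) -> list[list[int]]:
    """Inpainting: restored pixel values where the mask removed text."""
    return [[255, 255], [255, 255]]
```

Because boxes, masks, strings, and pixels are four different output types, no single model architecture covers all four stages well, which is why each stage gets its own model.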
Optional built-in alternatives¶
You can swap individual stages in Settings > Engines. Built-in alternatives include:
- PP-DocLayoutV3 as an alternative detector and layout-analysis engine
- speech-bubble-segmentation as a dedicated bubble detector
- Manga OCR and MIT 48px OCR as alternative OCR engines
- FLUX.2 Klein 4B as an optional FLUX.2-based inpainter
- lama-manga as an alternative inpainter
Local LLMs¶
Koharu supports local GGUF models through llama.cpp. These models run on your machine and are downloaded on demand when you select them in the LLM picker.
In practice, the local models are usually quantized decoder-only transformers. GGUF is the model format; llama.cpp is the inference runtime.
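As a rough rule of thumb, a quantized model's weight footprint is its parameter count times its average bits per weight. The sketch below uses approximately 5.5 bits per weight for Q5_K_M, which is an approximation rather than a GGUF spec figure, and it ignores runtime memory such as the KV cache:

```python
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate a quantized model's weight size in gigabytes.

    Ignores file metadata and runtime memory such as the KV cache,
    so treat the result as a lower bound on what must fit in memory.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Q5_K_M averages roughly 5.5 bits per weight (approximate figure;
# the 1.2B line assumes the same quant purely for comparison).
print(f"8B model at Q5_K_M:   ~{approx_weights_gb(8, 5.5):.1f} GB")
print(f"1.2B model at Q5_K_M: ~{approx_weights_gb(1.2, 5.5):.2f} GB")
```

This is why the smaller options in the lists below are recommended for low-memory systems: a 1.2B model's weights fit where an 8B Q5_K_M model needs several gigabytes.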
Translation-focused built-in local models for English output¶
- vntl-llama3-8b-v2: a Q5_K_M GGUF, best when translation quality matters most
- lfm2.5-1.2b-instruct: a smaller multilingual instruct option for low-memory systems or faster iteration
- sugoi-14b-ultra and sugoi-32b-ultra: larger translation-oriented choices when you want more headroom
Translation-focused built-in local models for Chinese output¶
- sakura-galtransl-7b-v3.7: a balanced choice for quality and speed on 8 GB class GPUs
- sakura-1.5b-qwen2.5-v1.0: a lighter option for mid-range or CPU-heavy setups
Translation-focused built-in local model for broader language coverage¶
- hunyuan-mt-7b: a multi-language option with moderate hardware requirements
Other built-in local model families¶
The local picker also includes general-purpose families that are not translation-specific:
- Gemma 4 instruct: `gemma4-e2b-it`, `gemma4-e4b-it`, `gemma4-26b-a4b-it`, `gemma4-31b-it`
- Gemma 4 uncensored: `gemma4-e2b-uncensored`, `gemma4-e4b-uncensored`
- Qwen 3.5: `qwen3.5-0.8b`, `qwen3.5-2b`, `qwen3.5-4b`, `qwen3.5-9b`, `qwen3.5-27b`, `qwen3.5-35b-a3b`
- Qwen 3.5 uncensored: `qwen3.5-2b-uncensored`, `qwen3.5-4b-uncensored`, `qwen3.5-9b-uncensored`, `qwen3.5-27b-uncensored`, `qwen3.5-35b-a3b-uncensored`
- Qwen 3.6: `qwen3.6-27b`, `qwen3.6-35b-a3b`
- Qwen 3.6 uncensored: `qwen3.6-27b-uncensored`, `qwen3.6-35b-a3b-uncensored`
Remote providers¶
Koharu can also translate through remote or self-hosted APIs instead of downloading a local model.
Supported provider families are:
- LLM-backed: OpenAI, Gemini, Claude, DeepSeek, plus any OpenAI-compatible endpoint that exposes `/v1/models` and `/v1/chat/completions` (LM Studio, OpenRouter, vLLM, etc.)
- Machine-translation: DeepL, Google Cloud Translation, Caiyun
Machine-translation providers are pure translation services rather than chat models. They take source text and a target language, and return a translation; there is no system prompt and no model picker.
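To make the contrast concrete, here is a sketch of the two request shapes. Field names follow the public DeepL and OpenAI-style chat APIs; Koharu's internal wiring is an implementation detail and may differ:

```python
# Machine translation: source text plus a target language. There is no
# system prompt and no model picker; the service decides how to translate.
mt_request = {
    "text": ["こんにちは"],   # DeepL-style: a list of source strings
    "target_lang": "EN",
}

# LLM-backed: a model name plus a chat transcript, where a system prompt
# can steer style, honorifics, glossary use, and so on.
chat_request = {
    "model": "gpt-4o",  # hypothetical pick; any model the endpoint exposes
    "messages": [
        {"role": "system", "content": "Translate the user's text into English."},
        {"role": "user", "content": "こんにちは"},
    ],
}
```

The absence of a `messages` field is the practical difference: with a machine-translation provider there is nothing to prompt, so Koharu's prompt-related settings only apply to LLM-backed providers.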
Current built-in remote LLM models¶
The built-in catalog for LLM-backed providers includes:
- OpenAI: GPT-5.5, GPT-5.4, GPT-5.x, GPT-4.1, o-series, GPT-4o, and legacy GPT chat models
- Gemini: Gemini 3.1, Gemini 3, Gemini 2.5, Gemini 2.0 text-output models, plus Gemma 4 hosted through the Gemini API
- Claude: current Claude Opus, Sonnet, and Haiku 4.x models, plus deprecated Claude 4 snapshots that remain available until their upstream retirement dates
- DeepSeek: DeepSeek V4 Flash, DeepSeek V4 Pro, and the `deepseek-chat` / `deepseek-reasoner` compatibility aliases
- OpenAI-compatible APIs: models are discovered dynamically from the configured endpoint
Machine-translation providers¶
| Provider | What you need | Notes |
|---|---|---|
| DeepL | DeepL API key | Optional custom base URL for DeepL Pro vs. Free endpoints |
| Google Cloud Translation | Google Cloud API key | Uses the v2 REST endpoint |
| Caiyun | Caiyun token | Limited target-language coverage |
Remote providers are configured in Settings > API Keys.
For a step-by-step setup guide for LM Studio, OpenRouter, and similar endpoints, see Use OpenAI-Compatible APIs.
Codex image generation¶
Koharu can also use Codex for end-to-end image-to-image generation. Instead of translating text blocks and rendering text locally as separate steps, this workflow sends the source page image and prompt to Codex and receives a generated page image.
This is a remote image-generation workflow, not a local model. It requires a ChatGPT account with Codex access and two-factor authentication enabled before device-code login can complete. See Use Codex Image Generation for usage notes and caveats.
Choosing between local and remote¶
Use local models when you want:
- the most private setup
- offline operation after downloads complete
- tighter control over hardware usage
Use remote providers when you want:
- to avoid large local model downloads
- to reduce local VRAM or RAM usage
- to connect to a hosted or self-managed model service
Note
When you use a remote provider, Koharu sends the OCR text selected for translation to the provider you configured.
Background reading¶
For background theory behind the model categories on this page, see: