# Models and Providers
Koharu uses both vision models and language models. The vision stack prepares the page; the language stack handles translation.
If you want the architecture-level explanation of how these pieces fit together, read Technical Deep Dive after this page.
## Vision models
Koharu automatically downloads the required vision models when you use them for the first time.
The default stack includes:
- PP-DocLayoutV3 for text detection and layout analysis
- comic-text-detector for text segmentation
- PaddleOCR-VL-1.5 for text recognition (OCR)
- lama-manga for inpainting
- YuzuMarker.FontDetection for font and color detection
Converted model weights are hosted on Hugging Face in safetensors format for Rust compatibility and performance.
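The safetensors container is simple enough to sketch: an 8-byte little-endian header length, then that many bytes of JSON metadata describing each tensor, then the raw tensor data. The parser below and its synthetic blob are illustrative only, not Koharu's loader.

```python
import json
import struct

def read_safetensors_header(blob: bytes) -> dict:
    """Parse the JSON header of a safetensors blob.

    Layout: 8-byte little-endian header length, then that many
    bytes of JSON metadata, then the raw tensor data.
    """
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    return json.loads(blob[8 : 8 + header_len])

# Build a tiny synthetic blob to demonstrate the layout.
meta = {"weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
header_bytes = json.dumps(meta).encode("utf-8")
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + b"\x00" * 16

parsed = read_safetensors_header(blob)
print(parsed["weight"]["shape"])  # → [2, 2]
```

Because the header is plain JSON at a fixed offset, a Rust loader can map tensor names to byte ranges without executing any code from the file, which is part of why the format suits this use.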
### What each vision model is

| Model | Model type | Why Koharu uses it |
|---|---|---|
| PP-DocLayoutV3 | layout detector | finds text-like regions and reading order |
| comic-text-detector | segmentation network | produces a text mask for cleanup |
| PaddleOCR-VL-1.5 | vision-language model | reads cropped text into text tokens |
| lama-manga | inpainting network | reconstructs the image after text removal |
| YuzuMarker.FontDetection | classifier / regressor | estimates font and style hints for rendering |
The important design choice is that Koharu does not use a single model for every page task. Layout, segmentation, OCR, and inpainting all need different output shapes:
- layout wants regions and order
- segmentation wants per-pixel masks
- OCR wants text
- inpainting wants restored pixels
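One way to see why a single model cannot cover the pipeline is to write the four output shapes down as types. These dataclasses are hypothetical, named here only for illustration; none of them come from Koharu itself.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """Layout output: a text-like box plus its reading order."""
    x: int
    y: int
    w: int
    h: int
    order: int

@dataclass
class TextMask:
    """Segmentation output: a per-pixel mask, flattened row-major."""
    width: int
    height: int
    pixels: bytes  # one byte per pixel: 0 = keep, 255 = text

@dataclass
class OcrResult:
    """OCR output: plain text tied back to its source region."""
    region: Region
    text: str

@dataclass
class InpaintedPage:
    """Inpainting output: restored RGB pixels for the whole page."""
    width: int
    height: int
    rgb: bytes

region = Region(x=10, y=20, w=120, h=40, order=0)
ocr = OcrResult(region=region, text="こんにちは")
print(ocr.text)  # → こんにちは
```

Regions, masks, strings, and pixel buffers are structurally different outputs, which is why each stage gets a model trained for exactly that shape.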
## Local LLMs
Koharu supports local GGUF models through llama.cpp. These models run on your machine and are downloaded on demand when you select them in the LLM picker.
In practice, the local models are usually quantized decoder-only transformers. GGUF is the file format; llama.cpp is the inference runtime.
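The GGUF container itself is easy to recognize: files begin with the 4-byte ASCII magic `GGUF` followed by a little-endian format version. The sniffer below is a sketch against a synthetic header, not Koharu's or llama.cpp's actual loader.

```python
import struct

GGUF_MAGIC = b"GGUF"

def sniff_gguf(header: bytes) -> tuple[bool, int]:
    """Return (is_gguf, version) from the first 8 bytes of a file.

    A GGUF file starts with the ASCII magic b"GGUF" followed by a
    little-endian u32 format version.
    """
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return (False, 0)
    (version,) = struct.unpack_from("<I", header, 4)
    return (True, version)

# Synthetic 8-byte header standing in for a real model file.
ok, version = sniff_gguf(GGUF_MAGIC + struct.pack("<I", 3))
print(ok, version)  # → True 3
```

The rest of a GGUF file packs metadata and quantized tensor data in one self-describing blob, which is what lets llama.cpp load a model from a single downloaded file.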
### Suggested local models for English output
- vntl-llama3-8b-v2: around 8.5 GB in Q8_0 form, best when translation quality matters most
- lfm2-350m-enjp-mt: very small and useful for low-memory systems or quick previews
### Suggested local models for Chinese output
- sakura-galtransl-7b-v3.7: a balanced choice for quality and speed on 8 GB class GPUs
- sakura-1.5b-qwen2.5-v1.0: a lighter option for mid-range or CPU-heavy setups
### Suggested local model for broader language coverage
- hunyuan-7b-mt-v1.0: a multi-language option with moderate hardware requirements
## Remote providers
Koharu can translate through remote or self-hosted APIs instead of downloading a local model.
Supported providers include:
- OpenAI
- Gemini
- Claude
- DeepSeek
- OpenAI-compatible APIs such as LM Studio, OpenRouter, or any endpoint that exposes `/v1/models` and `/v1/chat/completions`
Remote providers are configured in Settings > API Keys.
For a step-by-step setup guide for LM Studio, OpenRouter, and similar endpoints, see Use OpenAI-Compatible APIs.
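The shape of a chat-completions request is the same for every OpenAI-compatible endpoint. The sketch below builds one with the standard library; the base URL, model name, and API key are placeholders, and the request is deliberately not sent so the example runs without a live server.

```python
import json
import urllib.request

# Placeholder values: point base_url at your own endpoint
# (e.g. a local LM Studio server) and pick a model it serves.
base_url = "http://localhost:1234/v1"
payload = {
    "model": "your-model-name",
    "messages": [
        {"role": "system", "content": "Translate Japanese to English."},
        {"role": "user", "content": "こんにちは"},
    ],
}
req = urllib.request.Request(
    url=f"{base_url}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer sk-placeholder",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here so the
# sketch stays runnable without a live endpoint.
print(req.full_url)  # → http://localhost:1234/v1/chat/completions
```

Any server that accepts this request shape at `/v1/chat/completions` and lists its models at `/v1/models` can be used as a provider.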
## Choosing between local and remote
Use local models when you want:
- the most private setup
- offline operation after downloads complete
- tighter control over hardware usage
Use remote providers when you want:
- to avoid large local model downloads
- to reduce local VRAM or RAM usage
- to connect to a hosted or self-managed model service
> **Note:** When you use a remote provider, Koharu sends the OCR text selected for translation to the provider you configured.
## Background reading
For theory and diagrams behind the model categories on this page, see: