
Models and Providers

Koharu uses both vision models and language models. The vision stack prepares the page; the language stack handles translation.

If you want the architecture-level view of how these pieces fit together, read Technical Deep Dive after this page.

Vision models

Koharu downloads required vision models automatically the first time you use them.

The current default stack is listed in the table below.

Some models are used directly from their upstream Hugging Face repos; others are converted to safetensors and hosted on Hugging Face where Koharu needs a Rust-friendly bundle.

What each vision model is

  • comic-text-bubble-detector (object detector): finds text blocks and speech bubble regions in one pass
  • comic-text-detector (segmentation network): produces a text mask for cleanup
  • PaddleOCR-VL-1.5 (vision-language model): reads cropped text into text tokens
  • aot-inpainting (inpainting network): reconstructs masked image regions after text removal
  • YuzuMarker.FontDetection (classifier / regressor): estimates font and style hints for rendering

The important design choice is that Koharu does not use one model for every page task. Detection, segmentation, OCR, and inpainting all need different output shapes:

  • joint detection wants text blocks and bubble regions
  • segmentation wants per-pixel masks
  • OCR wants text
  • inpainting wants restored pixels
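The different output shapes above can be sketched as plain data types. This is an illustrative Python sketch, not Koharu's actual (Rust) internals; all names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DetectedRegion:
    """Joint-detection output: one text block or bubble bounding box."""
    x1: float
    y1: float
    x2: float
    y2: float
    kind: str  # "text_block" or "bubble"

@dataclass
class TextMask:
    """Segmentation output: a per-pixel mask, stored row-major."""
    width: int
    height: int
    pixels: list[int]  # 0 = keep pixel, 1 = text to remove

@dataclass
class OcrResult:
    """OCR output: the text read from one cropped region."""
    region: DetectedRegion
    text: str

# Each stage consumes the previous stage's output:
region = DetectedRegion(10, 20, 110, 80, kind="bubble")
ocr = OcrResult(region=region, text="こんにちは")
```

No single model family produces all four of these shapes well, which is why each stage gets its own specialized network.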

Optional built-in alternatives

You can swap individual stages in Settings > Engines, where each stage lists its built-in alternatives.

Local LLMs

Koharu supports local GGUF models through llama.cpp. These models run on your machine and are downloaded on demand when you select them in the LLM picker.

In practice, the local models are usually quantized decoder-only transformers. GGUF is the model format; llama.cpp is the inference runtime.
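To make the format/runtime split concrete, here is a minimal sketch of driving a GGUF model for translation. The prompt template and model filename are assumptions, not Koharu's actual ones, and Koharu embeds llama.cpp directly rather than using the Python bindings shown in the comment:

```python
# Sketch: how a quantized decoder-only GGUF model might be prompted
# for translation. Template and names are illustrative only.

def build_translation_prompt(source_text: str, target_lang: str) -> str:
    """Simple instruction prompt for a decoder-only chat model."""
    return (
        f"Translate the following text into {target_lang}. "
        f"Reply with the translation only.\n\n{source_text}"
    )

prompt = build_translation_prompt("こんにちは", "English")

# With llama-cpp-python installed and a GGUF file downloaded, inference
# would look roughly like this (needs the model file, so not run here):
#
#   from llama_cpp import Llama
#   llm = Llama(model_path="hunyuan-mt-7b.Q4_K_M.gguf")  # hypothetical filename
#   out = llm.create_chat_completion(
#       messages=[{"role": "user", "content": prompt}]
#   )
#   print(out["choices"][0]["message"]["content"])
```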

Translation-focused built-in local models for English output

Translation-focused built-in local models for Chinese output

Translation-focused built-in local model for broader language coverage

  • hunyuan-mt-7b: a multi-language option with moderate hardware requirements

Other built-in local model families

The local picker also includes general-purpose families that are not translation-specific:

  • Gemma 4 instruct: gemma4-e2b-it, gemma4-e4b-it, gemma4-26b-a4b-it, gemma4-31b-it
  • Gemma 4 uncensored: gemma4-e2b-uncensored, gemma4-e4b-uncensored
  • Qwen 3.5: qwen3.5-0.8b, qwen3.5-2b, qwen3.5-4b, qwen3.5-9b, qwen3.5-27b, qwen3.5-35b-a3b
  • Qwen 3.5 uncensored: qwen3.5-2b-uncensored, qwen3.5-4b-uncensored, qwen3.5-9b-uncensored, qwen3.5-27b-uncensored, qwen3.5-35b-a3b-uncensored
  • Qwen 3.6: qwen3.6-27b, qwen3.6-35b-a3b
  • Qwen 3.6 uncensored: qwen3.6-27b-uncensored, qwen3.6-35b-a3b-uncensored

Remote providers

Koharu can also translate through remote or self-hosted APIs instead of downloading a local model.

Supported provider families are:

  • LLM-backed: OpenAI, Gemini, Claude, DeepSeek, plus any OpenAI-compatible endpoint that exposes /v1/models and /v1/chat/completions (LM Studio, OpenRouter, vLLM, etc.)
  • Machine-translation: DeepL, Google Cloud Translation, Caiyun

Machine-translation providers are pure translation services rather than chat models. They take source text and a target language, and return a translation; there is no system prompt and no model picker.
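The difference between the two provider families can be sketched as request payloads. The LLM side follows the standard OpenAI chat-completions shape; the MT side uses DeepL's v2 field names as an example. The model name and prompt text are placeholders, not Koharu's actual defaults:

```python
import json

# LLM-backed provider: a chat request with a system prompt and a model pick.
llm_request = {
    "model": "gpt-5.5",  # any model the endpoint reports via /v1/models
    "messages": [
        {"role": "system", "content": "Translate manga dialogue into English."},
        {"role": "user", "content": "こんにちは"},
    ],
}

# Machine-translation provider (DeepL v2 shown): just source text plus a
# target language. No system prompt, no model field.
mt_request = {
    "text": ["こんにちは"],
    "target_lang": "EN",
}

# The chat request is POSTed to {base_url}/v1/chat/completions;
# the DeepL request to https://api.deepl.com/v2/translate.
print(json.dumps(llm_request, ensure_ascii=False, indent=2))
```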

Current built-in remote LLM models

The built-in catalog for LLM-backed providers includes:

  • OpenAI: GPT-5.5, GPT-5.4, GPT-5.x, GPT-4.1, o-series, GPT-4o, and legacy GPT chat models
  • Gemini: Gemini 3.1, Gemini 3, Gemini 2.5, Gemini 2.0 text-output models, plus Gemma 4 hosted through the Gemini API
  • Claude: current Claude Opus, Sonnet, and Haiku 4.x models, plus deprecated Claude 4 snapshots that remain available until their upstream retirement dates
  • DeepSeek: DeepSeek V4 Flash, DeepSeek V4 Pro, and the deepseek-chat / deepseek-reasoner compatibility aliases
  • OpenAI-compatible APIs: models are discovered dynamically from the configured endpoint

Machine-translation providers

  • DeepL: needs a DeepL API key; optional custom base URL for DeepL Pro vs. Free endpoints
  • Google Cloud Translation: needs a Google Cloud API key; uses the v2 REST endpoint
  • Caiyun: needs a Caiyun token; limited target-language coverage
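As a concrete example of an MT request, a Google Cloud Translation v2 call can be sketched as a URL with the documented query fields (`key`, `q`, `target`, `format`); the API key below is a placeholder:

```python
from urllib.parse import urlencode

# Google Cloud Translation v2 REST call, sketched as URL + query fields.
API_KEY = "YOUR_GOOGLE_CLOUD_API_KEY"  # placeholder, not a real key

endpoint = "https://translation.googleapis.com/language/translate/v2"
params = {
    "key": API_KEY,
    "q": "こんにちは",
    "target": "en",
    "format": "text",
}

url = f"{endpoint}?{urlencode(params)}"
# An HTTP POST to this URL returns JSON shaped like:
# {"data": {"translations": [{"translatedText": "...",
#                             "detectedSourceLanguage": "ja"}]}}
```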

Remote providers are configured in Settings > API Keys.

For a step-by-step setup guide for LM Studio, OpenRouter, and similar endpoints, see Use OpenAI-Compatible APIs.

Codex image generation

Koharu can also use Codex for end-to-end image-to-image generation. Instead of translating text blocks and rendering text locally as separate steps, this workflow sends the source page image and prompt to Codex and receives a generated page image.

This is a remote image-generation workflow, not a local model. It requires a ChatGPT account with Codex access, and two-factor authentication must be enabled before device-code login can complete. See Use Codex Image Generation for usage notes and caveats.

Choosing between local and remote

Use local models when you want:

  • the most private setup
  • offline operation after downloads complete
  • tighter control over hardware usage

Use remote providers when you want:

  • to avoid large local model downloads
  • to reduce local VRAM or RAM usage
  • to connect to a hosted or self-managed model service

Note

When you use a remote provider, Koharu sends the OCR text you select for translation to the provider you configured.

Background reading

For background theory behind the model categories on this page, see: