I’ve been running Claude Code as my daily driver for software engineering work. It’s powerful, but typing every command gets tedious. I wanted voice interaction — speak to Claude, hear it respond — but without sending my audio to the cloud. Here’s how I got a fully local voice pipeline running on my ThinkPad X1 Carbon with RHEL 10, including GPU acceleration on the Intel Meteor Lake integrated graphics.
Also, this blog serves as a strange combination of learning, experimentation, and something you can point your agent at to figure out the best way to set this up yourself on your own Linux machine (RHEL 10 in my case). This was one of the more uncomfortable endeavors I’ve taken on since adopting agentic workflows and Claude so heavily. I had to accept that it was getting certain things to work before I fully understood them. It literally got things to the point where I could speak into the computer and it could respond to me before I fully understood the architecture or what software was being used.
I even included a timeline so that you can see how much experimentation and learning we did. My conclusion is that we must change how we think about building and configuring software. The process has always been non-deterministic; this latest iteration of technology just forces us to accept it.
The Goal
Run VoiceMode MCP (Model Context Protocol) with Claude Code using entirely local speech processing:
- Speech-to-Text (STT): Whisper.cpp — OpenAI’s Whisper model compiled to run natively on CPU
- Text-to-Speech (TTS): Kokoro — a local neural TTS engine
- No cloud dependencies: No OpenAI API key, no audio leaving my machine
What is MCP?
MCP (Model Context Protocol) is an open standard that lets AI assistants like Claude connect to external tools and data sources. Think of it like USB for AI — a standardized way to plug capabilities into your AI assistant. VoiceMode is an MCP server that gives Claude the ability to hear and speak through your computer’s microphone and speakers.
What is VoiceMode?
VoiceMode (v8.2.0) is an MIT-licensed MCP server that adds voice conversation capabilities to Claude Code. It supports both cloud processing (via OpenAI’s API) and fully local processing using:
- Whisper.cpp for speech-to-text (STT) on port 2022
- Kokoro for text-to-speech (TTS) on port 8880
For privacy-sensitive work, the local stack is ideal — all audio processing stays on your device.
The Architecture
Here’s how the pieces fit together:
You speak into mic
|
v
[Microphone] --> [VoiceMode MCP Server] --> [Whisper.cpp :2022]
|
Speech-to-Text
|
v
[Claude Code CLI]
|
Claude responds
|
v
[VoiceMode MCP Server] --> [Kokoro :8880]
|
Text-to-Speech
|
v
[Speakers]
|
You hear it
The VoiceMode MCP server orchestrates the whole flow. When Claude calls the converse tool, it:
- Sends text to Kokoro for speech synthesis (TTS)
- Plays the audio through your speakers
- Records from your microphone
- Sends the recording to Whisper for transcription (STT)
- Returns the transcribed text back to Claude
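To make this concrete, here is roughly what the two speech hops look like as raw HTTP calls. This is a sketch, not VoiceMode’s actual code: I’m assuming both local servers expose the OpenAI-compatible audio endpoints that their /v1 base URLs suggest, and the voice name and file names are placeholders.
# TTS: ask Kokoro to synthesize a line of text (voice name is a placeholder)
curl -s http://127.0.0.1:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro", "input": "Hello from Claude", "voice": "af_sky", "response_format": "wav"}' \
  -o reply.wav
# STT: send a microphone recording to Whisper for transcription
curl -s http://127.0.0.1:2022/v1/audio/transcriptions \
  -F file=@recording.wav -F model=whisper-1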
Two Models, Two Jobs
A common point of confusion: Whisper and Kokoro are completely different models that do opposite jobs. They share nothing — different authors, different architectures, different frameworks, different purposes. Whisper listens; Kokoro speaks.
|  | Whisper.cpp (STT) | Kokoro (TTS) |
|---|---|---|
| Task | Speech-to-Text | Text-to-Speech |
| Input | Your voice (audio from mic) | Claude’s text response |
| Output | Transcribed text | Spoken audio (to speakers) |
| Model origin | OpenAI Whisper | Kokoro-82M v1.0 |
| Parameters | 74M (base model) | 82M |
| Framework | C++ (ggml) | Python (PyTorch) |
| Architecture | Encoder-decoder transformer | StyleTTS2-based |
| Port | 2022 | 8880 |
| Speed on CPU | ~1.5s (fast) | ~60s (painfully slow) |
The speed difference on CPU is what drove the whole GPU acceleration effort in this post. Whisper.cpp is already compiled to efficient C++ and runs fine on CPU. Kokoro runs through Python and PyTorch, where CPU inference is much slower — making it the bottleneck that needed GPU help.
Setting It Up on RHEL 10 Bootc
I run RHEL 10 as a bootable container (bootc) image. This means my OS is an immutable container image pulled from a registry. Package changes go into a Containerfile, get built, pushed to quay.io, and then applied with bootc upgrade.
Step 1: Add Dependencies to the Containerfile
I added these packages to my Containerfile for VoiceMode support:
Build dependencies for Whisper.cpp:
python3-devel
alsa-lib-devel
portaudio-devel
cmake
gcc-c++
SDL2-devel
ffmpeg-free
Vulkan GPU acceleration (for Whisper.cpp):
vulkan-headers
vulkan-loader-devel
glslang
Intel GPU compute drivers (for Kokoro TTS):
intel-level-zero
intel-opencl
oneapi-level-zero
I also added uv tool install voice-mode to the Python tools section of the Containerfile to bake VoiceMode into the image.
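Pieced together, the relevant fragment of the Containerfile looks roughly like this (a sketch; my actual Containerfile organizes the packages differently):
# Build deps, Vulkan, and Intel GPU compute drivers in one layer
RUN dnf install -y \
      python3-devel alsa-lib-devel portaudio-devel cmake gcc-c++ SDL2-devel ffmpeg-free \
      vulkan-headers vulkan-loader-devel glslang \
      intel-level-zero intel-opencl oneapi-level-zero && \
    dnf clean all
# Bake VoiceMode into the image alongside the other Python tools
RUN uv tool install voice-mode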
Step 2: Build, Push, and Reboot
cd ~/Projects/rhel10-bootc
podman build -t quay.io/fatherlinux/rhel10-bootc:latest .
podman push quay.io/fatherlinux/rhel10-bootc:latest
sudo bootc upgrade
sudo systemctl reboot
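Once the machine is back up, bootc status is a quick way to confirm the new image is the booted deployment:
# The booted entry should show the image you just pushed
sudo bootc status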
Step 3: Install the Voice Services
After rebooting onto the new image:
# Install local Whisper STT
voicemode service install whisper
# Install local Kokoro TTS
voicemode service install kokoro
# Both services start automatically via systemd user units
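Before going further, it’s worth confirming that both services are actually listening on their ports. The exact unit names depend on what voicemode installed, so query them rather than assuming:
# Whisper should be on :2022 and Kokoro on :8880
ss -tlnp | grep -E ':2022|:8880'
# List the systemd user units VoiceMode created
systemctl --user list-units | grep -iE 'whisper|kokoro'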
Step 4: Configure for Local-Only Processing
Edit ~/.voicemode/voicemode.env to point exclusively at local services:
# Point to local services only (no OpenAI fallback)
VOICEMODE_TTS_BASE_URLS=http://127.0.0.1:8880/v1
VOICEMODE_STT_BASE_URLS=http://127.0.0.1:2022/v1
# Prefer local providers
VOICEMODE_PREFER_LOCAL=true
VOICEMODE_ALWAYS_TRY_LOCAL=true
Step 5: Add the MCP Server to Claude Code
claude mcp add --scope user voicemode -- uvx --refresh voice-mode
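A quick check that the server registered correctly:
claude mcp list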
The GPU Acceleration Problem
With everything running, I tested it. Whisper (STT) was fast — about 1.5 seconds to transcribe speech. But Kokoro (TTS) was painfully slow: 60 seconds to generate a short response on CPU. That’s unusable for conversation.
My ThinkPad has an Intel Meteor Lake integrated GPU. The question was: could I use it to accelerate Kokoro?
What is XPU?
Here’s where the jargon gets thick, so let me define things:
- XPU: Intel’s umbrella term for their compute accelerators (GPUs, FPGAs, etc.). In PyTorch, xpu is the device type for Intel GPUs, similar to how cuda is the device type for NVIDIA GPUs.
- PyTorch: The machine learning framework that Kokoro uses for neural network inference.
- Level Zero: Intel’s low-level GPU programming API (similar to NVIDIA’s CUDA driver API). The oneapi-level-zero package provides the loader, and intel-level-zero provides the Intel GPU implementation.
- OpenCL: An open standard for parallel computing on GPUs. intel-opencl provides Intel’s implementation.
- IPEX: Intel Extension for PyTorch — a now-deprecated library that added Intel GPU support to PyTorch. As of PyTorch 2.5+, Intel GPU support is built directly into PyTorch via the native xpu device.
- JIT compilation: Just-In-Time compilation. When you first run a model on XPU, PyTorch compiles GPU kernels on the fly. This makes the first run slow but subsequent runs fast.
Patching Kokoro for Intel XPU
Kokoro-FastAPI (the server that VoiceMode uses) only supports three devices out of the box: CUDA (NVIDIA), MPS (Apple Silicon), and CPU. It has no Intel GPU support. I patched it with about 20 lines of code changes.
Patch 1: Device Detection (api/src/core/config.py)
Added XPU to the auto-detection chain:
def get_device(self) -> str:
    if not self.use_gpu:
        return "cpu"
    if self.device_type:
        return self.device_type
    # Auto-detect device
    if torch.backends.mps.is_available():
        return "mps"
    elif torch.cuda.is_available():
        return "cuda"
    elif hasattr(torch, 'xpu') and torch.xpu.is_available():  # NEW
        return "xpu"  # NEW
    return "cpu"
Patch 2: Model Loading (api/src/inference/kokoro_v1.py)
Added XPU device handling:
if self._device == "mps":
    self._model = self._model.to(torch.device("mps"))
elif self._device == "cuda":
    self._model = self._model.cuda()
elif self._device == "xpu":  # NEW
    self._model = self._model.to(torch.device("xpu"))  # NEW
else:
    self._model = self._model.cpu()
Patch 3: Install PyTorch with XPU Support
The default Kokoro installation uses the CPU-only PyTorch build. I replaced it with the XPU build:
cd ~/.voicemode/services/kokoro
uv pip install --python .venv/bin/python --reinstall torch \
--index-url https://download.pytorch.org/whl/xpu
This installed PyTorch 2.10.0+xpu with Intel GPU support, including the oneAPI Math Kernel Library (oneMKL) for accelerated linear algebra.
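A one-liner against the same virtualenv is enough to sanity-check that the XPU build actually sees the iGPU before wiring it into the service:
# Should print the +xpu build string and True if the Intel GPU is visible
~/.voicemode/services/kokoro/.venv/bin/python -c \
  "import torch; print(torch.__version__, torch.xpu.is_available())"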
Patch 4: Start Script and systemd Unit
Created start-gpu_intel.sh:
#!/bin/bash
PROJECT_ROOT=$(pwd)
export USE_GPU=true
export USE_ONNX=false
export DEVICE_TYPE=xpu
export PYTHONPATH=$PROJECT_ROOT:$PROJECT_ROOT/api
export MODEL_DIR=src/models
export VOICES_DIR=src/voices/v1_0
export WEB_PLAYER_PATH=$PROJECT_ROOT/web
uv run --no-sync python docker/scripts/download_model.py --output api/src/models/v1_0
uv run --no-sync uvicorn api.src.main:app --host 0.0.0.0 --port 8880
Updated the systemd unit to use the new script:
ExecStart=/home/user/.voicemode/services/kokoro/start-gpu_intel.sh
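Then make the script executable, reload systemd, and restart the service so it picks up the new ExecStart. I’m assuming the user unit is simply named kokoro here; check the exact name with systemctl --user list-units if yours differs:
chmod +x ~/.voicemode/services/kokoro/start-gpu_intel.sh
systemctl --user daemon-reload
systemctl --user restart kokoro   # unit name may differ on your install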
The Results
After restarting Kokoro with XPU support:
| Metric | CPU | Intel XPU | Speedup |
|---|---|---|---|
| TTS generation (short text) | 60.5s | 0.4s | 151x |
| TTS generation (1-2 sentences) | 60.5s | 2.9s | 21x |
| TTS generation (paragraph) | 60.5s | 3.7s | 16x |
| First-run warmup (JIT compilation) | 73s | 668s | Slower (one-time cost) |
| Memory usage | ~1.3 GB | ~3.9 GB | Higher |
The first-run JIT compilation penalty is steep (11 minutes), but this is a one-time cost. Once the kernels are compiled, TTS generation drops to well under 4 seconds — fast enough for natural conversation.
A complete voice round-trip (Claude speaks + I respond + transcription) takes about 12 seconds total:
- TTS generation: ~2.7s
- Audio playback: ~4-7s (depends on message length)
- Recording: ~3-7s (depends on how long you speak)
- STT transcription: ~1.5-2.5s
Going Deeper: Optimization Exploration
With 2.7 seconds on XPU working, I wanted to push further. Could we get under a second? I had Claude explore every option.
What We Tried (and What Failed)
Float16 precision: Converting the model to half-precision float should theoretically double throughput. But Kokoro’s voice tensors are loaded as float32, and mixing dtypes crashed immediately: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half.
torch.compile(): PyTorch’s model-compilation framework wraps models in an OptimizedModule, which broke Kokoro’s KModel interface (KModel does not support len()).
torch.autocast: The “safe” way to do mixed precision. It worked technically, but results were wildly inconsistent — 2.5 to 19 seconds depending on input length. Intel XPU’s autocast triggers JIT recompilation for each new tensor shape. Not mature enough yet.
OpenVINO with Intel GPU: This was the most promising on paper — Intel’s own inference engine on Intel hardware. But the Kokoro ONNX model uses 3D tensor interpolation, and OpenVINO’s GPU plugin only supports 2D, 4D, and 5D. The CPU plugin failed too — Kokoro’s STFT operation uses dynamic rank, which OpenVINO doesn’t handle. These are fundamental model architecture incompatibilities, not configuration issues.
Linux GPU tuning: The GPU was already running at max boost (2000 MHz), power limits were set to 55W (3.7x the 15W design), and the CPU governor was already performance. No gains left on the table.
What Actually Worked: ONNX on CPU
Plain ONNX Runtime with the CPUExecutionProvider (no OpenVINO) turned out to be the interesting alternative. Using the kokoro-onnx library:
| Text Length | ONNX CPU | PyTorch XPU |
|---|---|---|
| Short (2 words) | 0.85s | 0.37s |
| 1 sentence | 3.9s | 2.9s |
| Paragraph (~60 words) | 12.1s | 3.7s |
ONNX CPU has zero warmup time (vs 11 minutes for XPU JIT), which is appealing. But it scales linearly with text length because CPUs process sequentially. The GPU’s parallelism makes PyTorch XPU consistently faster for real-world usage where Claude’s responses are typically multiple sentences.
The ONNX backend code is still in the codebase (toggled via USE_ONNX=true) as a fallback for systems without GPU support.
Bottom Line
PyTorch XPU at 2.7-3.7 seconds remains the best option for Intel integrated graphics. The Intel XPU ecosystem is still young — float16, torch.compile, and OpenVINO all hit limitations. But raw float32 inference on the GPU delivers a 22x speedup over CPU, and that’s enough for usable voice conversation.
Lessons Learned
- The bootc workflow adds friction but pays off. Rebuilding and rebooting for package changes is slower than dnf install, but having an immutable, reproducible OS image is worth it. Every machine that pulls this image gets VoiceMode support automatically.
- Intel GPU compute on Linux is maturing, but it’s not NVIDIA yet. With NVIDIA CUDA, you get 15+ years of ML ecosystem support. Kokoro ships with a start-gpu.sh that uses CUDA out of the box. PyTorch’s default download is the CUDA build. Nearly every ML application has CUDA support baked in, and kernel warmup is near-instant because kernels are pre-compiled. With Intel XPU, none of that is true today. PyTorch’s native xpu device only landed in PyTorch 2.5 (late 2024). Kokoro had zero Intel GPU support — we patched it ourselves. The PyTorch XPU wheels live on a separate index URL, not the default. And the first-run JIT compilation penalty is 11 minutes. But here’s what makes it “maturing”: the driver packages (intel-level-zero, intel-opencl) ship in RHEL 10’s standard repos — no third-party repos needed. PyTorch’s xpu device works as a drop-in for cuda — our patch was only about 20 lines. And it delivers real speedup (22x). The infrastructure is there; application-level support is what’s still catching up.
- Local voice AI is practical today. A fully local voice pipeline with no cloud dependencies runs well on a laptop with an integrated GPU. The 82M parameter Kokoro model is small enough to fit in iGPU memory while delivering natural-sounding speech.
- First-run JIT compilation is the main pain point. The 11-minute warmup on first XPU run is rough. Future work could include ahead-of-time compilation or kernel caching to eliminate this.
The Stack
For reference, here’s the complete local voice stack:
- OS: RHEL 10 (bootc image)
- Hardware: ThinkPad X1 Carbon, Intel Meteor Lake, Intel Graphics (iGPU)
- STT: Whisper.cpp v1.8.3 with ggml-base model (port 2022)
- TTS: Kokoro-FastAPI v0.2.4 with Kokoro v1.0 82M model (port 8880)
- GPU: PyTorch 2.10.0+xpu on Intel Level Zero / OpenCL
- MCP Server: VoiceMode v8.2.0
- AI Assistant: Claude Code with Claude Opus 4.6
No OpenAI API key. No cloud audio processing. Just local compute doing local inference.
Timeline
Here’s the chronological journey of this project, from first attempt to final result:
- Containerfile setup — Added build deps (python3-devel, cmake, SDL2, etc.), Vulkan packages, and Intel GPU compute drivers (intel-level-zero, intel-opencl) to RHEL 10 bootc image. Built, pushed to quay.io, rebooted.
- VoiceMode install — Ran voicemode service install whisper and voicemode service install kokoro. Both came up as systemd user services. Configured for local-only processing.
- First test: CPU baseline — Whisper STT: 1.5s (fast). Kokoro TTS: 60.5s (unusable). The bottleneck was clear.
- Intel XPU patch — Added ~20 lines to Kokoro-FastAPI: XPU device detection in config.py, model loading in kokoro_v1.py, XPU memory management. Installed PyTorch 2.10.0+xpu. Created start-gpu_intel.sh.
- XPU result: 22x speedup — TTS dropped from 60.5s to 2.7s. First-run JIT warmup: 11 minutes (one-time).
- Float16 attempt — Crashed: dtype mismatch between float16 model weights and float32 voice tensors.
- torch.compile() attempt — Crashed: OptimizedModule breaks KModel’s len() interface.
- torch.autocast attempt — Worked but erratic: 2.5-19s depending on input length. XPU JIT recompiles per tensor shape.
- Linux GPU tuning check — Already maxed: 2000 MHz boost, 55W power, performance governor. No gains available.
- OpenVINO GPU attempt — Failed: 3D tensor interpolation unsupported (only 2D/4D/5D).
- OpenVINO CPU attempt — Failed: dynamic rank STFT operation unsupported.
- ONNX CPU backend — Built working backend using kokoro-onnx library. Zero warmup, but 12s for paragraphs vs 3.7s on XPU. Kept as fallback.
- Final benchmarks — PyTorch XPU post-warmup: 0.4s short text, 2.9s sentences, 3.7s paragraphs. Winner across the board for real-world text lengths.
