I’ve been running Claude Code as my daily driver for software engineering work. It’s powerful, but typing every command gets tedious. I wanted voice interaction — speak to Claude, hear it respond — but without sending my audio to the cloud. Here’s how I got a fully local voice pipeline running on my ThinkPad X1 Carbon with RHEL 10, including GPU acceleration on the Intel Meteor Lake integrated graphics.
Also, this blog serves as a strange combination of learning, experimentation, and something you can point your agent at to figure out the best way to set this up yourself on your own Linux machine (RHEL 10 in my case). This was one of the more uncomfortable endeavors I’ve taken on since adopting agentic workflows and Claude so heavily. I had to accept that it was getting certain things to work before I fully understood them. It literally got things to the point where I could speak into the computer and it could respond to me before I fully understood the architecture or what software was being used.
I even included a timeline so that you can see how much experimentation and learning we did. My conclusion is that we must change how we think about building and configuring software. The process has always been non-deterministic; this latest iteration of technology just forces us to accept it.
The Goal
Run VoiceMode MCP (Model Context Protocol) with Claude Code using entirely local speech processing:
- Speech-to-Text (STT): Whisper.cpp — OpenAI’s Whisper model compiled to run natively on CPU
- Text-to-Speech (TTS): Kokoro — a local neural TTS engine
- No cloud dependencies: No OpenAI API key, no audio leaving my machine
What is MCP?
MCP (Model Context Protocol) is an open standard that lets AI assistants like Claude connect to external tools and data sources. Think of it like USB for AI — a standardized way to plug capabilities into your AI assistant. VoiceMode is an MCP server that gives Claude the ability to hear and speak through your computer’s microphone and speakers.
What is VoiceMode?
VoiceMode (v8.2.0) is an MIT-licensed MCP server that adds voice conversation capabilities to Claude Code. It supports both cloud processing (via OpenAI’s API) and fully local processing using:
- Whisper.cpp for speech-to-text (STT) on port 2022
- Kokoro for text-to-speech (TTS) on port 8880
For privacy-sensitive work, the local stack is ideal — all audio processing stays on your device.
The Architecture
Here’s how the pieces fit together:
You speak into mic
|
v
[Microphone] --> [VoiceMode MCP Server] --> [Whisper.cpp :2022]
|
Speech-to-Text
|
v
[Claude Code CLI]
|
Claude responds
|
v
[VoiceMode MCP Server] --> [Kokoro :8880]
|
Text-to-Speech
|
v
[Speakers]
|
You hear it
The VoiceMode MCP server orchestrates the whole flow. When Claude calls the converse tool, it:
- Sends text to Kokoro for speech synthesis (TTS)
- Plays the audio through your speakers
- Records from your microphone
- Sends the recording to Whisper for transcription (STT)
- Returns the transcribed text back to Claude
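To make this concrete, here is roughly what the two speech hops look like as raw HTTP calls. This is a sketch, not VoiceMode’s actual code: I’m assuming both local servers expose the OpenAI-compatible audio endpoints that their /v1 base URLs suggest, and the voice name and file names are placeholders.
# TTS: ask Kokoro to synthesize a line of text (voice name is a placeholder)
curl -s http://127.0.0.1:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro", "input": "Hello from Claude", "voice": "af_sky", "response_format": "wav"}' \
  -o reply.wav
# STT: send a microphone recording to Whisper for transcription
curl -s http://127.0.0.1:2022/v1/audio/transcriptions \
  -F file=@recording.wav -F model=whisper-1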
Two Models, Two Jobs
A common point of confusion: Whisper and Kokoro are completely different models that do opposite jobs. They share nothing — different authors, different architectures, different frameworks, different purposes. Whisper listens; Kokoro speaks.
|  | Whisper.cpp (STT) | Kokoro (TTS) |
|---|---|---|
| Task | Speech-to-Text | Text-to-Speech |
| Input | Your voice (audio from mic) | Claude’s text response |
| Output | Transcribed text | Spoken audio (to speakers) |
| Model origin | OpenAI Whisper | Kokoro-82M v1.0 |
| Parameters | 74M (base model) | 82M |
| Framework | C++ (ggml) | Python (PyTorch) |
| Architecture | Encoder-decoder transformer | StyleTTS2-based |
| Port | 2022 | 8880 |
| Speed on CPU | ~1.5s (fast) | ~60s (painfully slow) |
The speed difference on CPU is what drove the whole GPU acceleration effort in this post. Whisper.cpp is already compiled to efficient C++ and runs fine on CPU. Kokoro runs through Python and PyTorch, where CPU inference is much slower — making it the bottleneck that needed GPU help.
Setting It Up on RHEL 10 Bootc
I run RHEL 10 as a bootable container (bootc) image. This means my OS is an immutable container image pulled from a registry. Package changes go into a Containerfile, get built, pushed to quay.io, and then applied with bootc upgrade.
Step 1: Add Dependencies to the Containerfile
I added these packages to my Containerfile for VoiceMode support:
Build dependencies for Whisper.cpp:
python3-devel
alsa-lib-devel
portaudio-devel
cmake
gcc-c++
SDL2-devel
ffmpeg-free
Vulkan GPU acceleration (for Whisper.cpp):
vulkan-headers
vulkan-loader-devel
glslang
Intel GPU compute drivers (for Kokoro TTS):
intel-level-zero
intel-opencl
oneapi-level-zero
I also added uv tool install voice-mode to the Python tools section of the Containerfile to bake VoiceMode into the image.
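Pieced together, the relevant fragment of the Containerfile looks roughly like this (a sketch; my actual Containerfile organizes the packages differently):
# Build deps, Vulkan, and Intel GPU compute drivers in one layer
RUN dnf install -y \
      python3-devel alsa-lib-devel portaudio-devel cmake gcc-c++ SDL2-devel ffmpeg-free \
      vulkan-headers vulkan-loader-devel glslang \
      intel-level-zero intel-opencl oneapi-level-zero && \
    dnf clean all
# Bake VoiceMode into the image alongside the other Python tools
RUN uv tool install voice-mode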
Step 2: Build, Push, and Reboot
cd ~/Projects/rhel10-bootc
podman build -t quay.io/fatherlinux/rhel10-bootc:latest .
podman push quay.io/fatherlinux/rhel10-bootc:latest
sudo bootc upgrade
sudo systemctl reboot
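Once the machine is back up, bootc status is a quick way to confirm the new image is the booted deployment:
# The booted entry should show the image you just pushed
sudo bootc status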
Step 3: Install the Voice Services
After rebooting onto the new image:
# Install local Whisper STT
voicemode service install whisper
# Install local Kokoro TTS
voicemode service install kokoro
# Both services start automatically via systemd user units
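Before going further, it’s worth confirming that both services are actually listening on their ports. The exact unit names depend on what voicemode installed, so query them rather than assuming:
# Whisper should be on :2022 and Kokoro on :8880
ss -tlnp | grep -E ':2022|:8880'
# List the systemd user units VoiceMode created
systemctl --user list-units | grep -iE 'whisper|kokoro'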
Step 4: Configure for Local-Only Processing
Edit ~/.voicemode/voicemode.env to point exclusively at local services:
# Point to local services only (no OpenAI fallback)
VOICEMODE_TTS_BASE_URLS=http://127.0.0.1:8880/v1
VOICEMODE_STT_BASE_URLS=http://127.0.0.1:2022/v1
# Prefer local providers
VOICEMODE_PREFER_LOCAL=true
VOICEMODE_ALWAYS_TRY_LOCAL=true
Step 5: Add the MCP Server to Claude Code
claude mcp add --scope user voicemode -- uvx --refresh voice-mode
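A quick check that the server registered correctly:
claude mcp list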
The GPU Acceleration Problem
With everything running, I tested it. Whisper (STT) was fast — about 1.5 seconds to transcribe speech. But Kokoro (TTS) was painfully slow: 60 seconds to generate a short response on CPU. That’s unusable for conversation.
My ThinkPad has an Intel Meteor Lake integrated GPU. The question was: could I use it to accelerate Kokoro?
What is XPU?
Here’s where the jargon gets thick, so let me define things:
- XPU: Intel’s umbrella term for their compute accelerators (GPUs, FPGAs, etc.). In PyTorch, xpu is the device type for Intel GPUs, similar to how cuda is the device type for NVIDIA GPUs.
- PyTorch: The machine learning framework that Kokoro uses for neural network inference.
- Level Zero: Intel’s low-level GPU programming API (similar to NVIDIA’s CUDA driver API). The oneapi-level-zero package provides the loader, and intel-level-zero provides the Intel GPU implementation.
- OpenCL: An open standard for parallel computing on GPUs. intel-opencl provides Intel’s implementation.
- IPEX: Intel Extension for PyTorch — a now-deprecated library that added Intel GPU support to PyTorch. As of PyTorch 2.5+, Intel GPU support is built directly into PyTorch via the native xpu device.
- JIT compilation: Just-In-Time compilation. When you first run a model on XPU, PyTorch compiles GPU kernels on the fly. This makes the first run slow but subsequent runs fast.
Patching Kokoro for Intel XPU
Kokoro-FastAPI (the server that VoiceMode uses) only supports three devices out of the box: CUDA (NVIDIA), MPS (Apple Silicon), and CPU. It has no Intel GPU support. I patched it with about 20 lines of code changes.
Patch 1: Device Detection (api/src/core/config.py)
Added XPU to the auto-detection chain:
def get_device(self) -> str:
    if not self.use_gpu:
        return "cpu"
    if self.device_type:
        return self.device_type
    # Auto-detect device
    if torch.backends.mps.is_available():
        return "mps"
    elif torch.cuda.is_available():
        return "cuda"
    elif hasattr(torch, 'xpu') and torch.xpu.is_available():  # NEW
        return "xpu"  # NEW
    return "cpu"
Patch 2: Model Loading (api/src/inference/kokoro_v1.py)
Added XPU device handling:
if self._device == "mps":
    self._model = self._model.to(torch.device("mps"))
elif self._device == "cuda":
    self._model = self._model.cuda()
elif self._device == "xpu":  # NEW
    self._model = self._model.to(torch.device("xpu"))  # NEW
else:
    self._model = self._model.cpu()
Patch 3: Install PyTorch with XPU Support
The default Kokoro installation uses the CPU-only PyTorch build. I replaced it with the XPU build:
cd ~/.voicemode/services/kokoro
uv pip install --python .venv/bin/python --reinstall torch \
--index-url https://download.pytorch.org/whl/xpu
This installed PyTorch 2.10.0+xpu with Intel GPU support, including the oneAPI Math Kernel Library (oneMKL) for accelerated linear algebra.
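A one-liner against the same virtualenv is enough to sanity-check that the XPU build actually sees the iGPU before wiring it into the service:
# Should print the +xpu build string and True if the Intel GPU is visible
~/.voicemode/services/kokoro/.venv/bin/python -c \
  "import torch; print(torch.__version__, torch.xpu.is_available())"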
Patch 4: Start Script and systemd Unit
Created start-gpu_intel.sh:
#!/bin/bash
PROJECT_ROOT=$(pwd)
export USE_GPU=true
export USE_ONNX=false
export DEVICE_TYPE=xpu
export PYTHONPATH=$PROJECT_ROOT:$PROJECT_ROOT/api
export MODEL_DIR=src/models
export VOICES_DIR=src/voices/v1_0
export WEB_PLAYER_PATH=$PROJECT_ROOT/web
uv run --no-sync python docker/scripts/download_model.py --output api/src/models/v1_0
uv run --no-sync uvicorn api.src.main:app --host 0.0.0.0 --port 8880
Updated the systemd unit to use the new script:
ExecStart=/home/user/.voicemode/services/kokoro/start-gpu_intel.sh
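Then make the script executable, reload systemd, and restart the service so it picks up the new ExecStart. I’m assuming the user unit is simply named kokoro here; check the exact name with systemctl --user list-units if yours differs:
chmod +x ~/.voicemode/services/kokoro/start-gpu_intel.sh
systemctl --user daemon-reload
systemctl --user restart kokoro   # unit name may differ on your install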
The Results
After restarting Kokoro with XPU support:
| Metric | CPU | Intel XPU | Speedup |
|---|---|---|---|
| TTS generation (short text) | 60.5s | 0.4s | 151x |
| TTS generation (1-2 sentences) | 60.5s | 2.9s | 21x |
| TTS generation (paragraph) | 60.5s | 3.7s | 16x |
| First-run warmup (JIT compilation) | 73s | 668s | Slower (one-time cost) |
| Memory usage | ~1.3 GB | ~3.9 GB | Higher |
The first-run JIT compilation penalty is steep (11 minutes), but this is a one-time cost. Once the kernels are compiled, TTS generation drops to well under 4 seconds — fast enough for natural conversation.
A complete voice round-trip (Claude speaks + I respond + transcription) takes about 12 seconds total:
- TTS generation: ~2.7s
- Audio playback: ~4-7s (depends on message length)
- Recording: ~3-7s (depends on how long you speak)
- STT transcription: ~1.5-2.5s
Going Deeper: Optimization Exploration
With 2.7 seconds on XPU working, I wanted to push further. Could we get under a second? I had Claude explore every option.
What We Tried (and What Failed)
Float16 precision: Converting the model to half-precision float should theoretically double throughput. But Kokoro’s voice tensors are loaded as float32, and mixing dtypes crashed immediately: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half.
torch.compile(): PyTorch’s model-compilation framework wraps models in an OptimizedModule, which broke Kokoro’s KModel interface (KModel does not support len()).
torch.autocast: The “safe” way to do mixed precision. It worked technically, but results were wildly inconsistent — 2.5 to 19 seconds depending on input length. Intel XPU’s autocast triggers JIT recompilation for each new tensor shape. Not mature enough yet.
OpenVINO with Intel GPU: This was the most promising on paper — Intel’s own inference engine on Intel hardware. But the Kokoro ONNX model uses 3D tensor interpolation, and OpenVINO’s GPU plugin only supports 2D, 4D, and 5D. The CPU plugin failed too — Kokoro’s STFT operation uses dynamic rank, which OpenVINO doesn’t handle. These are fundamental model architecture incompatibilities, not configuration issues.
Linux GPU tuning: The GPU was already running at max boost (2000 MHz), power limits were set to 55W (3.7x the 15W design), and the CPU governor was already performance. No gains left on the table.
What Actually Worked: ONNX on CPU
Plain ONNX Runtime with the CPUExecutionProvider (no OpenVINO) turned out to be the interesting alternative. Using the kokoro-onnx library:
| Text Length | ONNX CPU | PyTorch XPU |
|---|---|---|
| Short (2 words) | 0.85s | 0.37s |
| 1 sentence | 3.9s | 2.9s |
| Paragraph (~60 words) | 12.1s | 3.7s |
ONNX CPU has zero warmup time (vs 11 minutes for XPU JIT), which is appealing. But it scales linearly with text length because CPUs process sequentially. The GPU’s parallelism makes PyTorch XPU consistently faster for real-world usage where Claude’s responses are typically multiple sentences.
The ONNX backend code is still in the codebase (toggled via USE_ONNX=true) as a fallback for systems without GPU support.
Bottom Line
PyTorch XPU at 2.7-3.7 seconds remains the best option for Intel integrated graphics. The Intel XPU ecosystem is still young — float16, torch.compile, and OpenVINO all hit limitations. But raw float32 inference on the GPU delivers a 22x speedup over CPU, and that’s enough for usable voice conversation.
Lessons Learned
- The bootc workflow adds friction but pays off. Rebuilding and rebooting for package changes is slower than dnf install, but having an immutable, reproducible OS image is worth it. Every machine that pulls this image gets VoiceMode support automatically.
- Intel GPU compute on Linux is maturing, but it’s not NVIDIA yet. With NVIDIA CUDA, you get 15+ years of ML ecosystem support. Kokoro ships with a start-gpu.sh that uses CUDA out of the box. PyTorch’s default download is the CUDA build. Nearly every ML application has CUDA support baked in, and kernel warmup is near-instant because kernels are pre-compiled. With Intel XPU, none of that is true today. PyTorch’s native xpu device only landed in PyTorch 2.5 (late 2024). Kokoro had zero Intel GPU support — we patched it ourselves. The PyTorch XPU wheels live on a separate index URL, not the default. And the first-run JIT compilation penalty is 11 minutes. But here’s what makes it “maturing”: the driver packages (intel-level-zero, intel-opencl) ship in RHEL 10’s standard repos — no third-party repos needed. PyTorch’s xpu device works as a drop-in for cuda — our patch was only about 20 lines. And it delivers real speedup (22x). The infrastructure is there; application-level support is what’s still catching up.
- Local voice AI is practical today. A fully local voice pipeline with no cloud dependencies runs well on a laptop with an integrated GPU. The 82M parameter Kokoro model is small enough to fit in iGPU memory while delivering natural-sounding speech.
- First-run JIT compilation is the main pain point. The 11-minute warmup on first XPU run is rough. Future work could include ahead-of-time compilation or kernel caching to eliminate this.
The Stack
For reference, here’s the complete local voice stack:
- OS: RHEL 10 (bootc image)
- Hardware: ThinkPad X1 Carbon, Intel Meteor Lake, Intel Graphics (iGPU)
- STT: Whisper.cpp v1.8.3 with ggml-base model (port 2022)
- TTS: Kokoro-FastAPI v0.2.4 with Kokoro v1.0 82M model (port 8880)
- GPU: PyTorch 2.10.0+xpu on Intel Level Zero / OpenCL
- MCP Server: VoiceMode v8.2.0
- AI Assistant: Claude Code with Claude Opus 4.6
No OpenAI API key. No cloud audio processing. Just local compute doing local inference.
Timeline
Here’s the chronological journey of this project, from first attempt to final result:
- Containerfile setup — Added build deps (python3-devel, cmake, SDL2, etc.), Vulkan packages, and Intel GPU compute drivers (intel-level-zero, intel-opencl) to RHEL 10 bootc image. Built, pushed to quay.io, rebooted.
- VoiceMode install — Ran voicemode service install whisper and voicemode service install kokoro. Both came up as systemd user services. Configured for local-only processing.
- First test: CPU baseline — Whisper STT: 1.5s (fast). Kokoro TTS: 60.5s (unusable). The bottleneck was clear.
- Intel XPU patch — Added ~20 lines to Kokoro-FastAPI: XPU device detection in config.py, model loading in kokoro_v1.py, XPU memory management. Installed PyTorch 2.10.0+xpu. Created start-gpu_intel.sh.
- XPU result: 22x speedup — TTS dropped from 60.5s to 2.7s. First-run JIT warmup: 11 minutes (one-time).
- Float16 attempt — Crashed: dtype mismatch between float16 model weights and float32 voice tensors.
- torch.compile() attempt — Crashed: OptimizedModule breaks KModel’s len() interface.
- torch.autocast attempt — Worked but erratic: 2.5-19s depending on input length. XPU JIT recompiles per tensor shape.
- Linux GPU tuning check — Already maxed: 2000 MHz boost, 55W power, performance governor. No gains available.
- OpenVINO GPU attempt — Failed: 3D tensor interpolation unsupported (only 2D/4D/5D).
- OpenVINO CPU attempt — Failed: dynamic rank STFT operation unsupported.
- ONNX CPU backend — Built working backend using kokoro-onnx library. Zero warmup, but 12s for paragraphs vs 3.7s on XPU. Kept as fallback.
- Final benchmarks — PyTorch XPU post-warmup: 0.4s short text, 2.9s sentences, 3.7s paragraphs. Winner across the board for real-world text lengths.
