Local Models and Open Source Agents (and Why You Need to Pay Attention)

There’s a lot of negativity toward AI in the Fedora and RHEL communities right now. I get it: the hype cycle is real, and a lot of the marketing is insufferable. But I think the negativity is causing people to tune out, and when you tune out, you miss genuinely cool work that’s directly relevant to what we do every day and, more importantly, work that enables ethical actors to drive the world forward.

But I digress… 🙂

A colleague of mine on the Red Hat Enterprise Linux team just put out a video over at Brian’s AI and Linux Videos. He takes a single local model, GLM-4-9B-Chat, and shows how it goes from completely non-functional to reliably managing a fleet of Fedora systems, just by changing the environment it runs in. No model swap. No upgrade to a bigger model. Just configuration. And it all runs on his local system: no calling out to OpenAI, Claude, Gemini, etc. Pretty neat stuff.

Full disclosure: I drafted this blog post with Claude Code, a proprietary AI tool, though I went through it by hand with a fine-toothed comb and modified it substantially. I use Claude Code daily, and I think what Anthropic is building has some of the same spirit as open source: the MCP protocol itself is open, and they’ve been good citizens in the ecosystem.

But that’s exactly why work like Brian’s matters. If the only path to reliable AI tooling runs through a paid API, we’ve just recreated the same vendor dependency our community has spent decades pushing back against. I want open source AI tools to flourish, and Brian’s video shows they’re closer than most people think.

Why You Should Watch This Video

It highlights what cutting-edge systems administration looks like in 2026. Brian uses an open source AI agent called Goose to check the health of three Fedora systems via something called an MCP server. That’s real infrastructure work.

The difference from a traditional monitoring script is that the AI agent decides what to check, how to interpret it, and what to flag. It’s making parallel calls to collect system info, CPU stats, memory, disk usage, services, network interfaces, ports, and running processes across three hosts simultaneously.

If you care about open source, this should get you excited: everything in Brian’s stack is open. The model, the inference engine, the agent framework, the MCP server. No API keys to a proprietary service, no vendor lock-in, no monthly bill that scales with usage. This is what AI looks like when it’s built the way we’ve always built open source infrastructure: with tools you can inspect, modify, and run on your own hardware.

A Jargon Decoder for the Rest of Us

I realize this video (and this whole space) is drowning in new terminology. If you’ve spent the last decade working with Linux, Docker/Podman, Kubernetes, RPMs, Ansible, or systemd, half of these terms probably sound like someone made them up at a startup pitch meeting (welcome to how people felt when the DevOps community kicked off). So let me translate.

Tool Calling — This is the ability of an LLM to generate a structured request to execute a specific function. Think of it like a script calling a binary with specific flags and arguments. The AI doesn’t run the tool itself — it produces a structured output that says “call this function with these parameters,” and an orchestrator executes it. If you’ve ever written a wrapper script that parses input and calls the right command, you understand the concept.
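
To make that concrete, here’s a minimal sketch of the orchestrator side in Python. The tool name, arguments, and dispatch table are invented for illustration; a real agent gets the structured request from the model’s output, but the shape is the same.

```python
# Hypothetical sketch: the model emits a structured "tool call",
# and the orchestrator (not the model) actually executes it.
import json
import subprocess

# What the LLM produces: structured output, not a command it runs itself.
model_output = json.dumps({
    "tool": "get_disk_usage",           # which function to call
    "arguments": {"host": "fedora-01"}  # parameters for that function
})

def get_disk_usage(host: str) -> str:
    """A 'tool' is just a function the orchestrator knows how to run."""
    # 'host' would select the SSH target in a real setup; this sketch runs locally.
    return subprocess.run(["df", "-h"], capture_output=True, text=True).stdout

TOOLS = {"get_disk_usage": get_disk_usage}

call = json.loads(model_output)
result = TOOLS[call["tool"]](**call["arguments"])
print(result)  # the result gets fed back to the model as context
```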

MCP (Model Context Protocol) — A standardized protocol that lets LLMs connect to external tools and data sources. The best analogy for our world: it’s like a device driver, but for AI. Just like the kernel needs a driver to talk to your NIC, an LLM needs an MCP server to talk to your infrastructure. MCP standardizes that interface so you don’t need a custom integration for every tool. In Brian’s video, the MCP server is what lets the AI actually SSH into those Fedora boxes and run commands. I recently wrote about building MCP servers the right way if you want to dig deeper into the architecture and security considerations.
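
If you’re curious what’s on the wire, MCP speaks JSON-RPC. The sketch below is illustrative only: the tool name and arguments are made up, and a real client and server exchange these messages over stdio or HTTP rather than printing them.

```python
# Illustrative sketch of MCP message shapes (JSON-RPC 2.0).
# The tool name and arguments are invented, not from Brian's setup.
import json

# Client asks the server what tools it offers.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Server describes its tools, including a schema for the arguments.
list_response = {
    "jsonrpc": "2.0", "id": 1,
    "result": {"tools": [{
        "name": "check_disk_usage",
        "description": "Report disk usage on a managed host",
        "inputSchema": {"type": "object",
                        "properties": {"host": {"type": "string"}}},
    }]},
}

# The agent (Goose, in the video) then invokes a tool on the model's behalf.
call_request = {
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "check_disk_usage", "arguments": {"host": "fedora-01"}},
}

for msg in (list_request, list_response, call_request):
    print(json.dumps(msg, indent=2))
```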

Inference Engine — The software that loads and runs the AI model. This is your application server. Just like you’d choose between Apache and Nginx to serve HTTP, you choose between Ollama, llama.cpp, or vLLM to serve AI inference. Brian’s key finding is that Ollama — the “easy mode” option that everyone defaults to — actively breaks tool calling for certain models. Same model, different inference engine, completely different results.
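
One way to internalize the “it’s just another service” point: a local inference engine is something you can probe over HTTP like any other daemon. A minimal sketch, assuming an OpenAI-compatible server (llama.cpp’s llama-server, vLLM, and others expose one) is already running on localhost port 8080; the port and paths depend on your setup.

```python
# Minimal sketch: poke a local inference server the same way you'd poke
# any other HTTP service. Assumes an OpenAI-compatible server is already
# running on :8080 -- adjust the base URL to your environment.
import requests

BASE = "http://localhost:8080/v1"

# Ask the server which model(s) it has loaded, like hitting a status page.
print(requests.get(f"{BASE}/models", timeout=10).json())
```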

Quantization (Q3, Q8) — Compression for AI models. A full-precision model might need 32GB of VRAM. A Q8 (8-bit) quantized version might need 10GB. A Q3 (3-bit) version might need 4GB. The trade-off is the same as any compression: you lose fidelity. For chat, Q3 is usually fine. For tool calling — where the model needs to produce precisely structured output — Brian shows that Q3 introduces duplicate calls and errors that Q8 doesn’t. If you need precision, don’t over-compress.
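
Those memory numbers aren’t magic; they’re roughly bits-per-parameter times parameter count, before you add the KV cache and runtime overhead. A quick back-of-envelope sketch (the figures are estimates, not measurements):

```python
# Back-of-envelope VRAM math for a ~9B-parameter model at different precisions.
# Real memory use also depends on context length, KV cache, and runtime
# overhead, so treat these as floor estimates rather than exact figures.
PARAMS = 9e9  # GLM-4-9B-Chat has roughly 9 billion parameters

def weight_gib(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 2**30

for label, bits in [("FP32", 32), ("FP16", 16), ("Q8", 8), ("Q3", 3)]:
    print(f"{label:5s} ~{weight_gib(bits):5.1f} GiB of weights")
# FP32 ~33.5, FP16 ~16.8, Q8 ~8.4, Q3 ~3.1 -- which lines up with the rough
# 32GB / 10GB / 4GB numbers above once you add overhead.
```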

Context Window — The model’s working memory. It’s the buffer that holds the entire conversation: your prompt, the tool definitions, every tool call, every result. If you run out of context, the model either loses track of what it’s doing or crashes entirely. Brian demonstrates this by shrinking the context from 75,000 tokens to 15,000 — the model starts strong, then hits the wall and fails. Think of it like running a big Ansible playbook on a machine with 512MB of RAM. You’ll get partway through before things fall apart.
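
You can sanity-check your budget before hitting that wall. The sketch below uses the crude “about four characters per token” heuristic to estimate whether the prompt plus tool definitions plus tool output still fits; the sizes are invented stand-ins, not numbers from the video.

```python
# Rough sketch: estimate whether a conversation plus tool output still fits
# in the context window. The ~4 characters per token rule is a crude
# approximation; use the model's real tokenizer for accurate counts.
CONTEXT_WINDOW = 15_000  # tokens -- the shrunken window from the failure case

def approx_tokens(text: str) -> int:
    return len(text) // 4

system_prompt = "You are a Fedora fleet health-check agent."
tool_definitions = "x" * 24_000  # stand-in for the JSON tool schemas
tool_results = "x" * 60_000      # stand-in for df/free/ss output from 3 hosts

used = sum(approx_tokens(t) for t in (system_prompt, tool_definitions, tool_results))
print(f"~{used} of {CONTEXT_WINDOW} tokens used")
if used > CONTEXT_WINDOW:
    print("Over budget: the model will lose track of earlier steps or fail outright.")
```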

Temperature / Top P — These control how “creative” or “random” the model’s output is. High temperature = more variation, more surprises. Low temperature = more deterministic, more predictable. For creative writing, you want some temperature. For tool calling — where the model needs to produce exact JSON with the right function names and parameters — you want it locked down. Brian shows that the defaults (which are tuned for chat) produce duplicate and erratic tool calls, while the model creator’s recommended settings (Temperature 0.7, Top P 1.0) clean everything up.
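
In practice these are just two fields in the request you send to the inference server. A sketch, again assuming an OpenAI-compatible local endpoint; the 0.7 and 1.0 values are the model creator’s recommendations cited above, and everything else (model name, port, prompt) is illustrative.

```python
# Sketch: lock down sampling for tool calling instead of accepting chat defaults.
# Assumes an OpenAI-compatible local server (e.g. llama.cpp's llama-server).
import requests

payload = {
    "model": "glm-4-9b-chat",  # whatever name your server exposes
    "messages": [{"role": "user",
                  "content": "Check disk usage on fedora-01, fedora-02, and fedora-03."}],
    "temperature": 0.7,  # the model creator's recommended value from the video
    "top_p": 1.0,        # ditto -- chat-tuned defaults produced duplicate calls
}

resp = requests.post("http://localhost:8080/v1/chat/completions",
                     json=payload, timeout=120)
print(resp.json()["choices"][0]["message"])
```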

Goose — An open source AI agent framework. If the inference engine is your application server, Goose is your orchestration layer — think VIM with plugins. It takes a user request, breaks it into steps, makes tool calls via MCP, collects results, and synthesizes a response.

Jinja (Templates) — A templating system used to format prompts for specific models. Every model expects its input in a slightly different format, and Jinja templates handle the translation. If you’ve used Ansible templates or Python string formatting, same concept. Wrong template = garbled input = bad results.
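
Here’s a toy example of what that translation layer does, using the jinja2 library directly. The template itself is invented; every model family ships its own chat template with its own special tokens, and the inference engine normally applies it for you.

```python
# Toy sketch of a chat template. This template is invented for illustration;
# real models ship their own Jinja templates with model-specific special tokens.
from jinja2 import Template

toy_template = Template(
    "{% for m in messages %}"
    "<|{{ m.role }}|>\n{{ m.content }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

messages = [
    {"role": "system", "content": "You are a Fedora health-check agent."},
    {"role": "user", "content": "How full is /var on fedora-01?"},
]

print(toy_template.render(messages=messages))
# Feed a model input in the wrong format and you get garbled tool calls back.
```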

The Actual Lesson

Here’s what Brian demonstrates in sequence, and why each step matters:

  1. Ollama with GLM-4 Q8: Complete failure. Zero successful tool calls. The model wants to call tools but can’t produce valid output. If you stopped here, you’d conclude the model is broken.
  2. llama.cpp with default settings, Q8: Partial success. Tool calls work, but the model makes duplicate calls and only parallelizes for one of three hosts. Usable, but wasteful.
  3. llama.cpp with recommended settings, Q3: Better parallelization (all three hosts at once), but the quantization introduces duplicate calls and interpretation errors.
  4. llama.cpp with recommended settings, Q8: 18 parallel tool calls across all three systems, one intelligent follow-up call, clean results, no duplicates. This is what “working correctly” looks like.

The difference between test 1 and test 4 is zero model changes. Same weights, same architecture, same “intelligence.” The only variables were the inference engine, the sampling parameters, and the quantization level.

For anyone who’s ever tuned a database, optimized a kernel, or debugged a misbehaving service by adjusting its configuration — this should feel very familiar. It’s also what systems administration looks like in 2026. This is our new job.

Why This Matters for Fedora and RHEL People

I’ll be direct: tool calling is how AI becomes useful for infrastructure work. Without it, you have a chatbot. With it, you have an agent that can actually manage systems.

It’s also worth noting what Brian’s demo is and isn’t. He’s running health checks: read-only, exploratory work. The AI is gathering information, interpreting it, and flagging problems. It’s not reconfiguring services or pushing changes to production. That distinction matters. I wrote about this over at InfoWorld in AI Agents and IT Ops: Cowboy Chaos Rides Again; the argument is that AI agents are powerful for exploration, analysis, and helping design the artifacts (playbooks, Dockerfiles, configs) that build your environments, but you still need deterministic, auditable processes for the changes themselves. Brian’s video is a great example of the right pattern: let the AI do the intelligence work, keep the guardrails where they belong.
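
To make the “read-only, exploratory” boundary concrete, here’s a hypothetical sketch of the kind of fact-gathering tool an MCP server might expose for this work. The function name and command list are mine, not Brian’s; the point is that nothing in it mutates the system.

```python
# Hypothetical sketch of a read-only health-check "tool": it gathers facts,
# it doesn't change anything. Command list is illustrative only.
import subprocess

READ_ONLY_CHECKS = {
    "uptime": ["uptime"],
    "memory": ["free", "-m"],
    "disk": ["df", "-h"],
    "failed_units": ["systemctl", "--failed", "--no-pager"],
}

def run_health_checks() -> dict:
    results = {}
    for name, cmd in READ_ONLY_CHECKS.items():
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        results[name] = out.stdout.strip()
    return results

if __name__ == "__main__":
    for name, output in run_health_checks().items():
        print(f"== {name} ==\n{output}\n")
```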

Brian is running these health checks across Fedora systems using Podman containers, llama.cpp, and open source tooling. Every piece of this stack is open source, which is exactly the kind of tooling our community should be embracing and championing. The new parts (the inference engine, the MCP server, the agent framework) are just services. They have configuration files. They have tuning parameters. They run in containers. They break in predictable ways when you misconfigure them.

The mistake would be to watch a local model fail at tool calling with default settings and conclude “AI isn’t ready for this.” That’s like installing PostgreSQL with default settings, running a complex query, watching it time out, and concluding that PostgreSQL can’t handle your workload. The defaults are a starting point. The tuning is the work.

And if there’s one thing our community is good at, it’s tuning.

Go Watch the Video

Brian’s video is about 10 minutes and worth your time: Why Your Local LLM Can’t Use Tools / MCP Servers (And How to Fix It). Even if you’re not running local models today, the concepts — inference engines, MCP, context windows, quantization trade-offs — are going to be part of our vocabulary whether we like it or not. Better to learn them now on your own terms.

And if you’re already running local models and getting poor results with tool calling, before you blame the model, check your environment. You might be one configuration change away from something that actually works. The best proprietary AI tools are genuinely good — I use them, and I’m not going to pretend otherwise. But the open source stack is getting there, and it needs people in our community paying attention, contributing, and pushing it forward. Not tuning out.
