
Building a Local AI Pentest Assistant — From Bare Metal to Working Tool


Why Local

Every time you paste a target IP, a hash, or a command sequence into ChatGPT, that data hits someone else’s servers. For pentesting work — client networks, credentials, attack methodology — that’s unacceptable. The answer isn’t to stop using AI. The answer is to own the stack.

This post covers building a fully local AI assistant for offensive security: inference engine, web interface, model selection, RAG knowledge bases, custom tooling, and system prompt engineering. Everything runs on consumer hardware. Nothing leaves the box.

There’s a second reason beyond privacy. AI companies are under increasing regulatory and government pressure. Contracts get revoked, APIs go down, policies change, safety guardrails tighten around exactly the kind of work we do. If your workflow depends on someone else’s API, you’re one policy update away from losing your tools. Download the weights. Own the models. They can’t take back what’s already on your hard drive.

Hardware

Nothing exotic. This is a desktop workstation, not a data center.

  • CPU: AMD Ryzen 9 7950X
  • GPU: NVIDIA RTX 4090 (24GB VRAM)
  • RAM: 64GB DDR5
  • OS: Fedora (latest stable)
  • Storage: NVMe for OS and models, HDD for bulk data

The 4090 is the bottleneck that matters. 24GB of VRAM determines which models fit entirely on the GPU (fast) and which spill into system RAM (usable but slower). 64GB of system RAM gives headroom for the larger models that need to offload layers.

If you’re on a budget, this works on any NVIDIA GPU with 8GB+ VRAM — you’re just limited to smaller models. AMD GPU support through ROCm is improving but CUDA is still the path of least resistance.

The Stack

Three components: Ollama for inference, Open WebUI for the interface, and Podman for containerization.

Ollama

Ollama wraps llama.cpp and handles model management, GPU offloading, and serving an OpenAI-compatible API. One command install:

curl -fsSL https://ollama.com/install.sh | sh

Verify it sees your GPU:

ollama run qwen2.5:0.5b "say hello"

If that responds, CUDA is working. By default Ollama binds to localhost:11434. If you need other machines on your network to reach it (like a Kali VM), edit the systemd service:

sudo systemctl edit ollama.service

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Restart the service and it’s accessible on all interfaces.
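With the port reachable, anything on the network (the Kali VM included) can drive Ollama over its OpenAI-compatible API. A minimal standard-library sketch: the model name is an assumption (use whatever `ollama list` shows), and the send is left commented so it pastes safely without a server running.

```python
import json
import urllib.request

# Hypothetical model name; substitute whatever `ollama list` shows on your box.
payload = {
    "model": "qwen3-coder:30b",
    "messages": [{"role": "user", "content": "say hello"}],
    "stream": False,
}
req = urllib.request.Request(
    "http://127.0.0.1:11434/v1/chat/completions",  # swap in the host IP from a VM
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With Ollama running, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```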

Open WebUI

Open WebUI gives you a ChatGPT-style interface on top of Ollama with conversation history, model switching, RAG knowledge bases, custom tools, and system prompts. It’s the control plane for everything.

Fedora ships with Podman instead of Docker. Use it — it’s rootless by default and doesn’t need a daemon.

# Fix subuid/subgid for rootless containers
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $(whoami)
podman system migrate

# Deploy Open WebUI
podman run -d \
  --name open-webui \
  --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Enable lingering so the container survives logouts and starts on boot:

loginctl enable-linger $(whoami)
mkdir -p ~/.config/systemd/user
podman generate systemd --new --name open-webui > ~/.config/systemd/user/open-webui.service
systemctl --user daemon-reload
systemctl --user enable open-webui

Hit http://localhost:8080 and create your admin account. That’s the stack.

Model Selection

This is where most people overthink it. You need two models: one for coding tasks and one for general conversation. Maybe a third if you want a heavy hitter for hard problems.

What I Run

qwen3-coder:30b (19GB) — Primary coding model. MoE architecture with 30B total parameters and 3.3B active per token. Fits entirely in the 4090’s VRAM. Fast inference, good at generating scripts, understands tool syntax when given RAG context. This handles 90% of my AI-assisted pentesting work.

ollama pull qwen3-coder:30b

qwen3:32b (20GB) — General conversation, analysis, planning. Dense model that also fits in VRAM. Use this for writeup assistance, methodology discussion, and anything that isn’t code generation.

ollama pull qwen3:32b

qwen3-coder-next (52GB) — The heavy option. 80B total parameters, 3B active. Spills roughly 28GB into system RAM. Slower time-to-first-token but measurably better at following RAG context and overriding training data biases. I pull this out when the 30b model isn’t cutting it.

ollama pull qwen3-coder-next

What I Tested and Dropped

  • deepseek-coder-v2:16b — Smaller, simpler output, but too many methodology errors even with RAG. Removed.
  • qwen2.5-coder:32b — Superseded by qwen3-coder. No reason to keep the older generation.
  • qwen2.5:70b — Usable but slow. The qwen3-generation 32b models match or beat it.

Model Selection Principles

Dense models (every parameter active every token) are straightforward — if the model size fits in VRAM, you get full speed. MoE models (Mixture of Experts) have a larger total parameter count but only activate a subset per token. This means a 30B-total MoE model with 3B active runs nearly as fast as a 3B dense model but with access to much more knowledge.

The tradeoff: MoE models need more storage and take longer to load, but inference is fast because only the active parameters are involved in computing each token. For local pentesting work where you’re running long sessions, the initial load time is irrelevant.

If it doesn’t fit in VRAM, it spills into system RAM. Usable, but every token that touches RAM instead of VRAM is slower. The 4090’s 24GB is the hard constraint. Models up to about 20GB run entirely on GPU. Anything larger uses a mix.
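To make the arithmetic concrete, here is a back-of-envelope check using the sizes quoted above. It ignores KV-cache and runtime overhead, so treat the numbers as rough.

```python
# Rough VRAM math for the models above; ignores KV-cache and runtime overhead.
VRAM_GB = 24.0

def spill_gb(model_gb: float) -> float:
    """GB of weights that won't fit on the GPU; 0 means fully GPU-resident."""
    return max(0.0, model_gb - VRAM_GB)

# MoE speed intuition: per-token compute tracks *active* parameters, not total.
moe_total, moe_active = 30e9, 3.3e9  # qwen3-coder:30b figures quoted earlier
print(f"active fraction per token: {moe_active / moe_total:.0%}")

for name, size in [("qwen3-coder:30b", 19.0),
                   ("qwen3:32b", 20.0),
                   ("qwen3-coder-next", 52.0)]:
    s = spill_gb(size)
    print(f"{name}: {'fully on GPU' if s == 0 else f'~{s:.0f} GB spills to RAM'}")
```

The 52GB model spilling ~28GB matches the behavior described above: it still runs, but every token that touches system RAM pays a latency cost.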

RAG Knowledge Bases

This is the highest-leverage part of the entire setup. A model without domain-specific context will write syntactically valid but methodologically wrong pentesting code. I tested this extensively — see my previous post on testing local LLMs for offensive security.

RAG (Retrieval-Augmented Generation) lets you upload documents that get chunked into a vector database. When you ask the model a question, Open WebUI retrieves relevant chunks and injects them into the context window alongside your prompt. The model sees your documentation as if you’d pasted it into the conversation.
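A toy sketch of that pipeline: chunk, embed, rank by similarity, inject. A bag-of-words counter stands in for the real embedding model, and the chunks are fabricated examples.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real RAG uses a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query and return the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Fabricated chunks standing in for vectorized documentation
chunks = [
    "ffuf -w wordlist.txt -u http://target/FUZZ for directory fuzzing",
    "secretsdump.py performs DCSync given replication rights",
    "GTFOBins: vim can spawn a shell with :!/bin/sh",
]
question = "directory fuzzing with ffuf"
context = "\n".join(retrieve(question, chunks))
prompt = f"Context:\n{context}\n\nQuestion: {question}"  # what the model sees
```

Open WebUI does the same thing with a proper embedding model and vector store; the point is that retrieval quality depends entirely on what you put in the chunks.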

What to Upload

Clone these repositories and upload them as separate knowledge bases:

git clone --depth 1 https://github.com/HackTricks-wiki/hacktricks
git clone --depth 1 https://github.com/swisskyrepo/PayloadsAllTheThings
git clone --depth 1 https://github.com/The-Hacker-Recipes/The-Hacker-Recipes
git clone --depth 1 https://github.com/GTFOBins/GTFOBins.github.io
git clone --depth 1 https://github.com/LOLBAS-Project/LOLBAS

Upload each repo as its own knowledge base in Open WebUI under Workspace → Knowledge. Do not dump everything into one knowledge base. The retrieval quality degrades significantly when the vector database is polluted with thousands of unrelated documents. Targeted knowledge bases with focused content produce better context retrieval.

Additionally, create a personal methodology knowledge base. This is the most valuable one. Upload:

  • Your own pentesting notes from real engagements
  • Writeups you’ve done on CTF machines
  • Methodology docs with explicit correct vs incorrect tool usage
  • Command sequences that actually worked

Your personal notes outperform generic documentation because they contain the exact patterns you use in practice, written in your own language, with the context of why a particular approach works.

How to Attach Knowledge Bases

Don’t attach all knowledge bases to every conversation. Match the KB to the task:

  • Active Directory box: HackTricks + The-Hacker-Recipes + Your Notes
  • Linux privilege escalation: GTFOBins + HackTricks + Your Notes
  • Windows standalone: LOLBAS + HackTricks + Your Notes
  • Web application: PayloadsAllTheThings + HackTricks + Your Notes

Your personal notes stay attached to everything. They’re the constant. The repos rotate based on what you’re working on.

In Open WebUI, click the + icon in the chat input and select the knowledge bases you want for that conversation. They persist for the session.

Custom Tooling

Open WebUI supports Python tools that the model can call during a conversation. These are functions the model invokes autonomously when it determines they’re relevant — similar to function calling in the OpenAI API.

I wrote an offensive security toolkit that covers the repetitive tasks I was doing manually mid-conversation:

  • Payload encoding/decoding — base64, URL, double URL, hex, HTML entities, Unicode escapes, PowerShell UTF-16LE base64 for -EncodedCommand
  • Hash identification — Detects hash type by length and format, returns the hashcat mode number and a ready-to-run crack command. Handles NTLM, Kerberoast, ASREPRoast, bcrypt, Linux shadow, NetNTLMv2.
  • Reverse shell generation — 11 languages with matching listener commands and shell upgrade instructions
  • Nmap XML parsing — Paste raw -oX output and get a clean summary
  • CIDR calculation — Network math from CIDR notation
  • Wordlist paths — Returns correct SecLists paths so the model stops guessing /usr/share/wordlists/whatever.txt
  • Quick reference cheat sheets — File transfers, port forwarding, Kerberos attacks, BloodHound, MSSQL, LDAP, SMB relay, shell upgrades

The tool is a single Python file. Import it under Workspace → Tools → + → paste the code → Save. Then assign it to your model under Workspace → Models → edit → Tools section.

All standard library, no external dependencies. The model calls these functions transparently during conversation when it recognizes the need. Ask it to identify a hash and it calls identify_hash(). Ask for a reverse shell and it calls generate_reverse_shell(). No manual invocation required.
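For reference, the format is minimal: a `Tools` class whose typed, docstringed methods Open WebUI exposes to the model. A sketch with two hypothetical methods in the same spirit as the toolkit above (not the actual file):

```python
import base64
import ipaddress

class Tools:
    def encode_powershell(self, command: str) -> str:
        """Encode a command for PowerShell's -EncodedCommand (UTF-16LE base64)."""
        return base64.b64encode(command.encode("utf-16-le")).decode()

    def cidr_info(self, cidr: str) -> str:
        """Return network, broadcast, and usable host count for a CIDR block."""
        net = ipaddress.ip_network(cidr, strict=False)
        return (f"network={net.network_address} "
                f"broadcast={net.broadcast_address} "
                f"hosts={net.num_addresses - 2}")  # assumes a /30 or larger

t = Tools()
print(t.encode_powershell("whoami"))   # dwBoAG8AYQBtAGkA
print(t.cidr_info("192.168.1.37/24"))
```

The docstrings and type hints matter: they are what the model reads when deciding whether and how to call a function.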

A word on security: Open WebUI tools execute Python on the host system. Only install tools you’ve read and understand. Don’t import random community tools without reviewing the source. The convenience isn’t worth the risk if someone’s tool phones home or drops a shell.

System Prompt Engineering

The system prompt defines the model’s behavior for every conversation. Without one, local models default to their training persona — helpful, verbose, full of disclaimers. That’s useless for pentesting work.

I wrote a system prompt that covers:

Non-negotiable tool rules — The specific mistakes I caught during model testing are explicitly corrected in the system prompt. Two-phase nmap is mandatory. ffuf directory fuzzing and vhost fuzzing are defined separately with exact syntax. dig AXFR syntax is specified. The model can’t fall back to training data patterns when the system prompt overrides them.

Response style — Short, technical, no disclaimers, no ethical warnings. If I’m asking about AXFR zone transfers, I don’t need a paragraph about responsible disclosure. I already know.

Knowledge base priority — The system prompt instructs the model to prioritize retrieved context over training data. This is the key behavioral instruction that makes RAG actually work for domain-specific tasks. Without it, models treat retrieved context as supplementary rather than authoritative.

Methodology structure — Enumeration before exploitation. Check for quick wins. Use structured output formats. Parse, don’t regex.
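“Parse, don’t regex” is easy to honor with nmap because -oX emits XML the standard library can walk directly. A sketch against a fabricated sample (hypothetical host, not real scan output):

```python
import xml.etree.ElementTree as ET

# Fabricated, truncated -oX output for illustration
SAMPLE = """<nmaprun><host><address addr="10.10.10.5" addrtype="ipv4"/>
<ports><port protocol="tcp" portid="22"><state state="open"/>
<service name="ssh" product="OpenSSH"/></port>
<port protocol="tcp" portid="80"><state state="open"/>
<service name="http"/></port></ports></host></nmaprun>"""

def summarize(xml_text: str) -> list[str]:
    """Summarize open ports from nmap -oX output via the parser, not regex."""
    root = ET.fromstring(xml_text)
    lines = []
    for host in root.iter("host"):
        addr = host.find("address").get("addr")
        for port in host.iter("port"):
            if port.find("state").get("state") == "open":
                svc = port.find("service")
                name = svc.get("name") if svc is not None else "?"
                lines.append(f"{addr} {port.get('portid')}/{port.get('protocol')} {name}")
    return lines

print("\n".join(summarize(SAMPLE)))
```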

Upload the system prompt as a knowledge base document and also set it directly in the model configuration under Workspace → Models → edit → System Prompt. Belt and suspenders.

What Works and What Doesn’t

Works well:

  • Generating boilerplate scripts for common tasks (enumeration wrappers, parsing scripts, automation)
  • Quick reference lookups when the knowledge base contains the answer
  • Encoding/decoding payloads, identifying hashes, basic calculations
  • Drafting writeup sections from notes
  • Explaining unfamiliar concepts or tools when you point it at the right KB

Doesn’t work well:

  • Novel exploitation without explicit guidance
  • Multi-step attack chains that require reasoning across phases
  • Anything where training data contradicts correct methodology (ffuf being the canonical example)
  • Replacing the operator’s judgment on what to try next

The honest assessment: local models are a force multiplier for the tedious parts of pentesting. They are not a replacement for knowing what you’re doing. The operator who understands methodology and uses AI to accelerate the boring parts will outperform both the operator who ignores AI and the AI that tries to operate without a human.

The Full Picture

Here’s what the complete setup looks like:

Fedora Host
├── Ollama (systemd service, GPU inference)
│   ├── qwen3-coder:30b (primary coding)
│   ├── qwen3:32b (general/conversation)
│   └── qwen3-coder-next (heavy lifting)
├── Open WebUI (Podman container, port 8080)
│   ├── Knowledge Bases
│   │   ├── HackTricks
│   │   ├── PayloadsAllTheThings
│   │   ├── The-Hacker-Recipes
│   │   ├── GTFOBins
│   │   ├── LOLBAS
│   │   └── Personal Notes + Methodology
│   ├── Tools
│   │   └── Offensive Security Toolkit
│   └── Models
│       └── Pentest Assistant (system prompt + tools + KBs)
└── Kali VM (KVM/QEMU)
    ├── Connected to Ollama via NAT (192.168.122.1:11434)
    └── All pentesting tools installed

The Kali VM runs the actual engagements. It connects to Ollama on the Fedora host over the KVM NAT bridge. Any AI-assisted pentesting framework (CAI, Zen-AI-Pentest, or whatever comes next) runs inside the VM where it can execute tools safely without touching the host.

What’s Next

I’m planning a follow-up post comparing local model output against cloud models on a live HackTheBox machine. Same box, same methodology, different AI backing. Concrete side-by-side results.

In the meantime: clone the repos while they’re freely available under permissive licenses. Download the model weights while they’re openly distributed. Build your knowledge bases from your own work. The regulatory landscape is shifting and the window for freely available offensive security tooling won’t stay open forever.

Own your tools.

This post is licensed under CC BY 4.0 by the author.