Testing Local LLMs for Offensive Security — Can They Actually Pentest?

Posted Feb 26, 2026

By Jesse Kennedy

7 min read

The Question

I’ve been building out a local AI stack — Ollama, Open WebUI, a 4090 doing inference — and I wanted to know: can these models actually help with pentesting? Not the marketing benchmarks. Not “it scored 85% on CyberSecEval.” Can it write a script that doesn’t embarrass itself?

I gave multiple models the same prompt and compared the output. Then I added a RAG knowledge base with real methodology docs and tested again. Here’s what happened.

The Challenge

Write a Python script that uses nmap, ffuf, and dig to scan a target IP for open ports, discover subdomains, and attempt DNS zone transfers.

Simple ask. Three tools every pentester uses daily. The correct approach:

nmap: Two-phase scan — fast SYN scan on all ports first, then targeted service detection on discovered ports
ffuf: Virtual host fuzzing with the Host header (-H "Host: FUZZ.target.htb"), NOT directory fuzzing (/FUZZ)
dig: Zone transfer with proper syntax — dig axfr domain @nameserver

If you’ve used these tools even once, you know the difference between ffuf -u http://target/FUZZ (directory fuzzing) and ffuf -H "Host: FUZZ.target.htb" (subdomain discovery). They are fundamentally different operations. This became the litmus test.

Round 1: No RAG, No Help

qwen2.5-coder:32b

The first model produced a 200+ line script with ThreadPoolExecutor imported but never used, regex parsing of nmap terminal output instead of using -oX for structured XML, and a reverse DNS guessing approach for AXFR that would never work in practice.

nmap: Ran -sV -sC -O -p- all in one pass with --min-rate 1000 and --max-retries 1. This is painfully slow and noisy. Nobody who runs nmap regularly does this.

ffuf: ffuf -u http://target/FUZZ -w subdomains-top-10000.txt — directory fuzzing, not subdomain fuzzing. The prompt said “subdomains” and the model gave me directory brute forcing. It also generated a 15-entry fallback wordlist hardcoded in the script.

dig: Tried to reverse-lookup the IP, then chop the hostname into a domain by grabbing the first two labels. Broken approach that would mangle most results.

deepseek-coder-v2:16b

Shorter, simpler, less pretentious about it. 39 lines, no fake imports.

nmap: nmap -sV --script=vuln — at least --script=vuln is a reasonable choice. But no -p- means only top 1000 ports.

ffuf: Same mistake. ffuf -u http://target/FUZZ -w subdomains.txt — directory fuzzing with a hardcoded wordlist path and no fallback. Crashes if the file doesn’t exist.

dig: dig axfr @ domain — broken syntax. The @ needs the nameserver after it: dig axfr @ns1.target.com target.com. As written, this queries the system resolver for a zone transfer, which will never work.

The Pattern

Every model made the same mistakes because they’ve seen these tool names in training data but never actually run them. The ffuf directory vs subdomain confusion is a dead giveaway — no one who’s used ffuf in a real engagement would confuse those two use cases.

Round 2: Adding a RAG Knowledge Base

Open WebUI supports RAG — upload documents and it chunks them into a vector database, pulling relevant context when you ask questions. I built a methodology doc covering:

Correct ffuf syntax for directory fuzzing vs vhost fuzzing (with explicit callouts like “THIS IS DIFFERENT FROM DIRECTORY FUZZING”)
Two-phase nmap methodology
Proper dig AXFR syntax with wrong examples labeled “DO NOT USE”
Enumeration methodology, wordlist paths, common patterns

I also uploaded my entire pentesting notebook from Obsidian — personal notes from HTB machines, real command sequences that worked, methodology docs written for myself.

Then I ran the same prompt against the models again.

Results After RAG

nmap: Both models fixed their approach. nmap -sS -p- --min-rate 5000 -Pn -n — straight from the knowledge base. One model even implemented the two-phase approach, parsing the port scan output to feed into a targeted -sCV scan. Significant improvement.

dig: Both got the syntax right. dig axfr domain @target_ip — correct order, nameserver specified. The “DO NOT USE” examples in the knowledge base clearly worked.

ffuf: Still wrong. Both models still generated http://target/FUZZ — directory fuzzing, not vhost fuzzing. Despite the knowledge base explicitly containing the -H "Host: FUZZ.target.htb" syntax with a callout in all caps.

The models’ training data bias on ffuf was strong enough to override the retrieved context. They “know” what ffuf looks like from their training and they defaulted to that pattern even when the RAG context told them otherwise.

Round 3: The Contender

Then I tested qwen3-coder-next (80B total, 3B active MoE) — the heaviest model I tested, spilling out of VRAM into system RAM.

  
# Step 5: Optional ffuf HTTP host fuzzing (for virtual hosts)
if ports_80_443 and domain:
    cmd = (
        f"ffuf -w /usr/share/seclists/Discovery/DNS/subdomains-top1million-5000.txt "
        f"-H 'Host: FUZZ.{domain}' "
        f"-u http://{TARGET}/ -fc 400,404,503 "
        f"-o {OUTPUT_DIR}/ffuf_vhosts.json -of json"
    )

-H "Host: FUZZ.domain" — correct vhost fuzzing. First and only model to nail it.

The rest of the script was also the most mature: two-phase nmap, smart domain vs IP detection to skip AXFR on raw IPs, checks if HTTP ports are open before attempting ffuf, structured output directory, JSON output for downstream parsing, and a subdomain resolution pass with dig before the ffuf scan.

Scorecard

Technique	qwen2.5-coder:32b	deepseek-v2:16b	qwen3-coder:30b (RAG)	qwen3-coder-next (RAG)
nmap two-phase	❌ One pass	❌ One pass	✅ Two phase	✅ Two phase
nmap flags	❌ Slow combo	⚠️ No `-p-`	✅ Fast SYN	✅ Fast SYN
ffuf subdomain	❌ Dir fuzz	❌ Dir fuzz	❌ Dir fuzz	✅ Vhost fuzz
dig AXFR syntax	❌ Broken	❌ Broken	✅ Correct	✅ Correct
Output handling	❌ Regex stdout	❌ Print only	⚠️ Print only	✅ JSON + files
Error handling	⚠️ Generic	❌ sys.exit(1)	⚠️ Generic	✅ Smart checks

Takeaways

RAG helps, but not enough. Adding a methodology knowledge base fixed nmap and dig across the board. Those were straightforward pattern matches — the model saw the correct syntax in the retrieved context and used it. But ffuf’s training data bias was too strong for every model except the largest one to override.

Model size matters for domain-specific tasks. The smaller models (16b-32b) consistently defaulted to training data patterns even when RAG context contradicted them. qwen3-coder-next (80B total) was the only model that actually read and applied the retrieved context for the hardest test case.

Local models are good at boilerplate, bad at methodology. Every model could write syntactically valid Python with subprocess calls and error handling. None of them understood pentesting methodology without help. The difference between “code that runs” and “code that works” is domain knowledge these models don’t have.

Your own notes are the best training data. The RAG improvements came from uploading real methodology docs with explicit right/wrong examples. Generic documentation helps. Personal notes from actual engagements help more, because they contain the exact command sequences that worked in practice.

The hybrid approach wins. Use local models for quick iterations, boilerplate, and private work. Use a stronger cloud model (or your own brain) for the thinking. The models aren’t replacing pentesters — they’re replacing the boring parts of pentesting, and only if you babysit them.

Setup

For anyone wanting to replicate this:

Hardware: RTX 4090 (24GB VRAM), 64GB RAM, Ryzen 9 7950X
Inference: Ollama with CUDA
UI: Open WebUI via Podman
RAG: Open WebUI Knowledge Base with methodology docs + personal notes
Models tested: qwen2.5-coder:32b, deepseek-coder-v2:16b, qwen3-coder:30b, qwen3-coder-next

The methodology knowledge base and the prompt are the same across all tests. The only variables were the model and whether RAG was enabled.

Final Thought

No model wrote a script I’d actually run on an engagement without modification. The best output (qwen3-coder-next with RAG) got close, but it still had minor bugs that would crash at runtime. These tools are assistants, not replacements. The pentester who understands the methodology and uses AI to accelerate the tedious parts will outperform both the pentester who ignores AI and the AI that tries to pentest without a human.

Build the knowledge base. Download the weights. Own your tools. The landscape is shifting fast and you don’t want to be dependent on someone else’s API when it does.

AI, Offensive Security

This post is licensed under CC BY 4.0 by the author.