LLMtary (Elementary) — AI-Powered Penetration Testing Platform

Autonomous, LLM-driven penetration testing for security professionals. From passive recon to active exploit validation and professional report generation — all running locally, all under your control.

⬇️ Download

No Flutter or developer tools required — grab the installer for your platform directly from the Releases page:

Platform	Package
🪟 Windows	`.exe` installer
🍎 macOS (Apple Silicon)	`.dmg` installer
🐧 Linux — Debian / Ubuntu	`.deb` package
🐧 Linux — RHEL / Fedora / CentOS / Alma	`.rpm` package
🐧 Linux — Arch	`.pkg.tar.zst` package
🐧 Linux — openSUSE	`.rpm` package

Building from source? See Setup at the bottom.

What Is LLMtary?

LLMtary is an open-source Flutter desktop application that brings large language model intelligence to every phase of a penetration test. Enter a target — an IP, hostname, FQDN, or CIDR range — and LLMtary autonomously runs recon, identifies vulnerabilities across dozens of attack categories, then executes and validates each finding — all without leaving your machine.

Whether you're running a local model on your own GPU with zero data leaving your network, or leveraging a cloud frontier model for maximum accuracy, LLMtary provides a structured, agentic testing loop that mirrors how a real engagement works: passive recon → service fingerprinting → vulnerability discovery → targeted exploitation → post-exploitation → professional reporting.

Why LLMtary?

Local-first AI pentesting — run entirely on-premise with Ollama or LM Studio; no cloud required
Structured agentic loop — not just suggestion generation; LLMtary actually runs commands, reads output, and iterates
Multi-phase enrichment — Phase 1 findings feed into Phase 2 prompts, producing sharper, more targeted vulnerabilities
Production-quality output — CVSS metadata, business risk scoring, BloodHound-style AD attack chains, and professional HTML/Markdown reports
Cross-platform — native desktop builds for Linux, macOS, and Windows

Screenshots

Scope / Recon
Scope and Recon Screen

Proof / Exploit
Proof and Exploit Screen

Findings Summary
Findings Summary Report

Finding Detail
Finding Detail Report

Key Features

Autonomous Reconnaissance Engine

LLMtary's built-in ReconService drives initial data collection through an LLM-guided loop — running port scans, service banner grabs, web fingerprinting, DNS enumeration, WAF detection, OS detection, and certificate extraction — then merging all findings into structured JSON that feeds the analysis pipeline. No manual data entry required.

Two-Phase Vulnerability Analysis Pipeline

Analysis mirrors the structure of a professional engagement, with Phase 1 results enriching every Phase 2 prompt:

Phase 1 — Passive Recon & Service Fingerprinting (fast, always runs):

CVE/version analysis — strict product+version matching against known vulnerability ranges
Network service analysis — SMB, SSH, FTP, databases, SNMP/management protocols, WinRM/WMI, IPv6
DNS/OSINT and email security — zone transfers, subdomain recon, SPF/DMARC gaps (external targets)

Phase 2 — Full Vulnerability Analysis (enriched with Phase 1 context):

Web application — four focused passes: core injection/CMS/auth weaknesses, API/CORS/JWT/GraphQL/OAuth, business logic/SSTI/request smuggling/security headers, secrets and configuration exposure
Active Directory — three focused passes: credential attacks (Kerberoasting, LDAP null bind, password spraying), privilege escalation (ADCS, ACL abuse, delegation attacks), lateral movement (relay attacks, Pass-the-Hash/Ticket, WinRM)
SSL/TLS — cipher strength, protocol versions, certificate validity, known TLS vulnerability classes
Privilege escalation — OS-level paths: sudo misconfiguration, SUID binaries, service permissions, scheduled tasks, registry abuse, token impersonation
Technology deep-dives — WordPress, Jenkins, Atlassian, Tomcat, Exchange, Elasticsearch, VMware, GitLab, Citrix, Drupal, MSSQL, ADCS, WAF bypass — each fires only when indicators for that technology are present

Post-analysis:

Deduplication, evidence-quote validation, and severity/confidence/business-risk sort
If ≥2 HIGH/CRITICAL Active Directory findings are found, a BloodHound-style attack chain reasoning pass fires to identify multi-step paths to Domain Administrator

Agentic Exploit Testing Loop

Each selected finding goes through a full autonomous validation loop:

The LLM plans its approach, executes real shell commands, reads command output, evaluates results, and adapts — progressing through RECON → VERIFICATION → EXPLOITATION → CONFIRMATION phases
Configurable iteration caps per finding type (CVE-backed vs. speculative)
Duplicate command and semantic approach exhaustion detection prevent infinite loops
OPSEC-aware prompting — every iteration includes guidance on request pacing, scan noise reduction, tool signature minimization, and test impact limits
Rate-limit detection — detects 429, "too many requests", "you have been blocked" signals in command output; notifies the LLM and adjusts its next approach automatically
Two-tier pre-execution command validation:
- Tier 1 (static, zero cost) — blocks non-script file execution, pipe-to-shell patterns (curl url | bash), and auto-corrects Windows paths in WSL contexts
- Tier 2 (LLM-assisted, cached) — validates correct flag usage for high-risk tools (nmap, sqlmap, gobuster, hydra, nuclei, metasploit, etc.); first call per tool costs one LLM round-trip, all subsequent calls are free; 20-second timeout

Attack Chain Reasoning

When a vulnerability is confirmed, the executor identifies whether it enables or simplifies testing another vulnerability and notes chain opportunities in the finding's proof
Confirmed artifacts (RCE, SQLi, auth bypass, LFI, SSRF) are fed forward as context — subsequent vulnerabilities know what access has already been achieved
After all loops complete, if ≥2 findings are confirmed, a post-execution chain reasoning pass fires — identifying how confirmed findings combine into higher-impact multi-step attack paths, added as AttackChain findings

Post-Exploitation Enumeration

When a confirmed finding grants high-value access (RCE, command injection, authentication bypass, default credentials), a Post-Exploitation Enumeration sequence is automatically queued — enumerating users, groups, network interfaces, running services, readable credential files, and privilege escalation paths, documenting the full blast radius of the access achieved.

Session-Wide Credential Bank

Discovered credentials are collected into a session-wide bank and automatically included as context when testing subsequent vulnerabilities on the same target — enabling real-world credential reuse and chained attack paths
Deduplicates by service/host/username fingerprint
Verified credentials (confirmed in command output) are persisted to SQLite; inferred credentials are labeled unverified in prompts
When verified credentials are discovered, an authenticated re-analysis pass automatically runs for the affected target to surface additional findings that require credentials

Multi-Scope Target Classification

Automatically classifies targets as internal (RFC-1918, LAN hosts) or external (internet-facing FQDNs, public IPs)
Fires different prompt sets per scope — external targets get SSL/TLS, DNS/OSINT, CDN/WAF-aware analysis; internal targets get network service and AD-focused analysis
Prevents cross-scope noise — no SMB findings on external hosts, no subdomain takeover findings on internal hosts

Command Approval Mode

Optional mode that pauses before every command execution and shows it to the user for review
Options: Allow Once, Always Allow (adds to whitelist), or Deny (LLM is notified to try a different approach)
The toggle takes effect immediately — enabling or disabling mid-execution applies to the very next command

Project Management & Reporting

Multiple named projects, each with multiple targets
All findings, command logs, and credentials persisted in SQLite per project/target
Export and import projects as encrypted .penex bundles (AES encryption, password-protected)
HTML report — professional formatted report with cover page, executive summary, severity breakdown, full findings with CVSS metadata, and credential table
Markdown report — same content in portable format
CSV export — flat findings list for spreadsheets and other tools
AI-assisted generation for executive summary, methodology, risk rating model, and conclusion sections

Cross-Platform Desktop App

Native builds for Linux, macOS, and Windows
On Windows, detects WSL availability and uses bash via WSL; falls back to PowerShell/cmd with Windows-native commands when WSL is absent
OS detection informs the LLM's command choices throughout the testing loop

Safety Controls

LLMtary includes multiple layers of protection against accidental or destructive execution:

Control	Description
Dangerous command blocklist	Hard-blocks destructive patterns: `rm -rf /`, `format`, `mkfs`, `dd if=`, `shutdown`, `reboot`, fork bombs, and similar — regardless of LLM output
Non-script file execution detection	Blocks attempts to execute non-script files as shell scripts (e.g. passing a wordlist as a bash argument)
Pipe-to-shell blocking	Blocks `cat file \\| bash`, `curl url \\| bash`, and similar patterns
Command approval mode	When enabled, every command is shown to the user before execution
Configurable command whitelist	Commands added to the whitelist always execute without prompting
Per-tool setup validation	Tools requiring initialization (e.g. `msfdb init`) are checked before use; skipped for the session if unavailable
Connection timeout protection	When a port connection times out, the executor is instructed not to retry that port
Sensitive output sanitization	Command output is scrubbed of credentials, API keys, and tokens before storage and display
Scope enforcement	Findings are validated against configured scope and exclusion lists

Supported AI Providers

Provider	Type	Notes
Ollama	Local	Default: `http://localhost:11434` — fully offline, no data leaves your network
LM Studio	Local	Default: `http://localhost:1234/v1` — fully offline, GPU-accelerated
Claude (Anthropic)	Cloud	API key required
ChatGPT (OpenAI)	Cloud	API key required
Gemini (Google)	Cloud	API key required
OpenRouter	Cloud	API key required; access to many models via one API
Custom	Any	Configurable base URL and API key

Provider settings are saved per-provider — switching providers restores that provider's previously saved API key, model, and base URL.

LLM Requirements

LLMtary's prompts are large and require strong reasoning. The exploit system prompt alone exceeds 6,000 tokens and the recon system prompt exceeds 5,600 tokens — a 4K context window is physically too small.

Local Model Tiers (Ollama / LM Studio)

Tier	Size	JSON Reliability	VRAM (Q4)	Recommendation
Not usable	7–8B	~50–60%	~6 GB	Hallucinates flags, can't chain reasoning
Bare minimum	14B	~70–80%	~12 GB	Handles simple targets; expect some retries
Recommended	32B	~85–90%	~24 GB	Solid multi-step reasoning and exploit chains
Professional	70B+	~95%+	~48 GB	Best local results; dual GPU setups work well

Cloud Providers

Cloud providers (Claude, ChatGPT, Gemini, OpenRouter) handle all compute remotely — any machine that can run the Flutter desktop app is sufficient.

Provider	Recommended Models
Anthropic	Claude Opus, Claude Sonnet
OpenAI	GPT-4o, GPT-4 Turbo
Google	Gemini 1.5 Pro, Gemini 2.5 Pro
OpenRouter	Any of the above via unified API

Cloud providers offer the best results due to large context windows (128K–200K tokens) and frontier-class reasoning — recommended when local hardware is limited.

Setup

Install from Pre-Built Package (Recommended)

Download the installer for your platform from the Releases page. No Flutter or developer tools required.

Platform	Install
🪟 Windows	Run the `.exe` installer
🍎 macOS	Open the `.dmg` and drag to Applications
🐧 Debian / Ubuntu	`sudo dpkg -i llmtary_*.deb`
🐧 RHEL / Fedora / CentOS / Alma	`sudo rpm -i LLMtary-*-RH-Fed-Cent_Alma.x86_64.rpm`
🐧 Arch	`sudo pacman -U LLMtary-*-x86_64.pkg.tar.zst`
🐧 openSUSE	`sudo rpm -i LLMtary-*-opensuse.x86_64.rpm`

Build from Source (Developers)

Requires Flutter SDK (stable channel) with desktop support enabled.

flutter config --enable-linux-desktop   # or: macos-desktop / windows-desktop
git clone https://github.com/chetstriker/LLMtary.git
cd LLMtary
flutter pub get
flutter run -d linux     # or: macos, windows

Release build:

flutter build linux      # or: macos, windows

Recommended Pentest Tools

LLMtary shells out to whatever tools are installed on your system. The more tools available, the more testing approaches the LLM can take:

nmap — port scanning and service version detection
nuclei — template-based vulnerability scanning
curl / wget — HTTP request crafting and testing
dig / host — DNS enumeration
smbclient / enum4linux — SMB enumeration (internal)
searchsploit — local exploit database search
sqlmap — SQL injection testing
gobuster / ffuf / dirb — directory and path enumeration
hydra / medusa — credential brute-forcing
testssl.sh / sslscan — TLS configuration analysis
nikto — web server vulnerability scanning
metasploit — exploit framework (requires msfdb init)

If a tool is missing, the LLM will attempt to install it automatically. If installation fails, it adapts and tries alternative approaches.

Usage

1. Configure AI Settings

Click the settings icon (top right). Select your AI provider, enter your API key and model name. A Test button validates the configuration. Settings are saved per-provider.

2. Create or Select a Project

Projects organize your work. Create a named project to get started.

3. Define Your Scope

On the SCOPE / RECON tab, enter your targets in the In-Scope Targets field — accepts IPs, hostnames, FQDNs, and CIDR ranges, comma or newline separated. You can also import a target list from a file (one address per line). Optionally add exclusions and Rules of Engagement, then press GO to start autonomous recon.

4. Analyze

Navigate to the VULN / HUNT tab and click Analyze. Multiple LLM passes run in parallel. Findings appear in the vulnerability table as they arrive, sorted by severity.

5. Select and Execute

Navigate to the PROOF / EXPLOIT tab. Check the vulnerabilities you want to test and click Execute Selected. Status icons update in real time:

[PENDING] — not yet tested
[CONFIRMED] — exploitation succeeded with proof
[NOT VULNERABLE] — definitively ruled out
[UNDETERMINED] — tested but inconclusive

6. Review and Export

Navigate to the RESULT / REPORT tab to generate your report as HTML, Markdown, or CSV.

Scan Data JSON Format

{
  "device": {
    "ip_address": "192.168.1.100",
    "name": "webserver01",
    "os": "Linux",
    "os_version": "Ubuntu 22.04"
  },
  "open_ports": [
    {
      "port": 80,
      "protocol": "tcp",
      "state": "open",
      "service": "http",
      "product": "Apache httpd",
      "version": "2.4.49",
      "extra_info": "(Ubuntu)",
      "cpe": "cpe:/a:apache:http_server:2.4.49"
    }
  ],
  "web_findings": [
    {
      "url": "http://192.168.1.100:80",
      "status": 200,
      "technologies": ["WordPress 6.2", "PHP 8.1"]
    }
  ],
  "dns_findings": [
    { "record_type": "MX", "name": "example.com", "value": "10 mail.example.com" }
  ],
  "waf_findings": [
    { "waf": "Cloudflare", "detected_by": "cf-ray header" }
  ]
}

All fields beyond device and open_ports are optional but significantly improve analysis quality.

Findings Schema

Each vulnerability finding includes:

Field	Description
`problem`	Short descriptive name
`cve`	CVE ID if applicable
`description`	Attack technique, affected path/parameter, example payload, and what an attacker gains
`severity`	`CRITICAL` / `HIGH` / `MEDIUM` / `LOW`
`confidence`	`HIGH` / `MEDIUM` / `LOW`
`evidence`	Exact data from scan output supporting the finding
`recommendation`	Remediation guidance
`vulnerabilityType`	Attack class (RCE, SQLi, XSS, LFI, Auth Bypass, etc.)
`businessRisk`	Real-world business impact — data breach, ransomware pivot, regulatory exposure, operational disruption
CVSS fields	`attackVector`, `attackComplexity`, `privilegesRequired`, `userInteraction`, `confidentialityImpact`, `integrityImpact`, `availabilityImpact`

Architecture

lib/
├── constants/
│   └── app_constants.dart              # Color palette, settings keys, config defaults
├── database/
│   └── database_helper.dart            # SQLite persistence (projects, targets, vulns, logs)
├── models/
│   ├── vulnerability.dart              # Finding model with CVSS fields and status
│   ├── command_log.dart                # Shell command execution record
│   ├── credential.dart                 # Discovered credential
│   ├── target.dart / project.dart      # Target and project containers
│   ├── llm_settings.dart               # AI provider configuration
│   └── llm_provider.dart               # Provider enum with metadata
├── screens/
│   ├── home_screen.dart                # Project selection and management
│   ├── main_screen.dart                # Primary workspace with tab navigation
│   ├── settings_screen.dart            # AI provider and execution settings
│   └── tabs/
│       ├── scope_recon_tab.dart        # Autonomous recon and target management
│       ├── vuln_hunt_tab.dart          # Vulnerability analysis pipeline
│       ├── proof_exploit_tab.dart      # Exploit execution and results
│       └── result_report_tab.dart      # Report generation
├── services/
│   ├── vulnerability_analyzer.dart     # Multi-prompt parallel analysis pipeline
│   ├── exploit_executor.dart           # Autonomous exploit testing loop (~91KB)
│   ├── recon_service.dart              # LLM-guided recon data collection
│   ├── prompt_templates.dart           # All analysis and execution prompts (~70KB)
│   ├── command_executor.dart           # Shell execution with safety controls (~48KB)
│   ├── llm_service.dart                # LLM API client (all 6 providers)
│   ├── report_generator.dart           # HTML/Markdown/CSV report generation
│   ├── report_content_service.dart     # AI-assisted report section generation
│   ├── project_porter.dart             # Encrypted project export/import
│   ├── storage_service.dart            # File system path management
│   ├── tool_manager.dart               # Tool availability detection and caching
│   ├── background_process_manager.dart # Long-running listeners (Responder, ntlmrelayx)
│   └── environment_discovery.dart      # OS/environment detection
├── utils/
│   ├── device_utils.dart               # Target IP extraction and scope classification
│   ├── command_utils.dart              # Command history, deduplication, approach tracking
│   ├── command_validator.dart          # Two-tier pre-execution command validation
│   ├── cvss_calculator.dart            # CVSS score computation
│   ├── json_parser.dart                # Robust JSON extraction from LLM responses
│   ├── output_sanitizer.dart           # Sensitive data redaction
│   └── app_exceptions.dart             # Typed exception hierarchy
└── widgets/
    ├── app_state.dart                  # Global ChangeNotifier state (Provider)
    ├── vulnerability_table.dart        # Sortable findings table with status indicators
    ├── command_log_panel.dart          # Real-time command output viewer
    ├── prompt_log_panel.dart           # LLM prompt/response inspector
    ├── debug_log_panel.dart            # Internal debug event stream
    ├── command_approval_widget.dart    # Approval mode command review UI
    └── results_modal.dart              # Post-execution findings summary

Settings Reference

Setting	Description
AI Provider	Which LLM backend to use
Base URL	API endpoint (local providers and Custom)
API Key	Authentication for cloud providers
Model	Model identifier string
Temperature	LLM sampling temperature (default 0.22 — lower = more deterministic)
Max Tokens	Maximum tokens per LLM response (default 4096)
Timeout	Seconds before an LLM request times out (default 240s)
Max Iterations (with CVE)	Exploitation loop cap for findings with a known CVE ID
Max Iterations (no CVE)	Exploitation loop cap for generic findings without a CVE ID
Require Approval	Pause and prompt before executing each shell command
Command Whitelist	Commands that bypass the approval prompt
Storage Path	Base directory for scan output file storage

Disclaimer

Authorized Use Only

LLMtary executes real shell commands on your local machine against real targets. It is intended exclusively for use by security professionals in authorized penetration testing engagements, security research, and CTF competitions.

You are solely responsible for ensuring you have explicit written authorization to test any target. Unauthorized scanning, enumeration, or exploitation of systems is illegal in most jurisdictions under computer fraud and unauthorized access laws, regardless of the tools used. The authors and contributors of LLMtary accept no liability for misuse of this software.

Built-In Protections Are Not a Substitute for Judgment

LLMtary includes multiple layers of built-in safety controls — a dangerous command blocklist, pipe-to-shell blocking, non-script execution detection, connection timeout protection, and an optional command approval mode that requires you to review and authorize each command before it runs. These controls are designed to reduce the risk of accidents during authorized testing.

However, these protections are safeguards, not guarantees. They are intended to catch common accidental or destructive patterns — they are not a substitute for professional judgment, proper engagement scoping, and adherence to your rules of engagement. The command approval mode can be disabled; if you choose to disable it, you are accepting full responsibility for every command the system executes. The authors and contributors of LLMtary accept no liability for any damage, data loss, service disruption, or other harm resulting from the execution of commands generated or run by this software.

API Cost Responsibility

LLMtary supports both locally hosted models (Ollama, LM Studio — free, offline, no usage costs) and cloud AI providers (Anthropic Claude, OpenAI ChatGPT, Google Gemini, OpenRouter). When using cloud providers, LLMtary makes API calls that are billed to your account by the respective provider.

LLMtary's analysis pipeline runs multiple LLM passes in parallel, and the exploit testing loop can run many iterations per vulnerability. Token consumption can be significant, particularly on large target sets, high iteration caps, or with large language models. You are solely responsible for all API usage costs incurred through your use of LLMtary. Monitor your usage and billing dashboards with your cloud provider. The authors and contributors of LLMtary accept no responsibility for unexpected charges resulting from the use of this software.

License

LLMtary is released under the MIT License. See LICENSE.txt for the full license text.

Contributing

Contributions, bug reports, and feature requests are welcome. Please open an issue or pull request on GitHub.

LLMtary

About LLMtary

Platforms

Languages

Links

README.md