R&D and First Field Test of a Pentest AI Agent

Introduction

Since 2025/5, Layer8 has been working on R&D for a pentest AI agent that fully automates web application penetration testing. In 2026/1, we ran its first field test and confirmed that it can find vulnerabilities in real-world environments.

The technology we are developing is intended to validate vulnerabilities only in authorized environments and help service providers improve their defenses. It is not intended for abuse or for attacking third-party systems without permission.

Our pentest AI agent targets black-box and gray-box testing of web applications. The current inputs and execution scope are as follows.

Item	Description
Input	Target URL. For gray-box testing, credentials are also provided.
Execution scope	Initial reconnaissance, vulnerability confirmation, report generation
Output	Vulnerability report
Out of scope	A bot that automatically submits vulnerability reports, a general-purpose internet crawler, white-box testing, network-level scanning

This post covers the initial benchmark evaluation, the first field test results, and the main design decisions behind the system.

Baseline Evaluation on PortSwigger Academy Lab

PortSwigger Web Security Academy is a free web security training platform provided by PortSwigger, the company behind Burp Suite. It provides lab environments for common vulnerability classes such as SQLi, XSS, and SSRF, and is widely used as a yardstick for web pentest capability across both humans and AI agents.

In the early stage of development, we used PortSwigger Academy Lab to improve core behaviors such as exploration, exploitation, and verification. With a one-hour time limit per lab, the agent solved 84% of the labs as of 2025/12/2.

PortSwigger Academy Lab Score

Based on this result, we decided it was time to test whether the agent could find vulnerabilities not only in known training environments, but also in real-world systems.

First Field Test Results

For the first field test, we evaluated the agent through two channels: public programs and a partner company’s staging environment.

Channel	Target	Input	Purpose
HackerOne	Public VDP / BBP programs	URL only	Black-box validation against real systems
Pro bono PoC	Staging environment of one partner company	URL and credentials	Gray-box validation against a real system

HackerOne

HackerOne is a bug bounty platform that aggregates vulnerability disclosure programs operated by companies and government agencies. Programs mainly fall into two categories.

VDP (Vulnerability Disclosure Program): a responsible disclosure channel that generally does not pay bounties. Many government programs fall into this category.
BBP (Bug Bounty Program): a commercial program that pays bounties for valid vulnerability reports.

We ran the agent against 30 target URLs across VDP and BBP programs. Each target was tested as a one-hour black-box test with only the URL as input.

Category	Result
VDP	3 vulnerability reports were triaged (= validated as legitimate vulnerabilities by the receiving organization) in the U.S. Department of Defense (DoD) program
VDP leaderboard	#86 worldwide in the 90-day category
BBP	3 reports submitted, all marked as duplicates and not eligible for bounties

HackerOne Ranking

Based on these results, we have wrapped up the HackerOne experiment for now and are focusing on tuning with our own benchmarks, described later.

Pro Bono PoC

We also ran a pro bono fully autonomous pentest against the staging environment of one partner company. With only a URL and credentials as input, the agent found multiple exploitable vulnerabilities, including a Stored XSS.

Details about the target system and the vulnerabilities are not disclosed due to confidentiality obligations and abuse prevention.

Architecture and Design Decisions

There were two main design challenges in running a pentest AI agent fully autonomously.

How to handle the different requirements of reconnaissance, vulnerability analysis, exploitation, and verification
How to prevent out-of-scope access and excessive request volume without human supervision

With those constraints in mind, the current architecture is shown below. We omit proprietary details and only cover what can be shared publicly.

---
config:
  layout: dagre
  theme: base
---
flowchart LR
    %% Entry Point
    U(("User")) -- Target URL --> SA

    %% Control Plane
    subgraph Control_Plane["Control Plane"]
        SA["Strategist Agent"]
    end

    %% Core Components
    subgraph Blackboard["Blackboard"]
        direction LR
        TQ[("Task Queue")]
        FG[("Findings Graph")]
        HT[("Knowledge Base")]
    end

    %% Tactical Plane
    subgraph Tactical["Tactical Plane"]
        direction LR

        subgraph Agents["　"]
            direction LR
            R["Recon Agents"]
            V["Vulnerability Analysis Agents"]

            subgraph E_Unit["Exploit Unit"]
                direction TB
                E["Exploit Agents"]
                E --> SQLI["SQLi"]
                E --> XSS["XSS"]
                E --> ETC["..."]
            end
        end

        Agents ==> Toolset

        subgraph Toolset["Toolset"]
            direction TB
            BW["Web Browser"]
            SH["Kali Shell"]
            PY["Python"]
            PI["Proxy Inspector"]
        end
    end

    %% Validation Layer
    subgraph Validation["Validation Layer"]
        direction TB
        VA["Verification Agent"]
        RA["Report Agent"]
    end

    %% Infrastructure
    P_Svc{{"MITM Proxy"}}
    TWA(("Target Web App"))

    %% Logic Flow
    SA <--> Blackboard
    SA -. "Dispatch" .-> R & V & E & VA & RA
    SA -- "Push/Pop" --> TQ

    %% Output & Verification Flow
    Tactical -- Exploit PoC --> VA
    VA -- Confirmed --> RA
    Validation <--> Blackboard

    %% Traffic Flow
    Toolset -- Traffic --> P_Svc
    VA -- Verification Traffic --> P_Svc
    P_Svc -- Request --> TWA
    TWA -- Response --> P_Svc

%% Style Definitions
    style U fill:#f8f9fa, stroke:#212529, stroke-width:2px
    style SA fill:#eef6ff, stroke:#007bff, stroke-width:2px
    style Control_Plane fill:#ffffff, stroke:#007bff, stroke-width:2px, stroke-dasharray: 5 5

    style Blackboard fill:#f8f9ff, stroke:#5c6bc0, stroke-width:2px
    style TQ fill:#ffffff, stroke:#5c6bc0, stroke-width:1px
    style FG fill:#ffffff, stroke:#5c6bc0, stroke-width:1px
    style HT fill:#ffffff, stroke:#5c6bc0, stroke-width:1px

    style Tactical fill:#ffffff, stroke:#333333, stroke-width:2px, stroke-dasharray: 5 5
    style R fill:#f0f7ff, stroke:#2196f3, stroke-width:1px
    style V fill:#f3e5f5, stroke:#9c27b0, stroke-width:1px
    style E_Unit fill:#fff5f5, stroke:#f44336, stroke-width:1px, stroke-dasharray: 2 2
    style E fill:#ffebee, stroke:#f44336, stroke-width:1px
    style SQLI fill:#ffffff, stroke:#f44336, stroke-width:1px
    style XSS fill:#ffffff, stroke:#f44336, stroke-width:1px
    style ETC fill:#ffffff, stroke:#f44336, stroke-width:1px

    style Toolset fill:#fffaf0, stroke:#fb8c00, stroke-width:2px
    style P_Svc fill:#e8f5e9, stroke:#4caf50, stroke-width:2px
    style TWA fill:#f5f5f5, stroke:#757575, stroke-width:2px
    style Validation fill:#ffffff, stroke:#28a745, stroke-width:2px, stroke-dasharray: 5 5

In the diagram, Recon / Vulnerability Analysis / Exploit are not single agents. They represent groups of agents split by role. The high-level flow is as follows.

Step	Main owner	Description
1. Strategy planning	Strategist Agent	Looks at shared state and decides the next task to execute
2. Task execution	Recon / Vulnerability Analysis / Exploit Agents	Performs individual tasks such as reconnaissance, vulnerability analysis, and exploitation
3. Verification	Verification Agent	Reproduces candidate findings and filters out false positives
4. Report generation	Report Agent	Turns confirmed vulnerabilities into a reportable form

Each step proceeds while referring to shared state.

The two most important design decisions are described below.

Why We Chose a Multi-Agent System

Our pentest AI agent does not use a single agent to do everything. It is built as a multi-agent system, where agents specialized for different phases collaborate.

Some open-source projects for fully autonomous web pentesting, such as westonbrown/Cyber-AutoAgent, use a single-agent architecture. The right choice depends on the goal, but we chose a multi-agent system for the following reasons.

Per-phase tuning is easier: models, system prompts, toolsets, and guardrails can be optimized independently for each phase. Reconnaissance and exploitation require different capabilities and different guardrails, so this flexibility matters.
Phase-specific false-positive checkers can be added: mechanical filters can be optimized for each phase.
State management can be separated: some phases benefit from their own state management or forgetting behavior.

The trade-off is that managing multiple agents and shared state adds complexity. Even so, we concluded that the benefits outweigh the cost. XBOW also uses a multi-agent system, which suggests that this direction can work at practical scale.

Guardrails for Fully Autonomous Execution

Because we chose full automation without a Human-in-the-Loop, guardrails are essential for safe operation without human supervision.

We implement guardrails in two layers.

Prompt-level guardrails: each agent’s system prompt explicitly states the allowed scope and prohibited behaviors.
System-level guardrails: all traffic passes through a man-in-the-middle proxy (L7 Egress Gateway), which mechanically blocks out-of-scope hosts, excessive request rates, unknown protocols, and similar cases.

If we rely only on prompt-level guardrails, safety breaks as soon as the LLM ignores or misunderstands an instruction. Full automation requires system-level guardrails that enforce constraints on a path independent of the LLM’s behavior. Requests and responses that pass through the MITM Proxy are also recorded and automatically sent to scanners in a later stage.

Why Existing Benchmarks Were Not Enough

To keep iterating on per-phase tuning, which is one of the advantages of a multi-agent system, we need benchmarks that can measure performance differences at the phase level.

For pentest AI agents, benchmarks such as PortSwigger Academy Lab and XBOW Benchmark are used for performance evaluation. They are useful for measuring end-to-end attack-chain strength.

At the same time, they did not fully match our improvement loop for the following reasons.

As model capabilities improve, existing benchmarks are starting to saturate at the upper end.
They make it hard to measure fine-grained tuning differences, such as initial reconnaissance with a Web Browser or safety checks for exploitation.
They make it hard to detect regressions where only a specific phase gets worse after switching models.

For this reason, we developed our own benchmarks for phase-level performance improvement and cost optimization, and use them for continuous evaluation. We plan to share the details in a future post.

Closing

In the first field test, we confirmed a certain level of usefulness against real-world environments. We will continue R&D toward automating more advanced attack chains and supporting a broader range of targets.

We are looking for companies interested in participating in proof-of-concept testing of our pentest AI agent. If you are interested, please contact us.