Mr. Latte


LLM Agent-Based Vulnerability Discovery — How Mythos Reframes Security Testing

Mozilla CTO Bobby Holley called it “vertigo.” In April 2026, Anthropic Mythos Preview read the Firefox 150 source code and found 271 vulnerabilities — without writing a single fuzzing harness.

What’s more interesting than the number is the method. AFL mutates bytes repeatedly. Mythos reasons about why code is vulnerable. Both operate under the same label of “automated security testing,” but at fundamentally different layers. This piece examines what that difference means, where the reversal happens, and where the limits emerge.


Table of Contents

  1. Structural Limits of Coverage-Guided Fuzzing
  2. The Harness Bottleneck: How Slow Is It, Really?
  3. How Mythos Works: Step-by-Step Analysis
  4. AFL vs Mythos: What Is Structurally Different?
  5. Firefox’s 271 Findings: What Was Found and What Was Missed?
  6. Where Capital Moved After Mythos
  7. Limitations and Open Questions

Structural Limits of Coverage-Guided Fuzzing

AFL (American Fuzzy Lop) and libFuzzer have been the foundation of open-source security since 2014. Google OSS-Fuzz runs continuous fuzzing on more than 1,000 projects atop this foundation and has discovered tens of thousands of bugs. It remains powerful. But there is one structural bottleneck.

Fuzzing requires something to execute. Coverage-guided fuzzing needs two things to work.

First, the program must be running — not merely compiling, but able to receive and process arbitrary inputs. Second, those execution paths must be observable via instrumentation: the fuzzer must be able to track which code edges were executed.
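The second condition is a small mechanism at heart. For intuition, here is a sketch of AFL-style edge tracking (illustrative, not AFL's actual code; the real instrumentation is injected at compile time):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of AFL-style edge coverage. Each basic block gets an ID at
// compile time; the edge A->B is hashed into a 64 KiB hit-count map
// that the fuzzer process reads after every execution.
constexpr std::size_t kMapSize = 65536;
std::array<std::uint8_t, kMapSize> coverage_map{};
std::uint32_t prev_location = 0;

// The compiler inserts a call like this at the start of every basic block.
void trace_edge(std::uint32_t cur_location) {
    coverage_map[(cur_location ^ prev_location) % kMapSize]++;
    prev_location = cur_location >> 1;  // shift so edges A->B and B->A hash differently
}

// After each run the fuzzer scans the map: a byte that was never non-zero
// before means this input reached a new edge, so the input is kept as a seed.
bool has_new_edge(const std::array<std::uint8_t, kMapSize>& global_seen) {
    for (std::size_t i = 0; i < kMapSize; ++i)
        if (coverage_map[i] != 0 && global_seen[i] == 0) return true;
    return false;
}
```

Everything the fuzzer "knows" about the program lives in that byte map, which is exactly why it needs the program running under instrumentation in the first place.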

The work of satisfying both conditions is harness writing. A fuzzing harness is wrapper code that packages a target function or library into a “fuzzable unit.”


The Harness Bottleneck: How Slow Is It, Really?

A basic libFuzzer harness looks like this:

// my_parser_fuzzer.cc
// libFuzzer entry point: LLVMFuzzerTestOneInput
#include <stddef.h>
#include <stdint.h>
#include "my_parser.h"

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    // inject fuzzer-generated arbitrary byte stream into the parser
    MyParser parser;
    parser.ParseBuffer(data, size);
    return 0;  // always return 0; crashes are detected via signals/sanitizers, not the return value
}

That’s simple enough. But real harness writing is far more complex:

// Network protocol parser harness — real-world example
#include "net/http/http_response_headers.h"
#include "base/strings/string_piece.h"

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    // 1. Reject inputs too short to form the minimal structure the parser expects
    if (size < 4) return 0;
    
    // 2. Initialize state — set up global state or context the parser depends on
    base::StringPiece input(reinterpret_cast<const char*>(data), size);
    
    // 3. Memory management: keep object lifetime explicit so sanitizers report cleanly
    scoped_refptr<net::HttpResponseHeaders> headers =
        base::MakeRefCounted<net::HttpResponseHeaders>(std::string(input));

    // 4. Call all relevant methods to open code paths of interest
    std::string normalized;
    headers->GetNormalizedHeaders(&normalized);  // passing nullptr would crash the harness itself
    std::string value;
    headers->GetNormalizedHeader("content-type", &value);
    
    return 0;
}

Writing this harness requires knowing the internal structure of net::HttpResponseHeaders, the memory semantics of base::StringPiece, and which methods open security-relevant code paths. In other words, you have to read and understand the code first.

This is a common complaint among OSS-Fuzz contributors: thousands of entry points sit in a “low coverage due to no harness” state. Large codebases like Firefox, Chrome, and OpenSSL contain hundreds of functions that could harbor vulnerabilities, yet most have no harness, because writing one takes hours to days per target.


How Mythos Works: Step-by-Step Analysis

Mythos bypasses the harness-writing step entirely. Instead of randomly exploring execution paths, it reads source code and first reasons about where bugs are likely to be.

                    ┌───────────────────────────────────────────┐
                    │         Mythos Execution Pipeline         │
                    └───────────────────────────────────────────┘

 Source code + binaries
        │
        ▼
 ┌─────────────┐    Score files for vulnerability potential
 │File priority│     1 (low) to 5 (high)
 │   scoring   │    Memory manipulation, parsing logic, privilege boundaries
 └──────┬──────┘
        │ High-score files first
        ▼
 ┌─────────────┐    Parallel isolated containers
 │  Parallel   │    Each VM: independent source + binary environment
 │ ephemeral   │    Failures don't affect other VMs
 │    VMs      │
 └──────┬──────┘
        │
        ▼
 ┌──────────────┐   Read code → form hypothesis
 │Code reasoning│   "This bounds check looks insufficient"
 │ + hypothesis │   "Integer overflow possible in this type conversion"
 └──────┬───────┘
        │
        ▼
 ┌─────────────┐    Verify with actual build and execution
 │  Execution  │    AddressSanitizer, UBSan enabled
 │verification │    Generate reproducible PoC code
 │  + PoC      │
 └──────┬──────┘
        │
        ▼
 ┌─────────────┐    Separate agent reviews findings
 │ Adversarial │    "Does this PoC actually cause a crash?"
 │ Self-Review │    Filter out false positives
 └─────────────┘

File priority scoring combines static analysis with code comprehension. Files that directly manipulate memory, parse external input, or handle privilege boundaries score highest. The agent doesn’t process all 7,000 Firefox entry points evenly — it digs into high-risk files first.
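Anthropic hasn't published the scoring internals. As a rough sketch of the idea, a signal-based scorer over the categories named above might look like this (the struct fields and weights are assumptions, not Mythos's real model):

```cpp
#include <algorithm>

// Hypothetical per-file signals; Mythos's real features are not public.
struct FileSignals {
    bool manipulates_memory;     // raw pointers, memcpy, manual length math
    bool parses_external_input;  // network bytes, file formats, IPC messages
    bool crosses_privilege;      // sandbox boundaries, permission checks
    int  recent_cve_count;       // prior vulnerabilities in this file
};

// Map signals onto the 1 (low) to 5 (high) scale from the pipeline diagram.
int priority_score(const FileSignals& s) {
    int score = 1;
    if (s.manipulates_memory)    score += 1;
    if (s.parses_external_input) score += 1;
    if (s.crosses_privilege)     score += 1;
    if (s.recent_cve_count > 0)  score += 1;
    return std::min(score, 5);
}
```

A file that parses external input, does manual memory management, and sits on a privilege boundary maxes out the scale and gets analyzed first.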

Adversarial self-review is the key step that reduces false positives. A separate agent — not the one that found the vulnerability — reviews “is this PoC real?” The 112 reported memory safety bugs were all confirmed genuine. “Almost no false positives” is a direct result of this step.
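Mechanically, the step is a filter: an agent that did not produce a finding rebuilds the target and re-runs the PoC, and only findings whose crash reproduces survive. A minimal sketch (the Finding type and the reproduction callback are assumptions standing in for the real agent):

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical finding record; Mythos's real schema is not public.
struct Finding {
    std::string file;
    std::string hypothesis;
    std::string poc;  // reproduction input or script
};

// Adversarial self-review as a filter. `reproduces` stands in for a
// separate agent rebuilding the target and running the PoC under ASan.
std::vector<Finding> self_review(
    const std::vector<Finding>& candidates,
    const std::function<bool(const Finding&)>& reproduces) {
    std::vector<Finding> confirmed;
    for (const auto& f : candidates)
        if (reproduces(f))  // keep only findings whose crash re-triggers
            confirmed.push_back(f);
    return confirmed;
}
```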


AFL vs Mythos: What Is Structurally Different?

Dimension              | Coverage-Guided Fuzzing (AFL/libFuzzer)                   | LLM Agent (Mythos)
-----------------------+-----------------------------------------------------------+-----------------------------------------------
Exploration method     | Byte-level mutation → new edge coverage                   | Semantic code reasoning → hypothesis formation
Harness required?      | Yes (manual, per-function)                                | No (direct source code analysis)
Discoverable bug types | Primarily memory safety (buffer overflow, use-after-free) | Memory safety + logic bugs, auth bypass, crypto misuse
Code understanding     | None (only observes execution results)                    | Yes (reasons about intent vs. implementation)
Parallelization unit   | Input generation (CPU-bound)                              | Code analysis tasks (agent instances)
False positive rate    | Low (crash = reproducible)                                | Low (after adversarial self-review)
Cost                   | Cheap (large CPU clusters)                                | High (repeated large LLM calls)
Scalability            | Easy horizontal scaling                                   | Scalable via agent parallelism, but expensive
Representative tools   | AFL++, libFuzzer, Honggfuzz, OSS-Fuzz                     | Anthropic Mythos, XBOW

The most important row in this table is “Discoverable bug types.” Coverage-guided fuzzing’s random exploration of input-processing paths works well for memory errors. But business logic bugs — “authentication tokens can be bypassed if called in a specific order” — are undetectable without understanding the code. That is the space Mythos has opened.
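A contrived illustration of the difference (not Firefox code): the function below is memory-safe and never crashes, so no amount of byte mutation flags it, but reading it against its intent exposes an authentication bypass:

```cpp
#include <string>

// Contrived example of a logic bug that no crash will ever reveal.
struct Token {
    std::string user;
    bool signature_valid;
    long expires_at;  // unix seconds
};

// BUG: the early return for a valid signature skips the expiry check,
// so an expired but well-signed token is still accepted. Memory-safe,
// crash-free, invisible to a fuzzer; obvious once you read the intent.
bool is_authorized(const Token& t, long now) {
    if (t.signature_valid) return true;  // should also require now < expires_at
    return t.signature_valid && now < t.expires_at;
}
```

An LLM agent can state the hypothesis ("the expiry check is unreachable when the signature is valid") directly from the source; a coverage-guided fuzzer has no signal to latch onto, because every input returns cleanly.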


Firefox’s 271 Findings: What Was Found and What Was Missed?

The Firefox 150 analysis breaks down into three layers.

Layer 1 — Confirmed results (memory safety bugs)

112 vulnerabilities were triaged by external security researchers, and all were confirmed genuine. Of these, 10 reached the “capable of full remote code execution” tier, compared with 1 found by Claude Opus 4.6 alone, without Mythos.

Palo Alto Networks put it memorably after running Mythos in a production-environment experiment: “1 year of pentest work in 3 weeks.”

Layer 2 — Where debate remains (logic bug false positives)

SecurityWeek noted that “only 3 external CVEs were registered.” Most of the 271 didn’t result in external CVE assignments — which can be read two ways. First, logic bugs are frequently downgraded to “low exploitability” during triage even when genuine. Second, the LLM found something suspicious in the code but couldn’t build a viable attack chain. Adversarial self-review was effective for memory safety bugs, but judging the actual exploitability of logic bugs still relies heavily on human reviewers.

Layer 3 — Where humans are still needed

Mozilla’s explicit statement: “We don’t think patch automation is realistic.” Discovering a vulnerability and writing a patch are different skills. Vulnerability discovery is about finding suspicious patterns; patching requires understanding both the design intent of the affected code and its side effects, then fixing it with minimal change. At this point, Mythos’s role is “discovery through PoC” — engineers write the patches.


Where Capital Moved After Mythos

The startup funding market reacted immediately after Mythos’s demo went public. The previous investment thesis of “AI makes faster SAST” shifted to “AI does what existing tools structurally couldn’t.”

XBOW ($155M Series C, March 2026, unicorn)
CEO Oege de Moor co-created GitHub Copilot. XBOW builds LLM-based automated penetration testing agents. It goes one step beyond discovering vulnerabilities to automatically generating and validating complete attack chains. Where Mythos focuses on “finding vulnerabilities in code,” XBOW focuses on “validating exploits against live systems.”

ZeroPath ($5M Seed, February 2026)
Focuses on combining LLM with static analysis. Known for strength in authentication bypass pattern detection specifically. Preparing for Series A. Where Mythos is an internal tool from a large model provider, ZeroPath builds SaaS that enterprises can apply to their own codebases.

Pixee ($15M Seed)
Positioned differently. Focuses on “fixing” rather than “finding.” Builds the workflow from vulnerability discovery → automated patch suggestion → code review integration. Pixee is trying to fill the space Mozilla described when they said “patch automation isn’t realistic” — but from the enterprise tooling angle.

Summarizing the positioning in one line each:

  • Mythos: Discovery (Anthropic internal tool, partnership-based)
  • XBOW: Discovery + attack chain validation (enterprise SaaS)
  • ZeroPath: Discovery + CI/CD integration (enterprise SaaS)
  • Pixee: Remediation + workflow integration

Limitations and Open Questions

Is the cost sustainable?

Once a harness is written, coverage-guided fuzzing incurs almost no marginal cost even across billions of executions; you just need a CPU cluster. LLM agents incur cost on every inference call. If parallel ephemeral VMs were running in the thousands during the Firefox 150 analysis, what did that cost? Anthropic hasn't disclosed it. The current cost structure is justifiable for large corporations or high-value targets, but it doesn't yet work at the scale of OSS-Fuzz covering the entire open-source ecosystem.
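A back-of-envelope comparison makes the shape of the problem visible. Every number below is an assumption chosen for illustration, not a disclosed figure:

```cpp
// Back-of-envelope cost comparison. EVERY number here is an assumption
// for illustration; neither Anthropic nor Google has disclosed real figures.

// Fuzzing: assume a cloud vCPU at $0.04/hour driving 10,000 execs/sec.
double fuzz_cost_per_million_execs() {
    const double dollars_per_hour = 0.04;  // assumed vCPU price
    const double execs_per_sec = 10000.0;  // assumed harness throughput
    return dollars_per_hour / 3600.0 / execs_per_sec * 1e6;
}

// LLM agent: assume 50,000 tokens of reading and reasoning per hypothesis
// at $10 per million tokens (both invented for illustration).
double llm_cost_per_hypothesis() {
    const double tokens_per_hypothesis = 50000.0;    // assumed
    const double dollars_per_million_tokens = 10.0;  // assumed
    return tokens_per_hypothesis / 1e6 * dollars_per_million_tokens;
}
```

Under these assumed prices, a single LLM hypothesis costs roughly as much as hundreds of millions of fuzzing executions, which is why the two approaches scale so differently.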

The reproducibility problem

Coverage-guided fuzzing is reproducible: a crashing input, once found, re-triggers its crash deterministically, and a fixed seed corpus and RNG seed replay the same campaign. LLM agent reasoning is non-deterministic. Running the same codebase twice may surface different vulnerabilities. This also means “not found” doesn’t imply “not there.”

Google OSS-Fuzz’s LLM harness auto-generation

Google chose a different direction: using LLM to automatically generate fuzzing harnesses. The LLM reads the code and writes a harness; actual vulnerability discovery is still done by AFL/libFuzzer. This approach reduces the harness bottleneck while keeping the existing fuzzing infrastructure in place. A contrast to Mythos’s fully agentic approach.

Both use “AI that understands code,” but the usage differs:

  • Google: LLM → generate harness → AFL/libFuzzer executes
  • Mythos: LLM → directly hypothesize and validate vulnerabilities

Which is more effective remains open. The harness generation approach has lower cost and reuses existing toolchains. The Mythos approach opens up the logic bug territory that harnesses can’t cover.


Looking at where this field is heading, the two paradigms are more likely to settle as complements than competitors. AFL/libFuzzer covers “memory errors in executable input-handling paths” cheaply and at scale; LLM agents cover “logic and authentication bugs that require code understanding” at high cost. In practice, Mozilla’s approach was also a parallel deployment alongside existing fuzzing infrastructure, not Mythos alone.

From a security engineering perspective, the biggest change isn’t the tools — it’s the redefinition of expertise. If Mythos becomes mainstream, “ability to write harnesses well” matters less, and “designing and operating LLM agent pipelines” plus “triaging logic bugs” become the core competencies of a security engineer.

