

Maxime Lamothe-Brassard, founder and CEO of LimaCharlie, sought answers on AI’s current capabilities in the SecOps space. Plenty of benchmarks exist to test AI's knowledge of cybersecurity, but none test whether a model actually does the work.
There's a significant difference between an AI that can answer trivia questions about CVEs and one that can pick up an alert, investigate it, and produce an incident report. That gap matters more now than ever.
Security teams have been slow to adopt AI due to concerns about reliability and data exposure. Adversaries have no such hesitation. They're already using AI to accelerate attacks, and the industry will have to follow suit or fall behind.
Maxime knew practitioners evaluating agentic AI security needed an honest baseline to work from, not vendor marketing or unverifiable third-party claims. They needed to know what frontier models can actually do when handed real security tooling and an alert.
So he built a way to test them.
Maxime’s project, ASW-Bench (Agentic SecOps Workspace Benchmark), is an open-source framework designed around a single guiding principle: test the LLM’s capabilities as-is, with as little customization as possible.
Two things were central to the design. First, a neutral, reproducible environment. Maxime chose LimaCharlie as the underlying platform. It provides a complete, API-first set of security operations capabilities without the complexity of stitching together multiple moving-target open-source tools. Every behavior is observable and documented. There's no black box.
Second, the models had to run with as little assistance as possible. No fine-tuning, no custom prompt chains, no proprietary middleware sitting between the model and the tools. Just a model, a prompt, and access to LimaCharlie's CLI.
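To make that "model, prompt, CLI" setup concrete, here is a minimal sketch of how such a harness could hand each model the same raw task. The prompt text and the exact CLI invocations are illustrative assumptions, not ASW-Bench's actual harness (which lives in the GitHub repo); the point is that nothing sits between the vendor CLI and the task.

```python
import shlex

# Hypothetical task prompt; the real benchmark prompt is in the ASW-Bench repo.
TASK_PROMPT = (
    "A detection has fired in this LimaCharlie organization. "
    "Investigate it, determine the scope of the attack, and write "
    "an incident report to report.md."
)

# Assumed non-interactive invocations for each vendor CLI.
MODEL_CLIS = {
    "claude": ["claude", "-p"],   # Claude Code (Anthropic)
    "codex": ["codex", "exec"],   # Codex CLI (OpenAI)
    "gemini": ["gemini", "-p"],   # Gemini CLI (Google)
}

def build_command(model: str, prompt: str = TASK_PROMPT) -> list[str]:
    """Assemble the command that passes the raw prompt straight to the
    model CLI: no fine-tuning, no prompt chains, no middleware."""
    if model not in MODEL_CLIS:
        raise ValueError(f"unknown model CLI: {model}")
    return MODEL_CLIS[model] + [prompt]

print(shlex.join(build_command("claude"))[:40])
```

Each command would then be run inside a workspace where the LimaCharlie CLI is available on PATH, so the model itself decides which queries to issue.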
That matters because many AI SOC vendors test inside heavily engineered setups, making it difficult to know how much performance comes from the underlying model versus the custom layer on top.
Four frontier models were tested through three CLIs: Claude Opus and Claude Sonnet via Claude Code (Anthropic), plus Codex CLI (OpenAI) and Gemini CLI (Google).
His first testing scenario mirrors one of the most common real-world SOC tasks: a detection fires, and an analyst needs to investigate it, determine scope, and write it up. For this task, adversary scripts execute a realistic post-exploitation attack chain including:
C2 beaconing
Credential theft
Lateral movement
Persistence mechanisms
Defense evasion
DNS exfiltration
The model's job is to find as much of the attack as possible.
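A simple way to grade that job is to check each model's incident report for evidence of every phase in the chain above. The sketch below is a hedged illustration of that idea: the keyword lists and MITRE ATT&CK technique IDs are assumptions for demonstration, not ASW-Bench's actual rubric.

```python
# Map each attack phase to an assumed ATT&CK technique ID and keywords
# that would indicate the model surfaced it in its report.
EXPECTED_FINDINGS = {
    "C2 beaconing": ("T1071", ["beacon", "command and control"]),
    "Credential theft": ("T1003", ["lsass", "sam dump", "credential"]),
    "Lateral movement": ("T1021", ["smb", "lateral", "ping sweep"]),
    "Persistence": ("T1547", ["persistence", "wmi", "autorun"]),
    "Defense evasion": ("T1070", ["log clear", "cleared", "evasion"]),
    "DNS exfiltration": ("T1048", ["dns exfil", "exfiltration", "staging"]),
}

def score_report(report: str) -> dict[str, bool]:
    """Mark a phase as found if the report mentions its technique ID
    or any of its indicator keywords (case-insensitive)."""
    text = report.lower()
    return {
        phase: tid.lower() in text or any(kw in text for kw in kws)
        for phase, (tid, kws) in EXPECTED_FINDINGS.items()
    }

sample = ("Detected C2 beaconing over HTTPS (T1071); "
          "LSASS dump indicates credential theft.")
found = score_report(sample)
print(sum(found.values()), "of", len(found), "phases identified")  # 2 of 6
```

A per-phase breakdown like this is what makes the gap between models visible: two reports can both "find the attack" while covering very different fractions of the chain.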
All four models were tested against the same environment with no tuning. These are baseline scores.

Every model correctly identified the malicious beacon and C2 channel, the most visible indicators in the scenario. From there, the gap between Claude and the others widened. Both Claude Opus and Claude Sonnet identified credential theft (LSASS and SAM/SYSTEM dumps), lateral movement (ARP/ping sweep and SMB enumeration), and event log clearing. They produced comprehensive attack narratives with full MITRE ATT&CK mappings.
Gemini produced a coherent multi-phase narrative with remediation steps but missed credential access and lateral movement. Codex identified WMI persistence behavior and SMB reconnaissance but also missed credential access and event log clearing.
No model discovered the DNS exfiltration or data staging activity, which is an interesting finding in its own right. Further testing is needed to determine whether this was due to the nature of the task or constraints of the test environment.
These scores represent models operating out of the box. In real-world agentic security deployments, models would be tuned, given organizational context, and integrated with platform-specific knowledge.
It's reasonable to expect that with even modest tuning these scores would climb significantly. The baseline results alone send a strong signal of where agentic SecOps automation is headed.
For MSSPs and security teams evaluating where agentic security fits into their operations, ASW-Bench offers something rare: an honest, reproducible, vendor-neutral look at current AI capability on real SOC work.
The ASW-Bench project is open source and welcomes community contributions.
Explore the full results, raw output logs, and scenario at github.com/refractionPOINT/asw-bench.
Learn more about LimaCharlie's agentic SecOps platform at limacharlie.com.