

Maxime Lamothe-Brassard, founder and CEO of LimaCharlie, sought answers on AI’s current capabilities in the SecOps space. Plenty of benchmarks exist to test AI's knowledge of cybersecurity, but none test whether a model actually does the work.
There's a significant difference between an AI that can answer trivia questions about CVEs and one that can pick up an alert, investigate it, and produce an incident report. That gap matters more now than ever.
Security teams have been slow to adopt AI due to concerns about reliability and data exposure. Adversaries have no such hesitation. They're already using AI to accelerate attacks, and the industry will have to follow suit or fall behind.
Maxime knew practitioners evaluating agentic AI security needed an honest baseline to work from, not vendor marketing or unverifiable third-party claims. They needed to know what frontier models can actually do when handed real security tooling and an alert.
So he built a way to test them.
Maxime’s project, ASW-Bench (Agentic SecOps Workspace Benchmark), is an open-source framework designed around a single guiding principle: test the LLM’s capabilities as-is, with as little customization as possible.
Two things were central to the design. First, a neutral, reproducible environment. Maxime chose LimaCharlie as the underlying platform. It provides a complete, API-first set of security operations capabilities without the complexity of stitching together multiple moving-target open-source tools. Every behavior is observable and documented. There's no black box.
Second, the models had to run with as little assistance as possible. No fine-tuning, no custom prompt chains, no proprietary middleware sitting between the model and the tools. Just a model, a prompt, and access to LimaCharlie's CLI.
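To make that "model, prompt, CLI" setup concrete, here is a minimal sketch of how such a harness could hand each model the same raw task. The prompt text and the exact CLI invocations are illustrative assumptions, not ASW-Bench's actual harness (which lives in the GitHub repo); the point is that nothing sits between the vendor CLI and the task.

```python
import shlex

# Hypothetical task prompt; the real benchmark prompt is in the ASW-Bench repo.
TASK_PROMPT = (
    "A detection has fired in this LimaCharlie organization. "
    "Investigate it, determine the scope of the attack, and write "
    "an incident report to report.md."
)

# Assumed non-interactive invocations for each vendor CLI.
MODEL_CLIS = {
    "claude": ["claude", "-p"],   # Claude Code (Anthropic)
    "codex": ["codex", "exec"],   # Codex CLI (OpenAI)
    "gemini": ["gemini", "-p"],   # Gemini CLI (Google)
}

def build_command(model: str, prompt: str = TASK_PROMPT) -> list[str]:
    """Assemble the command that passes the raw prompt straight to the
    model CLI: no fine-tuning, no prompt chains, no middleware."""
    if model not in MODEL_CLIS:
        raise ValueError(f"unknown model CLI: {model}")
    return MODEL_CLIS[model] + [prompt]

print(shlex.join(build_command("claude"))[:40])
```

Each command would then be run inside a workspace where the LimaCharlie CLI is available on PATH, so the model itself decides which queries to issue.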
That matters because many AI SOC vendors test inside heavily engineered setups, making it difficult to know how much performance comes from the underlying model versus the custom layer on top.
Four frontier models were tested through three CLIs: Claude Opus and Claude Sonnet via Claude Code (Anthropic), plus Codex CLI (OpenAI) and Gemini CLI (Google).
His first testing scenario mirrors one of the most common real-world SOC tasks: a detection fires, and an analyst needs to investigate it, determine scope, and write it up. For this task, adversary scripts execute a realistic post-exploitation attack chain including:
C2 beaconing
Credential theft
Lateral movement
Persistence mechanisms
Defense evasion
DNS exfiltration
The model's job is to find as much of the attack as possible.
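A simple way to grade that job is to check each model's incident report for evidence of every phase in the chain above. The sketch below is a hedged illustration of that idea: the keyword lists and MITRE ATT&CK technique IDs are assumptions for demonstration, not ASW-Bench's actual rubric.

```python
# Map each attack phase to an assumed ATT&CK technique ID and keywords
# that would indicate the model surfaced it in its report.
EXPECTED_FINDINGS = {
    "C2 beaconing": ("T1071", ["beacon", "command and control"]),
    "Credential theft": ("T1003", ["lsass", "sam dump", "credential"]),
    "Lateral movement": ("T1021", ["smb", "lateral", "ping sweep"]),
    "Persistence": ("T1547", ["persistence", "wmi", "autorun"]),
    "Defense evasion": ("T1070", ["log clear", "cleared", "evasion"]),
    "DNS exfiltration": ("T1048", ["dns exfil", "exfiltration", "staging"]),
}

def score_report(report: str) -> dict[str, bool]:
    """Mark a phase as found if the report mentions its technique ID
    or any of its indicator keywords (case-insensitive)."""
    text = report.lower()
    return {
        phase: tid.lower() in text or any(kw in text for kw in kws)
        for phase, (tid, kws) in EXPECTED_FINDINGS.items()
    }

sample = ("Detected C2 beaconing over HTTPS (T1071); "
          "LSASS dump indicates credential theft.")
found = score_report(sample)
print(sum(found.values()), "of", len(found), "phases identified")  # 2 of 6
```

A per-phase breakdown like this is what makes the gap between models visible: two reports can both "find the attack" while covering very different fractions of the chain.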
All four models were tested against the same environment with no tuning. These are baseline scores.

Every model correctly identified the malicious beacon and C2 channel, the most visible indicators in the scenario. From there, the gap between Claude and the others widened. Both Claude Opus and Claude Sonnet identified credential theft (LSASS and SAM/SYSTEM dumps), lateral movement (ARP/ping sweep and SMB enumeration), and event log clearing. They produced comprehensive attack narratives with full MITRE ATT&CK mappings.
Gemini produced a coherent multi-phase narrative with remediation steps but missed credential access and lateral movement. Codex identified WMI persistence behavior and SMB reconnaissance but also missed credential access and event log clearing.
No model discovered the DNS exfiltration or data staging activity, which is an interesting finding in its own right. Further testing is needed to determine whether this was due to the nature of the task or constraints of the test environment.
These scores represent models operating out of the box. In real-world agentic security deployments, models would be tuned, given organizational context, and integrated with platform-specific knowledge.
It's reasonable to expect that with even modest tuning these scores would climb significantly. The baseline results alone send a strong signal of where agentic SecOps automation is headed.
For MSSPs and security teams evaluating where agentic security fits into their operations, ASW-Bench offers something rare: an honest, reproducible, vendor-neutral look at current AI capability on real SOC work.
The ASW-Bench project is open source and welcomes community contributions.
Explore the full results, raw output logs, and scenario at github.com/refractionPOINT/asw-bench.
Learn more about LimaCharlie's agentic SecOps platform at limacharlie.com.