ExCyTIn Bench: Open Source Agentic AI Benchmark for Real SOC Investigations

Microsoft’s security team has open‑sourced ExCyTIn‑Bench, a new benchmarking framework designed to evaluate how well large language models and agentic AI systems perform real‑world cyber threat investigations inside a simulated Security Operations Center (SOC) — and it changes the rules for how defenders and vendors measure “useful” AI in security.

Background

Microsoft developed ExCyTIn‑Bench to move LLM evaluation beyond trivia and static Q&A into the workflows that SOC analysts actually run every day. The benchmark simulates multistage attack scenarios in an Azure‑based SOC and gives AI agents live access to a structured security datastore that mirrors Microsoft Sentinel and related telemetry. That store is built from 57 log tables and drives 589 automatically generated question‑answer pairs tied to incident graphs that represent real investigative chains of evidence. The dataset and methodology are described in the ExCyTIn‑Bench paper and accompanying dataset release; Microsoft’s security blog and a hosted dataset card confirm the scope and intent.
ExCyTIn departs from older evaluation approaches in two important ways. First, instead of asking a model a single, static question about an artifact (for example "what malware family is this?"), it measures planning and multistep investigative behavior: an agent must decide which queries to run, how to pivot across tables, and how to synthesize evidence to reach conclusions. Second, ExCyTIn anchors every question and answer to a node or edge in an incident graph — typically a bipartite alert–entity graph — enabling a fine‑grained reward signal for each investigative action rather than a single pass/fail score. Microsoft says it uses the framework internally to evaluate Security Copilot, Sentinel integrations and Defender features, and that the benchmark is now open for community use.

What ExCyTIn‑Bench actually contains

Technical composition

  • A controlled Azure tenant with simulated attacks covering a set of multistage incidents (the academic paper describes eight attack scenarios).
  • 57 log tables taken from Microsoft Sentinel‑style telemetry and related services to recreate the scale, noise, and heterogeneity SOC analysts face.
  • 589 question–answer pairs automatically generated from the investigation graphs; each pair ties a start node (context) to an end node (ground‑truth answer) so model responses can be objectively scored (an illustrative record shape is sketched after this list).
  • Dataset and transparency documentation published on Hugging Face enumerating the data volume (logs totaling about 20.4 GB) and provenance (synthetic logs collected and Q&A pairs generated on specific dates); the dataset card and transparency note explicitly state that the data is synthetic and intended for research use.
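
To make the question–answer structure concrete, the sketch below shows what a single graph‑anchored record could look like in Python. The field names, node identifiers, and table names are hypothetical illustrations, not the actual ExCyTIn‑Bench schema; the Hugging Face dataset card documents the real format.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationQA:
    """Illustrative shape of one graph-anchored question-answer pair.

    All field names here are assumptions for illustration; consult the
    published dataset card for the actual ExCyTIn-Bench schema.
    """
    incident_id: str     # which simulated multistage incident the pair belongs to
    question: str        # natural-language investigative question
    start_node: str      # graph node given to the agent as context (e.g. an alert)
    end_node: str        # graph node holding the ground-truth answer (e.g. an entity)
    answer: str          # expected answer derived from the end node
    relevant_tables: list[str] = field(default_factory=list)  # tables the agent should pivot through


example = InvestigationQA(
    incident_id="incident_03",
    question="Which host first executed the encoded PowerShell command flagged in this alert?",
    start_node="alert:encoded_powershell_command",
    end_node="entity:host/WKSTN-0042",
    answer="WKSTN-0042",
    relevant_tables=["SecurityAlert", "DeviceProcessEvents"],
)
```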

How evaluation works

  • The benchmark runs agents inside an environment where they can issue queries to the simulated tenant, inspect returned records, and perform multi‑hop reasoning across different tables. Each agent action (for example: run a KQL query, follow an IP to a host record, or flag an alert) receives a stepwise reward; final scoring aggregates those rewards to measure investigative strategy and execution, not just terminal correctness. This structure is intended to reveal how a model reached a conclusion and where it failed during the investigation.
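
As a rough mental model of that loop (not the actual ExCyTIn‑Bench API), the sketch below shows how an agent–environment interaction with per‑step rewards can be wired together; SocEnvironment and InvestigatorAgent are invented interfaces standing in for the real harness.

```python
from typing import Any, Protocol


class SocEnvironment(Protocol):
    """Hypothetical stand-in for the simulated tenant (not the real ExCyTIn API)."""
    def reset(self, question: str) -> dict[str, Any]: ...
    def step(self, action: dict[str, Any]) -> tuple[dict[str, Any], float, bool]: ...


class InvestigatorAgent(Protocol):
    """Hypothetical LLM-backed agent that picks the next investigative action."""
    def next_action(self, observation: dict[str, Any]) -> dict[str, Any]: ...


def run_episode(env: SocEnvironment, agent: InvestigatorAgent,
                question: str, max_steps: int = 20) -> tuple[float, list[float]]:
    """Run one investigation and return the aggregate and per-step rewards."""
    observation = env.reset(question)            # e.g. the alert used as starting context
    step_rewards: list[float] = []
    for _ in range(max_steps):
        action = agent.next_action(observation)  # e.g. {"type": "kql", "query": "DeviceLogonEvents | take 10"}
        observation, reward, done = env.step(action)
        step_rewards.append(reward)              # each action is scored, not just the final answer
        if done:                                 # the agent submitted an answer or gave up
            break
    return sum(step_rewards), step_rewards
```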

Why this matters: strengths and improvements over existing benchmarks

Realism and operational fidelity

ExCyTIn‑Bench is built to reflect real SOC pressure points: noisy logs, many heterogeneous tables, chained evidence, and the need to plan an investigation under cost and time constraints. That’s a major step up from chat‑style benchmarks and knowledge tests that measure recall rather than workflow competence. By requiring an agent to act in a simulated environment, the benchmark tests tool use, query formulation, and data navigation — the same skills human analysts use day to day.

Explainability and fine‑grained feedback

Because every question is anchored to nodes and edges in an incident graph, ExCyTIn provides explainable ground truth and per‑action rewards. This gives security teams actionable diagnostics: did the model select the wrong pivot? Did it miss a crucial table? Which step in the logic chain failed? That kind of granularity is useful for model debugging, operator trust, and targeted fine‑tuning.
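
As a toy illustration of that anchoring idea (invented nodes, not ExCyTIn's actual graph data), the snippet below builds a small bipartite alert–entity graph with networkx; the path between a question's start node and end node is the chain of evidence the agent is expected to reconstruct, hop by hop.

```python
import networkx as nx

# Toy bipartite alert-entity graph: alerts on one side, entities (accounts,
# IPs, hosts) on the other; an edge means "this alert involves this entity".
graph = nx.Graph()
graph.add_nodes_from(
    ["alert:phishing_email", "alert:suspicious_signin", "alert:lateral_movement"],
    kind="alert",
)
graph.add_nodes_from(
    ["user:j.doe", "ip:203.0.113.7", "host:WKSTN-0042", "host:SQL-01"],
    kind="entity",
)
graph.add_edges_from([
    ("alert:phishing_email", "user:j.doe"),
    ("alert:phishing_email", "host:WKSTN-0042"),
    ("alert:suspicious_signin", "user:j.doe"),
    ("alert:suspicious_signin", "ip:203.0.113.7"),
    ("alert:lateral_movement", "ip:203.0.113.7"),
    ("alert:lateral_movement", "host:SQL-01"),
])

# The ground-truth chain of evidence between a question's start node and its
# answer node is a path through the graph; each hop is a pivot the agent can
# be rewarded (or penalised) for.
evidence_chain = nx.shortest_path(graph, "alert:phishing_email", "host:SQL-01")
print(evidence_chain)
# ['alert:phishing_email', 'user:j.doe', 'alert:suspicious_signin',
#  'ip:203.0.113.7', 'alert:lateral_movement', 'host:SQL-01']
```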

Open source and community participation

Microsoft’s release of the dataset and framework (published materials and a dataset card on Hugging Face) invites research groups, vendors and customers to benchmark their agents under a common methodology. Open benchmarks typically accelerate progress by establishing a shared evaluation standard and exposing failure modes for collective analysis. Microsoft positions ExCyTIn as complementary to — not a replacement for — other efforts such as CyberSOCEval, while emphasizing agentic, live‑data evaluation.

Cross‑checks and independent validation

Two publicly available technical artefacts corroborate ExCyTIn’s core claims:
  • The publicly accessible preprint on arXiv documents the dataset construction, evaluation pipeline, and key metrics (for example, a base‑model average reward of 0.249 versus 0.368 for the top model in the paper’s experiments), and it details the 57 tables, eight simulated attacks, and 589 Q&A pairs. This is the primary technical specification.
  • A public dataset card and transparency note hosted on Hugging Face publish the data inventory, the size of the logs (≈20.4 GB), and usage guidance; they explicitly state that the logs are synthetic and give the dates of log collection and Q&A generation. These disclosures align with the paper and Microsoft’s blog post.
For benchmarking context, a separate industry effort — CyberSOCEval, recently introduced by CrowdStrike and Meta — targets related SOC tasks (malware analysis and threat intelligence reasoning). CyberSOCEval is complementary but differs in design choices and content focus. That effort’s public announcement and arXiv summary provide a useful comparison when evaluating where ExCyTIn’s agentic approach fills gaps in the benchmarking landscape.

Critical analysis: strengths, caveats, and risks

Strengths

  • Operational alignment: ExCyTIn measures the sequence of decisions an agent makes, which is exactly what matters in incident response — how the model investigates, not just whether its final report superficially matches ground truth. This alignment makes ExCyTIn far more relevant to CISOs and SOC leads than purely static benchmarks.
  • Reproducible, explainable scoring: Anchoring questions to graph nodes/edges creates verifiable ground truth and reduces ambiguity in scoring. Stepwise rewards enable targeted improvements and more meaningful vendor comparisons.
  • Open, extensible design: The dataset and framework being open encourages reproducibility and independent audits, and gives practitioners a starting point for in‑house evaluations. Hugging Face transparency docs also encourage responsible research use by clarifying the synthetic nature of the data.

Caveats and limitations

  • Synthetic data vs. live telemetry: ExCyTIn’s dataset is synthetic (designed to replicate real telemetry patterns), which is the responsible choice for public release. However, performance on synthetic logs is not necessarily predictive of performance on a specific customer’s telemetry distribution, schemas, or threat profile. Microsoft notes future options for customer‑tenant‑level tailoring; doing this safely will require rigorous controls and data governance to avoid exposing PII or operational secrets.
  • Model performance claims are company‑reported: Metrics Microsoft cites (for example, comparative “average reward” numbers and claims about GPT‑5 performance) originate from benchmark runs the company controlled. Those numbers are useful but should be interpreted as Microsoft‑reported results until independently reproduced by third parties. External benchmarking labs and academic groups will need to validate those figures on neutral infrastructure.
  • Scenario coverage and threat diversity: The academic paper describes eight simulated multi‑step attacks. While those are valuable, SOCs face a far broader and evolving threat landscape. Benchmarks inevitably over‑ or underrepresent specific TTPs; high performance on a benchmark does not guarantee robustness to novel adversary techniques or highly instrumented enterprise environments.
  • Adversarial risk and misuse: Benchmarks that reveal how agents pivot across data sources can be double‑edged. Details about queries, pivot logic, and investigative strategies could inform adversaries trying to evade automated detection or weaponize agentic approaches. Responsible disclosure practices and red‑team evaluation are essential when community members use the framework.
  • Evaluation overhead and cost: Microsoft highlights monitoring model performance and cost for product integrations. Running agentic evaluations at scale — especially when tethered to cloud telemetry — consumes compute and storage and may generate non‑trivial Azure costs. Organizations must factor operational cost into any plan to adopt agentic security automation.

Practical implications for CISOs and IT leaders

Security leaders should treat ExCyTIn‑Bench as a practical tool to augment — not replace — human capability and existing validation pipelines. It is best used as part of a layered evaluation strategy that includes:
  • Controlled benchmarking with synthetic datasets such as ExCyTIn to validate high‑level agentic behavior.
  • Pilot testing in a quarantined customer tenant or dev environment using red team scenarios that mirror real workloads and data formats (with strict privacy controls).
  • Ongoing operational metrics: false positive/negative rates, time‑to‑contain improvement, analyst acceptance and workflow fit, and cost tracking for AI inference.
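
For the operational‑metrics bullet above, a minimal baseline report might look like the sketch below; the record layout is an assumption made for illustration and is not an ExCyTIn or Sentinel output format.

```python
from statistics import mean


def baseline_metrics(cases: list[dict]) -> dict[str, float]:
    """Summarise pilot triage records into a simple baseline report.

    Each record is assumed (for illustration) to look like:
      {"predicted_malicious": bool, "actually_malicious": bool,
       "minutes_to_contain": float or None, "inference_cost_usd": float}
    """
    tp = sum(c["predicted_malicious"] and c["actually_malicious"] for c in cases)
    fp = sum(c["predicted_malicious"] and not c["actually_malicious"] for c in cases)
    fn = sum(not c["predicted_malicious"] and c["actually_malicious"] for c in cases)
    contain = [c["minutes_to_contain"] for c in cases if c["minutes_to_contain"] is not None]
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "mean_minutes_to_contain": mean(contain) if contain else 0.0,
        "total_inference_cost_usd": sum(c["inference_cost_usd"] for c in cases),
    }
```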
Key considerations when evaluating vendors or internal builds:
  • Demand explainable per‑action scoring and ask for audit trails showing how an agent reached a decision. ExCyTIn’s stepwise rewards make this request reasonable.
  • Require independent reproduction of vendor benchmark claims in a neutral environment before trusting headline scores. Vendor dashboards are useful, but independent validation reduces the risk of overfitting to a benchmark.
  • Assess governance and vendor commitments around data handling, retention, and model fallbacks — particularly if the vendor offers tenant‑level benchmark tailoring. Customization that touches live tenant data introduces privacy and compliance responsibilities.

The competitive context: where ExCyTIn fits among other benchmarks

Industry benchmarking activity has accelerated. CrowdStrike and Meta’s CyberSOCEval suite focuses on malware analysis and threat intelligence reasoning, offering another open resource to measure model capabilities in complementary SOC tasks. CyberSOCEval emphasizes realistic artifacts like sandbox detonation outputs and threat‑intelligence reasoning scenarios, while ExCyTIn targets agentic, multi‑hop investigations across noisy log stores. Together, these benchmarks give security teams multiple angles for assessing AI tools: tactical artifact analysis and workflow‑level investigative competence.
Academic benchmark suites such as CyberSecEval and its derivatives have also explored aspects of cyber reasoning and safety. ExCyTIn’s distinctive contribution is its explicit agentic environment and per‑action reward framing, which is a valuable addition to the broader benchmarking ecosystem.

Recommended next steps for practitioners

  • Download the ExCyTIn dataset card and transparency documentation to understand what is synthetic and to replicate the environment on isolated infrastructure. Treat the Hugging Face release as the starting point for reproducible testing.
  • Run baseline tests with your current automation and analyst workflows to establish a performance and cost baseline before introducing LLM agent experiments. Use ExCyTIn’s per‑step scoring to identify where your toolchain (parsers, query builders, connectors) breaks down; a simple breakdown sketch follows this list.
  • If planning tenant‑level tailoring or pilot deployments, craft privacy and legal reviews that document what telemetry may be used and how models, queries and outputs will be logged, retained, and protected. Design a clear rollback/fallback process when agents provide uncertain or dangerous recommendations.
  • Engage in cross‑industry validation: collaborate with peers and academic partners to reproduce vendor claims and share anonymized failure modes so the community can harden both benchmarks and production systems.
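
Following on from the per‑step scoring point above, one simple way to see where investigations stall is to group step outcomes by toolchain stage, as in the sketch below; the record layout and stage names are assumptions for illustration, not ExCyTIn’s native output.

```python
from collections import defaultdict


def failure_breakdown(step_results: list[dict]) -> dict[str, dict]:
    """Group per-step outcomes by toolchain stage to locate weak links.

    Each step record is assumed (for illustration) to look like:
      {"stage": "parser" | "query_builder" | "connector", "reward": float}
    """
    rewards_by_stage: dict[str, list[float]] = defaultdict(list)
    for step in step_results:
        rewards_by_stage[step["stage"]].append(step["reward"])
    return {
        stage: {
            "steps": len(values),
            "mean_reward": sum(values) / len(values),
            "zero_reward_steps": sum(v == 0 for v in values),
        }
        for stage, values in rewards_by_stage.items()
    }
```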

Governance, safety and commercial considerations

ExCyTIn’s release comes at a time when vendors and cloud providers are rapidly adding third‑party models to productivity and security platforms. Microsoft began supporting Anthropic’s Claude models in Copilot Studio and Microsoft 365 Copilot in late September 2025, explicitly offering customers model choice for reasoning and orchestration tasks; that move underscores the need for consistent benchmarks that measure how different underlying models behave in security contexts. But model choice also introduces complexity: different providers have different hosting, terms, and compliance implications that affect enterprise risk.
From a procurement perspective, ExCyTIn helps translate abstract model comparisons into operational outcomes — but buyers must budget for the real cost of agentic evaluation (inference, storage, and analyst time), and insist on independent verification of performance and safety claims.

Conclusion

ExCyTIn‑Bench is a meaningful, practical advance in the evaluation of AI for security operations: it abandons toy questions and static evidence in favour of agentic, stepwise investigation in noisy, multitable environments. That design better matches the work SOCs actually do, and it delivers the kind of explainable, actionable feedback security teams need to judge AI products. The framework’s openness and its dataset documentation give researchers and buyers the tools to reproduce results and to iterate on models and operator workflows.
At the same time, this release highlights persistent gaps: synthetic data does not fully stand in for production telemetry, vendor‑reported headline metrics require independent reproduction, and tenant‑level customization raises privacy and governance questions that must be addressed before the benchmark’s agentic approach is applied against live customer data. Security leaders, procurement teams, and researchers should therefore treat ExCyTIn as a powerful component of a broader, defense‑in‑depth evaluation program — one that combines neutral benchmarking, controlled pilots, cost analysis, and rigorous governance to safely extract real operational value from AI in security.


Source: verdict.co.uk Microsoft launches open-source tool to assess AI performance
 
