Hallucinations generated by language models pose one of the most formidable challenges in the modern AI landscape, especially as real-world applications increasingly depend on multi-step workflows and layered generative interactions. Microsoft’s introduction of VeriTrail marks a significant step forward in closed-domain hallucination detection, not merely by identifying unsupported model outputs, but by pioneering full traceability throughout the entire generative process. This innovative approach promises to empower enterprises, researchers, and developers to build trustworthy AI systems that can be readily verified, debugged, and improved—even when underlying workflows are complex and deeply multi-layered.

Background: The Rise of Multi-Step Language Model Workflows

Traditional language model (LM) applications—answering questions, summarizing documents, or drafting proposals—have typically focused on generating a single output from a fixed source. Detecting so-called closed-domain hallucinations (outputs not directly supported by inputs) was, in this setting, a straightforward comparison of the final output against the reference material.
However, modern LLM-powered systems rarely operate in such isolation. Increasingly, they mirror the architecture of advanced agentic workflows—such as GraphRAG or hierarchical summarization systems—where:
  • The LM creates multiple intermediate outputs,
  • Each stage may use earlier outputs as its input,
  • The process forms a directed acyclic graph (DAG) culminating in a final synthesis.
In this paradigm, hallucination does not merely threaten the output’s fidelity. It becomes a systemic risk—potentially cascading error through each generative layer, amplifying mistakes or obscuring the provenance and logic behind every result. Traditional “single output” hallucination detectors, which lack visibility into intermediate steps, are thus rendered insufficient for these modern workflows.

Introducing VeriTrail: Traceable Hallucination Detection​

VeriTrail was developed to address precisely this evolving landscape: detecting hallucinations in complex, multi-step AI workflows while providing full traceability and provenance for every claim present in the model’s output.

Two Pillars of Traceability​

VeriTrail’s innovation rests on two mutually reinforcing traceability principles:
  • Provenance: Any content in the final output should be traceable, step by step, through all intermediate representations, back to the original source materials.
  • Error Localization: When hallucination is detected in the final output, VeriTrail can identify the precise stage—or even node—where unsupported content or error first entered the pipeline.
By extending beyond mere final output verification, VeriTrail enables unprecedented accuracy in detecting, analyzing, and ultimately preventing the propagation of spurious information.

How VeriTrail Works: Inside the Detection Pipeline​

Modelling Generative Processes as DAGs​

VeriTrail encodes the full generative process as a directed acyclic graph (DAG):
  • Nodes represent units of text: source material, intermediate outputs, or the final answer.
  • Edges indicate explicit input-output relationships, mapping how each output is derived.
Each node is assigned a unique ID and a stage, reflecting its position within the generative pipeline. This structure is flexible enough to accommodate arbitrarily complex workflows—be it GraphRAG, hierarchical summarization, or advanced agentic compositions.
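As a rough illustration of this structure, the graph could be represented with something like the following sketch. The Node class, field names, and toy workflow here are hypothetical and are not VeriTrail's actual data model.

```python
# Hypothetical sketch of how a generative process could be encoded as a DAG.
# The class and field names are illustrative, not VeriTrail's actual data model.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str          # unique identifier
    stage: int            # position in the generative pipeline (0 = source material)
    text: str             # source text, intermediate output, or final answer
    inputs: list = field(default_factory=list)  # IDs of the nodes this text was derived from

# A toy hierarchical-summarization workflow: two source chunks -> two partial
# summaries -> one final synthesized answer.
nodes = {
    "src_1": Node("src_1", stage=0, text="...chunk 1 of the source document..."),
    "src_2": Node("src_2", stage=0, text="...chunk 2 of the source document..."),
    "sum_1": Node("sum_1", stage=1, text="...summary of chunk 1...", inputs=["src_1"]),
    "sum_2": Node("sum_2", stage=1, text="...summary of chunk 2...", inputs=["src_2"]),
    "final": Node("final", stage=2, text="...final synthesized answer...", inputs=["sum_1", "sum_2"]),
}
```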

Stepwise Claim Verification​

The VeriTrail pipeline executes in the following sequential steps:
  • Claim Extraction: Using the Claimify tool, VeriTrail identifies claims—self-contained, verifiable statements—in the final output.
  • Reverse-Order Verification: For each claim, VeriTrail traces backward through the DAG, starting from the final output and moving toward the source.
  • Node and Evidence Selection: At every node in the trace:
      • The system splits the node's text into uniquely identified sentences.
      • An LM is prompted to select sentence IDs that strongly support or refute the claim.
      • Summaries of the selected sentences are optionally generated.
  • Verdict Generation: The LM issues one of three verdicts at each iteration:
      • Fully Supported
      • Not Fully Supported
      • Inconclusive
If supporting evidence is found, verification continues to the preceding nodes; if not, the search is broadened, and once consecutive "Not Fully Supported" verdicts reach a user-defined threshold, verification stops and the final verdict is registered.
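Putting these steps together, the reverse-order loop might look roughly like the sketch below, operating over the DAG from the earlier example. The helper functions standing in for LM calls (select_evidence, issue_verdict) and their signatures are assumptions for illustration, not VeriTrail's actual interface.

```python
# Rough sketch of reverse-order claim verification. Assumes the `nodes` dict and
# Node dataclass from the earlier DAG sketch. select_evidence and issue_verdict
# are placeholders for LM calls; their names and behavior are hypothetical.
def split_sentences(text):
    """Naive sentence splitter, used only to keep the sketch self-contained."""
    return [s.strip() for s in text.split(".") if s.strip()]

def select_evidence(claim, candidates):
    """Placeholder for the LM prompt that picks (node_id, sentence_index) evidence pairs."""
    return []  # e.g., [("sum_1", 0), ("sum_1", 2)]

def issue_verdict(claim, evidence):
    """Placeholder for the LM prompt that judges the claim against the selected evidence."""
    return "Fully Supported" if evidence else "Not Fully Supported"

def verify_claim(claim, nodes, final_id="final", max_unsupported=2):
    frontier = list(nodes[final_id].inputs)      # start from the final answer's inputs
    verdicts, trail, unsupported_streak = [], [], 0

    while frontier:
        # Split each candidate node into uniquely identified sentences.
        candidates = [(nid, i, s)
                      for nid in frontier
                      for i, s in enumerate(split_sentences(nodes[nid].text))]
        evidence = select_evidence(claim, candidates)
        verdict = issue_verdict(claim, evidence)
        verdicts.append(verdict)
        trail.extend(evidence)

        if verdict == "Not Fully Supported":
            unsupported_streak += 1
            if unsupported_streak >= max_unsupported:
                break                             # threshold reached: stop early
            # Broaden the search to all inputs of the current frontier.
            next_ids = {pid for nid in frontier for pid in nodes[nid].inputs}
        else:
            unsupported_streak = 0
            # Narrow the search to inputs of nodes that actually supplied evidence.
            evidence_nodes = {nid for nid, _ in evidence} if evidence else set(frontier)
            next_ids = {pid for nid in evidence_nodes for pid in nodes[nid].inputs}
        frontier = list(next_ids)

    return verdicts, trail
```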

Evidence Trail and Output​

VeriTrail’s results for each claim consist of:
  • The chain of verdicts issued at each step,
  • An evidence trail: sentence IDs, their originating node IDs, and corresponding summaries.
This trail affords users a practical way to audit, verify, and comprehend complex workflows without manually parsing massive intermediate outputs.
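For a single claim, the returned information might look something like the following illustrative structure; the field names and example values are assumptions, not VeriTrail's actual output format.

```python
# Hypothetical shape of a per-claim result: one verdict per iteration plus the
# evidence trail of sentence IDs, originating node IDs, and summaries.
result = {
    "claim": "The report recommends a phased rollout.",
    "verdicts": ["Fully Supported", "Fully Supported"],
    "final_verdict": "Fully Supported",
    "evidence_trail": [
        {"node_id": "sum_2", "sentence_ids": [1, 4], "summary": "Intermediate summary describes the phased plan."},
        {"node_id": "src_2", "sentence_ids": [12], "summary": "Source text proposes rolling out in phases."},
    ],
}
```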

Case Study: GraphRAG Workflow in Practice​

To illustrate VeriTrail’s process, consider GraphRAG, a multi-stage information extraction and question-answering pipeline:

GraphRAG Stages​

  • Chunking: Split source text into segments.
  • Entity/Relationship Extraction: LM processes each chunk to extract entities, relationships, and descriptions.
  • Summarization: If an item appears in multiple chunks, summaries are generated to consolidate redundant information.
  • Knowledge Graph Building: Construction of a system-wide knowledge graph; community detection organizes entities into communities.
  • Community Reporting: LM generates high-level reports for each community.
  • Answer Synthesis: For end-user queries, map-level answers are synthesized into a final response.

Tracing “Fully Supported” Claims​

Suppose a claim in the final answer is valid and not hallucinated:
  • VeriTrail begins at the final node, examining all immediate input nodes for sentences supporting the claim.
  • If evidence is found and the verdict is "Fully Supported," it moves to upstream input nodes.
  • This continues until the process reaches the original source text, terminating with a final "Fully Supported" verdict.

Tracing Hallucinations: “Not Fully Supported” Claims​

When a claim cannot be substantiated by available evidence:
  • VeriTrail’s configuration allows setting a threshold of consecutive "Not Fully Supported" verdicts before termination (for example, two in succession).
  • The tool broadens the search to all possible input nodes (not just those selected in earlier rounds), ensuring that overlooked evidence doesn’t produce false positives.
  • Once the configured threshold is hit with no supporting evidence, a "Not Fully Supported" verdict is determined, and the error stage is identified—pinpointing exactly where unsupported content likely arose.

Key Features and Optimizations​

Ensuring Evidence Integrity​

A core concern is evidence hallucination by the model itself. VeriTrail mitigates this by:
  • Assigning sentence IDs programmatically,
  • Discarding any IDs returned by the LM that don’t map to actual sentences in the nodes under review,
  • Mapping only verified sentence IDs as legitimate evidence points.
This rigorous scheme guarantees the authenticity of the evidence trail.
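In practice, this kind of safeguard can be as simple as filtering the IDs the LM returns against the programmatically assigned set, as in the minimal sketch below; the function names and ID format are illustrative assumptions.

```python
# Hypothetical sketch: only sentence IDs that were assigned programmatically
# are accepted as evidence; anything else the LM returns is discarded.
def assign_sentence_ids(node_id, sentences):
    """Assign IDs like 'sum_1:0', 'sum_1:1', ... before prompting the LM."""
    return {f"{node_id}:{i}": s for i, s in enumerate(sentences)}

def filter_evidence(lm_returned_ids, valid_ids):
    """Keep only IDs that map to real sentences in the nodes under review."""
    return [sid for sid in lm_returned_ids if sid in valid_ids]

valid = assign_sentence_ids("sum_1", ["First sentence.", "Second sentence."])
print(filter_evidence(["sum_1:1", "sum_1:7", "made_up:0"], valid))  # -> ['sum_1:1']
```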

Targeted Evaluation for Scalability​

To address efficiency and scalability, VeriTrail employs smart heuristics:
  • For “Fully Supported” or “Inconclusive” claims, only input nodes with previously selected evidence undergo further verification, narrowing the search space as the system approaches the original source.
  • Because large workflows often result in nodes of dramatically varying size, this focused strategy limits computational cost—an essential feature for enterprise-scale deployments.

Flexible Graph Handling​

VeriTrail is architected to manage arbitrarily large DAGs, even when they cannot fit within a single LM prompt:
  • Prompts are automatically batched and split as needed,
  • Evidence selection and verdict steps are iteratively rerun if input constraints are breached,
  • Users can tune the number of reruns and the strictness of the “Not Fully Supported” threshold to balance performance and cost.
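One minimal way to picture the batching behavior: when candidate text exceeds a prompt budget, it is split into chunks and the evidence-selection step reruns per chunk, with results merged afterward. The token-estimation and packing logic below is an assumption for illustration, not VeriTrail's actual implementation.

```python
# Hypothetical sketch of splitting candidate sentences into prompt-sized batches.
def batch_by_budget(sentences, max_tokens=4000, tokens_per_word=1.3):
    """Greedily pack sentences into batches that stay under a rough token budget."""
    batches, current, used = [], [], 0
    for sentence in sentences:
        est = int(len(sentence.split()) * tokens_per_word)  # crude token estimate
        if current and used + est > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(sentence)
        used += est
    if current:
        batches.append(current)
    return batches

# Evidence selection would then be rerun once per batch and the verdict step
# repeated over the merged selections.
```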

Comparative Evaluation: VeriTrail Versus Baseline Approaches​

Robust benchmarking of VeriTrail involved two highly distinct use cases:
  • Hierarchical Summarization (with tasks like narrative distillation across fiction or news article collections),
  • GraphRAG Question-Answering (operating on large, diverse documents).
The distinguishing characteristic for both was scale: source material in excess of 100,000 tokens, and real-world DAGs with an average of over 100,000 nodes. Such complexity would overwhelm conventional hallucination detectors.

Benchmark Methods Compared​

Baselines included:
  • Natural Language Inference Models (e.g., AlignScore, INFUSE),
  • Retrieval-Augmented Generation (RAG) mechanisms,
  • Long-context language models (e.g., Gemini 1.5 Pro, GPT-4.1 mini).
While these methods suffice for single-output verification, they fail to provide claim-level traceability through intermediate representations.

Results​

VeriTrail consistently outperformed all baselines in hallucination detection across datasets and LMs, with only one minor exception (where it attained the highest balanced accuracy but not the top macro F1 score). Crucially, VeriTrail’s ability to identify where hallucination enters a workflow, and how valid outputs are derived, sets it apart—bringing an unprecedented level of transparency to multi-step AI applications.

Traceability as a Cornerstone for Trust and Debugging​

By returning not just verdicts but the entire evidence trail, VeriTrail enables several groundbreaking advantages:
  • Human-Auditable Evidence: Users need only review the selected sentences and LM-generated summaries, not the entire DAG, streamlining auditing and error analysis.
  • Pinpoint Error Localization: For unsupported claims, the exact generative stage where hallucination originated is highlighted, enabling targeted debugging and model improvement.
  • Provenance Tracking: Users confidently chart how each claim in the final output relates to the original source, a critical capability for regulated industries or high-stakes automation.

Risks, Limitations, and Future Outlook​

No system is flawless, and VeriTrail’s deployment currently targets research rather than production settings. Key limitations and challenges include:
  • Model Reliability: The trustworthiness of Verdict Generation and Evidence Selection hinges on the underlying LLM’s comprehension and freedom from bias, and can be undermined by prompt injection or adversarial data.
  • Scalability Limits: While designed for massive graphs, very large workflows could still strain system resources or incur significant cost.
  • Human Oversight Required: As with any detection system, final judgment on borderline or highly ambiguous claims may require expert review.
Nevertheless, the architecture is a foundational leap for the community—transforming hallucination detection from a static check to a dynamic, traceable, auditable process.

Conclusion: VeriTrail Sets a New Standard for Transparent and Trustworthy AI​

With multi-step, agentic workflows now at the heart of next-generation AI deployments, the need for traceable, auditable hallucination detection has never been more urgent. VeriTrail delivers on this imperative—tracing every claim from final output through every transformational node back to the source, highlighting error propagation, and equipping practitioners with actionable tools for verification and improvement.
By integrating deep traceability into the core of generative workflow evaluation, Microsoft’s VeriTrail does more than detect hallucinations—it creates the groundwork for explainable and trustworthy AI. As the community presses onward toward ever more capable and complex language models, such innovations will prove pivotal in bridging the gap between raw generative power and the assurances demanded for real-world deployment.

Source: Microsoft VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows
 
