Actions Speak Louder Than Prompts: LLMs Graph Inference with Graph-as-Code

Microsoft Research’s new large-scale study reframes a simple but powerful idea: when LLMs work over graph-structured data, how they are allowed to act matters at least as much as what prompts you feed them.

Background​

Graph-structured data underpins many modern productivity and enterprise systems. A shared document in an organization, for example, is not just a blob of text — it’s a node embedded in a network of collaborators, folders, teams, and related artifacts. These relationships are essential when answering practical questions like whether a document is sensitive, which files to surface in a colleague’s feed, or whether a sharing pattern looks anomalous. Traditional approaches for these tasks use Graph Neural Networks (GNNs) and message-passing techniques, but real-world graphs increasingly include long-form text attributes, heterogeneous labels, and dense connectivity that strain classical pipelines.
Microsoft Research’s study, “Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference,” presents a systematic, controlled exploration of how modern large language models perform on node classification problems when they are given different kinds of access and agency over graphs. The paper evaluates three distinct interaction paradigms — Prompting, GraphTool (a ReAct-style tool-use loop), and Graph-as-Code (where the model generates and executes short programs against a structured API) — across 14 datasets spanning citation, web-link, e-commerce, and social networks. The paper was published on Microsoft’s research site on March 3, 2026 and is available on arXiv.

Why this matters now​

  • Real systems are text-rich and densely connected. In enterprise collaboration, e-commerce catalogs, and fraud networks, node attributes are often long descriptions or full documents, and nodes can have hundreds of neighbors. Packing those neighbors into a single prompt quickly hits practical token limits or forces lossy sampling.
  • LLMs are evolving beyond static prompt-response services toward agentic modes that call tools, query APIs, or generate and execute code. This shift changes the design space: you no longer only optimize prompts — you design interaction protocols. The broader Microsoft ecosystem is moving in this direction as well, with graph-context integrations in productivity features and early work on agentic, tool-capable pipelines.
  • Practitioners who build intelligent features on top of knowledge graphs and content graphs need guidance on trade-offs between accuracy, latency, cost, and security when choosing an LLM interaction mode. This paper provides those practical signals.

Overview of the study: scope, datasets, and interaction modes​

Scope and datasets​

The study evaluates LLM-based graph inference across:
  • 14 benchmark datasets spanning four domains: citation networks, web-link graphs, e-commerce product graphs, and social networks.
  • Structural regimes that include both homophilic graphs (neighbors tend to share labels) and heterophilic graphs (neighbors tend to have different labels).
  • Feature regimes with both short-text node attributes (titles, short names) and long-text node attributes (full descriptions, document bodies).
  • Multiple LLM sizes and reasoning-enabled variants to probe scaling behavior.
This breadth is important: it moves beyond cherry-picked toy graphs and short-text benchmarks to settings that resemble real-world enterprise and consumer systems.

Interaction paradigms compared​

  • Prompting — serialize a node’s k-hop neighborhood and features into text, include label taxonomy, and ask the model to classify in one shot. This is the default mindset for many LLM applications.
  • GraphTool — a ReAct-inspired loop where the model alternates between reasoning and a fixed set of graph actions (retrieve neighbors, fetch features, query labels) and decides the next action iteratively.
  • Graph-as-Code — the LLM writes short programs (for example, simple Python/pandas snippets) that run against a typed graph API or table, letting the model compose arbitrary queries and computations over structure and text.
These strategies form a spectrum of agency: prompting is passive and bounded by context size; GraphTool gives limited interactive agency; Graph-as-Code grants flexible, programmable agency. The paper’s central claim is that more agency enables better accuracy and adaptability, especially in dense, long-text, or high-degree settings.
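To make the contrast concrete, here is a minimal sketch of the Prompting paradigm's core limitation: the entire neighborhood must be serialized into one bounded prompt. The function name, data shapes, and the word-count token proxy are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch: serializing a 1-hop neighborhood for the Prompting
# paradigm. A real system would use a proper tokenizer; whitespace-split
# words stand in as a crude token proxy here.

def serialize_for_prompt(node_id, neighbors, features, labels, token_budget=512):
    """Pack a target node and its neighbors into one prompt, truncating to budget."""
    lines = [f"Target node {node_id}: {features[node_id]}"]
    for n in neighbors.get(node_id, []):
        lines.append(f"Neighbor {n} (label={labels.get(n, '?')}): {features[n]}")
    text = "\n".join(lines)
    words = text.split()
    if len(words) > token_budget:
        # Lossy truncation: with hundreds of long-text neighbors, most of the
        # neighborhood never reaches the model -- the failure mode the more
        # agentic modes are designed to avoid.
        text = " ".join(words[:token_budget]) + " ...[truncated]"
    return text

features = {0: "A survey of graph neural networks", 1: "Attention is all you need"}
neighbors = {0: [1]}
labels = {1: "ML"}
prompt = serialize_for_prompt(0, neighbors, features, labels)
```

GraphTool and Graph-as-Code replace this one-shot serialization with iterative retrieval, so the budget constrains each step rather than the whole neighborhood.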

Key findings — what the experiments show​

1) Graph-as-Code consistently leads the pack​

Across the evaluation suite, Graph-as-Code achieves the strongest overall performance. Its advantage grows in datasets with long node texts and high-degree nodes because it sidesteps token-budget limitations by selectively retrieving and composing only the most relevant information. In short-text, low-degree graphs the gap narrows and prompting can be competitive, but the general trend favors code-generation interfaces for real-world, text-rich graphs.

2) Agentic interaction offers robustness and adaptability​

When the study performs controlled ablations — deleting edges, truncating text, or removing labels — Graph-as-Code demonstrates an ability to adapt its reliance on different information sources. On homophilic graphs it leans on structure; on heterophilic graphs it emphasizes textual features. Prompting degrades predictably when edges or labels are removed, because the single-shot prompt cannot re-plan a different information-acquisition strategy. This emergent adaptability of code-generation over graphs is a fundamental result.

3) Heterophily is not a deal-breaker for LLM-based methods​

A common belief in the LLM-graph literature was that LLMs would fail on heterophilic graphs because neighbor labels are misleading. However, the study finds that all three interaction strategies perform well on several heterophilic benchmarks, and LLMs can extract signal from non-local patterns and rich features rather than relying solely on local neighbor voting. This broadens the applicability of LLM-based reasoning to messy, cross-cutting organizational graphs.

4) Scaling and reasoning capabilities still help — but interaction design matters independently​

Larger and reasoning-enabled models improve performance across the board, but the relative advantage of Graph-as-Code persists at every model scale tested. In other words, interaction design is an axis of improvement that complements, rather than substitutes for, model scaling.

A closer look at Graph-as-Code: why code helps LLMs reason over graphs​

Graph-as-Code gives the LLM a typed view of the graph (for example, a table with node_id, features, neighbors, label) and allows it to generate small programs that:
  • Selectively query neighbors (e.g., fetch top-K by recency or similarity).
  • Apply textual processing (e.g., compute similarity scores, extract named entities).
  • Compose structural operations (e.g., compute two-hop aggregates, conditional traversals).
  • Execute label-propagation heuristics or ensemble logic tailored to the node.
This flexibility buys three concrete advantages:
  • Token economy: Instead of serializing hundreds of long neighbor texts into a prompt, the model retrieves only what it needs and runs local computations, keeping the model context compact.
  • Compositional strategies: The model can combine text processing and graph traversals in nontrivial ways (e.g., filter neighbors by organization then compute a TF-IDF similarity to the target node).
  • Emergent algorithm discovery: The LLM can invent ad hoc heuristics for different nodes (e.g., when neighbor labels are noisy, rely on the top two most similar neighbor texts rather than majority vote).
These properties make Graph-as-Code especially attractive for dense, real-world graphs such as collaborative content graphs, product graphs with long descriptions, and fraud networks with complex cross-cutting links.
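A hedged sketch of what such a generated program might look like, using the table schema described above (node_id, text, neighbors, label). The token-overlap similarity heuristic and top-k neighbor vote are illustrative stand-ins, not the paper's actual generated code.

```python
# Illustrative Graph-as-Code style program over a typed pandas table.
import pandas as pd

graph = pd.DataFrame({
    "node_id":   [0, 1, 2, 3],
    "text":      ["graph neural network survey",
                  "message passing on graph data",
                  "transformer language model",
                  "graph attention network"],
    "neighbors": [[1, 2, 3], [0], [0], [0]],
    "label":     [None, "graph", "nlp", "graph"],  # node 0 is the target
})

def token_overlap(a, b):
    """Jaccard similarity over whitespace tokens -- a cheap text signal."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def classify(target_id, k=2):
    """Predict the target's label from its k most text-similar neighbors."""
    target = graph.set_index("node_id").loc[target_id]
    nbrs = graph[graph["node_id"].isin(target["neighbors"])].copy()
    nbrs["sim"] = nbrs["text"].map(lambda t: token_overlap(t, target["text"]))
    top = nbrs.nlargest(k, "sim")        # selective retrieval: top-k only
    return top["label"].mode().iloc[0]   # similarity-filtered neighbor vote

prediction = classify(0)
```

The key property is that only the top-k neighbor texts ever need to enter the model's context; the filtering and scoring run as ordinary code against the table.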

Practical design principles distilled from the study​

The authors translate their findings into clear guidance for engineers and architects building LLM-graph systems:
  • Match the interaction mode to graph characteristics. Use Prompting for small, sparse graphs with short text; prefer Graph-as-Code as graphs grow denser or texts get longer.
  • Don’t dismiss LLMs on heterophilic graphs — test interaction modes before assuming failure.
  • Think beyond prompt engineering. Invest in richer interfaces (tooling, APIs, code execution) that let models plan and act.
  • Evaluate adaptively. Run targeted ablations (remove edges, truncate text, drop labels) to understand which signals your model relies on, and harden accordingly.
These are not theoretical suggestions — they are directly grounded in experiments spanning diverse datasets and model families.
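The ablation advice above can be sketched as a small harness that perturbs one signal at a time and re-scores a fixed classifier. The toy majority-vote baseline and the perturbation rates are assumptions for illustration, not the paper's experimental setup.

```python
# Hedged sketch of an ablation protocol: drop edges, truncate text, or hide
# labels, then compare predictions against the unperturbed baseline.
import copy
import random

def classify(node, graph):
    """Toy baseline: majority label among a node's neighbors, else 'unknown'."""
    labels = [graph["labels"][n] for n in graph["edges"].get(node, [])
              if graph["labels"].get(n) is not None]
    return max(set(labels), key=labels.count) if labels else "unknown"

def ablate(graph, mode, rng):
    """Return a perturbed copy of the graph for one ablation mode."""
    g = copy.deepcopy(graph)
    if mode == "drop_edges":
        g["edges"] = {n: [m for m in ms if rng.random() > 0.5]
                      for n, ms in g["edges"].items()}
    elif mode == "truncate_text":
        g["text"] = {n: t[: len(t) // 2] for n, t in g["text"].items()}
    elif mode == "hide_labels":
        g["labels"] = {n: None for n in g["labels"]}
    return g

graph = {
    "edges":  {0: [1, 2], 1: [0], 2: [0]},
    "text":   {0: "shared design doc", 1: "design review", 2: "budget sheet"},
    "labels": {0: None, 1: "eng", 2: "eng"},
}
rng = random.Random(0)
baseline  = classify(0, graph)                               # uses neighbor labels
no_labels = classify(0, ablate(graph, "hide_labels", rng))   # signal removed
```

Running each mode over a held-out node set and tracking the accuracy drop reveals which signal a given pipeline actually depends on.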

Critical analysis: strengths, trade-offs, and practical caveats​

Strengths and advances​

  • The paper’s controlled, broad evaluation is a major empirical contribution. It goes beyond single-benchmark claims and surfaces consistent patterns about interaction design across domains.
  • The demonstration that agency — not just model size or prompt quality — unlocks significant gains is timely. As organizations invest in agentic LLM deployments, this work gives a rigorous empirical foundation for building graph-aware agents.
  • The ablation studies are particularly valuable because they reveal how different modes shift reliance between structure, text, and label signals. That provides actionable diagnostics for system hardening and data-collection priorities.

Trade-offs and engineering costs​

  • Running generated code in production raises operational complexity. You must provision safe sandboxes, control resource usage, and audit program execution paths. These are nontrivial additions compared to simple prompt-based calls.
  • Latency and cost: selective retrieval plus on-the-fly computation can increase round-trip steps, so engineers must balance accuracy gains against inference latency and token/API costs.
  • Maintainability: letting a model generate arbitrary code means you need robust monitoring, deterministic fallbacks, and versioned APIs to avoid brittle, data-dependent behaviors.

Security, privacy, and governance risks​

  • Executing model-generated programs over internal graphs exposes attack surfaces: malicious or buggy code generation, prompt injection that causes undesired API calls, or leakage of sensitive attributes via logs. Rigorous sandboxing and least-privilege APIs are essential.
  • Data governance: models that adaptively fetch only some neighbors might inadvertently access data that violates policy constraints (e.g., cross-tenant signals, PII). Fine-grained access control and policy checks at the API layer are required.
  • Auditability and reproducibility: model-generated reasoning chains and code must be logged with context to meet compliance needs; reproducing a given prediction may need replayable retrieval traces and seeds.
These considerations imply that adopting Graph-as-Code in production is not a drop-in replacement for prompting; it is a design shift that requires collaboration across ML, engineering, privacy, and security teams.
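The access-control and redaction points above amount to a least-privilege API boundary in front of the graph. A minimal sketch, assuming a tenant-scoped caller and a crude email-matching redaction rule (real systems would plug in their own governance logic here):

```python
# Hypothetical policy-guarded, read-only graph view. Class and method names
# are illustrative, not a real library API.
import re

PII_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")  # crude email matcher

class PolicyGuardedGraph:
    """Read-only view that filters neighbors by tenant and redacts PII."""
    def __init__(self, nodes, edges, caller_tenant):
        self._nodes = nodes      # node_id -> {"tenant": ..., "text": ...}
        self._edges = edges      # node_id -> [neighbor ids]
        self._tenant = caller_tenant

    def neighbors(self, node_id, limit=50):
        # Policy 1: never return cross-tenant nodes; Policy 2: bound fan-out.
        allowed = [n for n in self._edges.get(node_id, [])
                   if self._nodes[n]["tenant"] == self._tenant]
        return allowed[:limit]

    def text(self, node_id):
        # Policy 3: redact PII before any text leaves the API boundary.
        if self._nodes[node_id]["tenant"] != self._tenant:
            raise PermissionError(f"node {node_id} is outside caller tenant")
        return PII_PATTERN.sub("[REDACTED]", self._nodes[node_id]["text"])

nodes = {
    0: {"tenant": "A", "text": "Q3 plan, contact alice@example.com"},
    1: {"tenant": "A", "text": "design notes"},
    2: {"tenant": "B", "text": "other tenant data"},
}
api = PolicyGuardedGraph(nodes, {0: [1, 2]}, caller_tenant="A")
```

Because every retrieval the model-generated code performs passes through this layer, policy enforcement does not depend on the model behaving well.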

Implementation checklist: what teams should prepare before adopting Graph-as-Code​

  • Prepare a typed, queryable graph interface (e.g., a read-only API returning bounded neighbor lists and text features).
  • Build a safe execution environment with resource limits and deterministic logging for LLM-generated code.
  • Implement policy filters for retrieval (access controls, PII redaction) and for outputs (sanitization and post-hoc checks).
  • Provide model-level guardrails: the set of tools the model may call, a schema of allowable operations, and automated fallbacks when the model requests a disallowed action.
  • Instrument extensive offline testing: ablations, adversarial prompts, and drift detection to understand how the model’s behavior changes as the graph or features evolve.
Following these steps can reduce deployment risk while preserving the accuracy and adaptability gains shown in the study.
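One way to realize the guardrail item on the checklist is an explicit allowlist dispatcher: the model can only request named operations, everything is logged, and disallowed requests hit a deterministic fallback instead of an error. The operation names and fallback policy below are assumptions for the sketch.

```python
# Minimal allowlisted-operation dispatcher for model-issued graph requests.
ALLOWED_OPS = {
    "get_neighbors": lambda graph, node: graph["edges"].get(node, [])[:50],
    "get_text":      lambda graph, node: graph["text"].get(node, ""),
}

def execute_request(graph, op_name, node, audit_log):
    """Dispatch a model-requested op; log it; fall back on disallowed ops."""
    audit_log.append((op_name, node))  # deterministic audit trail, logged first
    handler = ALLOWED_OPS.get(op_name)
    if handler is None:
        # Never surface raw errors to the model; return a structured denial
        # so the calling agent can abstain or re-plan.
        return {"status": "denied", "fallback": "abstain"}
    return {"status": "ok", "result": handler(graph, node)}

graph = {"edges": {0: [1, 2]}, "text": {1: "design notes"}}
log = []
ok     = execute_request(graph, "get_neighbors", 0, log)
denied = execute_request(graph, "delete_node", 0, log)
```

Logging before dispatch means even denied requests leave an audit record, which matters for the compliance points raised earlier.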

Broader implications for enterprise AI product design​

The paper’s message — that actions speak louder than prompts — aligns with a growing industry trend: integrating LLMs as orchestrators that interact with structured data sources, not simply as text transformers. Microsoft’s product work (for example, graph-context features inside productivity tools and agentic integrations) reflects this shift toward agentic, data-aware assistants. For product teams, this implies:
  • Reimagining UX for AI features: instead of a single chat box, consider multimodal interfaces where AI can request clarifying actions, fetch documents, or run scoped analyses on demand.
  • Building data contracts and APIs that expose semantics safely (typed tables, sanitized feature views) rather than dumping raw text into prompts.
  • Investing in observability and human-in-the-loop review mechanisms for critical decisions (sensitive labeling, compliance flags).
These changes are foundational rather than incremental; they alter how teams conceive feature boundaries and model responsibilities.

What the paper does not solve (open challenges)​

  • Rigorous cost–benefit characterization at production scale. The study focuses on accuracy and controlled ablations; deploying Graph-as-Code at enterprise scale raises unanswered questions about throughput, caching strategies, and total cost of ownership.
  • Security-hardening patterns for model-generated code at scale. While the paper acknowledges the need for secure execution, concrete, battle-tested design patterns for safe code execution in adversarial settings remain an open engineering problem.
  • Cross-model reproducibility under changing graph dynamics. Graphs evolve; understanding how model-generated heuristics generalize over time and across distribution shift is an area ripe for follow-up research.
These gaps point to a research and engineering agenda that spans systems, security, and human oversight.

Actionable recommendations for practitioners (quick start)​

  • If your graph is small and node texts are short: prototype with Prompting to validate feasibility quickly.
  • If node text is long, neighbors are many, or labels are noisy: prioritize a Graph-as-Code prototype with a sandboxed execution layer.
  • Implement concise ablation tests (edge removal, text truncation, label hiding) to reveal signal dependencies before productionizing.
  • Add policy-enforced retrieval filters and execution guards before giving models any write-capable or sensitive-read actions.
  • Monitor for drift and log retrieval+execution traces for each decision to allow post-hoc audits and reproducibility.
These steps help teams get practical benefits without exposing themselves to undue operational or security risk.
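The last item, logging retrieval and execution traces per decision, might look like the sketch below. The trace shape (seed, model version, hashed steps) is an assumption about what a replayable audit record needs, not a standard format.

```python
# Sketch of per-decision trace logging for post-hoc audit and replay.
import hashlib
import json
import time

def make_trace(node_id, model_version, seed):
    """Open a trace with the context needed to reproduce the prediction."""
    return {"node_id": node_id, "model": model_version, "seed": seed,
            "started_at": time.time(), "steps": []}

def log_step(trace, kind, payload):
    """Append a retrieval/execution step; hash payloads for tamper evidence."""
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    trace["steps"].append({"kind": kind, "sha256": digest, "payload": payload})

trace = make_trace(node_id=0, model_version="llm-v1", seed=42)
log_step(trace, "retrieval", {"op": "get_neighbors", "node": 0, "result": [1, 2]})
log_step(trace, "execution", {"code": "top_k = 2", "output": "graph"})
```

Replaying a prediction then reduces to re-running the logged steps with the recorded seed and model version, which is what the auditability requirements above call for.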

Conclusion​

“Actions Speak Louder than Prompts” supplies more than an empirical result — it reframes how we should think about LLMs in graphed environments. The study’s clean experimental design and broad scope show that granting LLMs structured agency — the ability to query, compose, and execute over graphs — unlocks measurable accuracy and robustness gains, especially in the dense, long-text networks that characterize real-world systems. At the same time, the work is honest about trade-offs: operational complexity, governance, and security must be addressed when moving from prompts to code.
For teams building intelligent features on collaborative platforms, e-commerce catalogs, fraud-detection networks, or social systems, the practical takeaway is straightforward: invest in richer, controlled interaction interfaces that let models act—safely and auditably—on your data. The accuracy gains are not merely incremental; they represent a conceptual shift toward agentic, programmatic model behavior that better mirrors the demands of real-world graph reasoning.

Source: Microsoft https://www.microsoft.com/en-us/res...s-rethinking-how-llms-reason-over-graph-data/