ClickHouse’s new integration with Microsoft OneLake delivers a direct bridge between two fast-moving pieces of the modern data stack: the high‑performance, real‑time analytics engine ClickHouse and Microsoft Fabric’s tenant‑wide logical lake, OneLake. Announced in mid‑November 2025, the integration lets ClickHouse discover and query Iceberg tables surfaced by OneLake’s Table APIs, enabling low‑latency, high‑concurrency analytics on data that remains governed and cataloged inside OneLake. This is a significant step for organizations pursuing a lakehouse architecture that demands both open‑format interoperability and real‑time performance.
Background / Overview
Enterprises are standardizing on lakehouse patterns—storing data in open formats (Apache Iceberg, Delta Lake, Parquet) inside a single, governed surface while letting best‑of‑breed engines operate on that same data. Microsoft’s OneLake is explicitly designed to be this tenant‑wide, logical lake with format virtualization, shortcuts, mirroring, and Table APIs that expose table metadata and make in‑place reads possible for external engines. ClickHouse’s announcement leverages OneLake’s Iceberg REST catalog endpoint so ClickHouse can discover Iceberg metadata and push queries down to data kept inside OneLake. These technical building blocks are now available in preview/beta across the vendors’ stacks.
ClickHouse frames the integration as part of broader interoperability with Microsoft: ClickHouse Cloud is available on Azure, and the vendor has worked to improve ClickHouse performance on Azure Blob Storage and to make cloud onboarding smoother. Microsoft’s OneLake Table APIs themselves are still flagged as preview features and include explicit limitations (metadata read operations first; metadata write operations are staged for future releases), which shapes how organizations should plan adoption. Both vendors are clear that the initial release is focused on read/query interoperability and that write‑back and full metadata write operations are on the roadmap.
What was announced — the essentials
- ClickHouse now supports direct querying of Apache Iceberg tables exposed by Microsoft OneLake’s Table APIs. This capability began shipping as a beta feature in ClickHouse’s November release (25.11) and is being rolled into ClickHouse Cloud shortly after.
- The integration uses OneLake’s Iceberg REST catalog endpoint to discover namespaces and tables and to obtain metadata needed by ClickHouse to execute reads against the underlying Parquet files. Authentication is handled via Microsoft Entra ID tokens (formerly Azure AD).
- Microsoft’s OneLake supports format virtualization between Delta Lake and Apache Iceberg formats; OneLake will expose virtualized metadata so consumers using Iceberg readers (like this ClickHouse integration) can read tables that may have been written in Delta format. The OneLake APIs are currently in preview and have documented limitations regarding metadata writes.
Why this matters: practical value for data teams
1. Reduced data movement, retained governance
OneLake’s design goal is to let many engines query the same underlying files without duplicating data. By enabling ClickHouse to query those tables directly, organizations can avoid ETL or rehydration steps for workloads that need real‑time insight on governed, cataloged data. This reduces copy sprawl and simplifies governance because the authoritative metadata and lineage remain in OneLake rather than being dispersed across engine‑specific silos.
2. Real‑time analytics and operational dashboards
ClickHouse’s engine is optimized for low‑latency analytical queries at very high concurrency. When combined with OneLake as the single lakehouse, this enables new operational workloads such as high‑frequency analytics, live dashboards, observability pipelines, and agent‑facing analytics where AI agents or copilots require fast, deterministic answers against large datasets. ClickHouse and Microsoft both highlight use cases like observability, fraud detection, and real‑time telemetry analytics as primary beneficiaries.
3. Open formats and ecosystem flexibility
By relying on Iceberg/Delta and Parquet, the integration preserves portability: teams can continue to use Spark, Databricks, Fabric workloads, or ClickHouse against the same data. This reduces lock‑in and supports mixed‑engine architectures where the “right tool for the job” can be selected for each workload. The integration explicitly leans on OneLake’s virtualization layer so a table written as Delta in Fabric can still be read by Iceberg clients.
How the integration works — technical snapshot
Discovery and catalog access
- OneLake exposes an Iceberg REST Catalog endpoint at https://onelake.table.fabric.microsoft.com/iceberg. ClickHouse connects to that endpoint as a catalog, enumerates namespaces and tables, and reads table metadata (partitions, manifests, file lists) to plan queries. The OneLake Table APIs provide the set of GET operations ClickHouse needs to drive discovery and reads.
Query path and security
- ClickHouse creates a DataLake catalog configuration that points to the OneLake Iceberg catalog. ClickHouse then issues SQL queries as it normally would; the ClickHouse engine resolves the table metadata via the catalog and performs reads against the Parquet/manifest files referenced by OneLake metadata. Authentication is delegated through Microsoft Entra ID tokens—service principals or workspace permissions—so access control remains centralized in OneLake/Fabric. The ClickHouse team published example DDL and connection settings in their technical post and in the 25.11 documentation.
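Conceptually, the catalog configuration is a single DDL statement. The sketch below is illustrative rather than the exact published DDL: the `DataLakeCatalog` engine and `catalog_type = 'rest'` follow ClickHouse’s documented data‑lake catalog support, but the workspace path and the authentication setting shown here are placeholder assumptions—check the 25.11 release documentation for the precise option names.

```sql
-- Illustrative sketch: register OneLake's Iceberg REST catalog as a
-- ClickHouse database. The warehouse path and auth setting are
-- placeholders, not the exact documented option names.
CREATE DATABASE onelake_catalog
ENGINE = DataLakeCatalog('https://onelake.table.fabric.microsoft.com/iceberg')
SETTINGS
    catalog_type = 'rest',
    warehouse = 'MyWorkspace/MyLakehouse.Lakehouse',  -- hypothetical workspace/item path
    auth_header = 'Bearer <entra-id-token>';          -- Entra ID token (service principal)
```

Once the database exists, the catalog behaves like any other ClickHouse database: table resolution happens lazily at query time against OneLake’s metadata.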
Current limitations (important to plan for)
- OneLake Table APIs are currently documented as preview; many operations are read‑focused and metadata write operations are not yet generally available. Expect the integration to start as read‑only (for Iceberg metadata) and evolve toward write support in later releases.
Getting started — what engineers will do first
The ClickHouse blog shows the initial steps required to connect ClickHouse to OneLake. In practice, a typical onboarding sequence looks like:
- Provision a ClickHouse instance (local, self‑managed, or ClickHouse Cloud) using a release that contains the OneLake Iceberg catalog support (25.11+).
- Ensure your Fabric tenant/workspace has the OneLake Table API enabled and that the tables you want to query are accessible (Iceberg folders with metadata or virtualized Delta tables). Confirm that Delta ↔ Iceberg conversion/virtualization is configured if needed.
- Create a catalog in ClickHouse that points to the OneLake Iceberg endpoint and supply Entra ID credentials (service principal or delegated token) with appropriate scope to read metadata and storage.
- Run discovery commands (SHOW TABLES FROM onelake_catalog) and then execute queries (SELECT … FROM onelake_catalog.schema.table) as you would against any other ClickHouse database.
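Assuming a catalog named `onelake_catalog` has been created, the day‑one discovery‑and‑query loop from the steps above looks like the following (schema and table names are hypothetical):

```sql
-- Enumerate the tables OneLake's Iceberg catalog exposes.
SHOW TABLES FROM onelake_catalog;

-- Query a table exactly as you would a native ClickHouse table;
-- metadata resolution and Parquet reads go through the catalog.
SELECT
    status,
    count() AS orders
FROM onelake_catalog.`sales.orders`   -- hypothetical schema.table
GROUP BY status
ORDER BY orders DESC
LIMIT 10;
```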
Performance, cost, and operational considerations
Performance tradeoffs
ClickHouse’s strengths are clear in CPU‑bound, highly selective query patterns and in aggregations over large datasets. But when ClickHouse reads files sitting in object storage via OneLake, performance becomes a function of:
- Storage tier and locality (hot vs. cool vs. archive), which affects read latency.
- OneLake shortcut caching and any workspace‑level caching policies.
- The volume of scanned data and whether predicate pushdown and partition pruning are effective for the specific query pattern.
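The pruning point is worth making concrete. In the hypothetical queries below (table and column names are illustrative), the first predicate filters directly on an assumed Iceberg partition column, letting file‑level statistics skip most data files; the second wraps the same column in a function, which typically defeats partition pruning and forces a much larger scan:

```sql
-- Pruning-friendly: a direct filter on the (assumed) partition column
-- lets Iceberg manifests exclude non-matching files before any read.
SELECT count()
FROM onelake_catalog.`telemetry.events`
WHERE event_date = '2025-11-01';

-- Scan-heavy: applying a function to the partition column usually
-- disables pruning, so every data file must be opened and filtered.
SELECT count()
FROM onelake_catalog.`telemetry.events`
WHERE toMonth(event_date) = 11;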
Cost modeling
Querying large Parquet datasets directly from object storage can increase egress and I/O cost if not controlled. OneLake caching can reduce repeated egress, but cache sizing and retention windows must be planned and tested. Operational teams should model:
- Per‑query I/O cost (scans × bytes read) vs. pre‑aggregating or staging to hot analytics storage.
- Cache hit rates and the impact of concurrency on cache efficiency.
- Chargeback/FinOps policies when backups or shared data are used by multiple teams.
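A back‑of‑envelope version of this model can be run directly in ClickHouse SQL. Every number below is a hypothetical assumption to be replaced with your tenant’s actual rates and measured workload figures:

```sql
-- Hypothetical unit prices and workload figures; substitute measured values.
WITH
    0.02 AS usd_per_gb_read,      -- assumed I/O + egress price per GB
    50   AS gb_scanned_per_query, -- assumed average scan per query
    2000 AS queries_per_day,      -- assumed daily query volume
    0.30 AS cache_hit_rate        -- assumed fraction served from cache
SELECT
    gb_scanned_per_query * usd_per_gb_read * queries_per_day
        * (1 - cache_hit_rate) AS est_daily_scan_cost_usd;
```

Comparing this estimate against the cost of staging or pre‑aggregating the same data into hot analytics storage gives a first‑order answer to the copy‑vs‑query‑in‑place decision.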
Data consistency and restore semantics
If enterprises treat certain OneLake artifacts as backups or point‑in‑time snapshots, they must verify that the Iceberg/Delta conversion preserves the application‑consistent semantics required for restores. Converting backups into queryable tables is powerful, but it must not impair disaster‑recovery playbooks. The integration’s read‑first posture places responsibility on customers to validate restore workflows.
Governance, security, and compliance
- Authentication and authorization remain centralized: OneLake uses Microsoft Entra ID and Fabric workspace permissions. ClickHouse relies on those tokens for access, meaning policy and lineage live with the OneLake catalog. This preserves a single control plane for access and auditing.
- However, exposing previously isolated artifacts (backups, mirrored datasets, or raw logs) to more consumer engines increases the attack surface. Organizations must map RBAC, retention, and legal‑hold semantics into OneLake’s catalog and ensure audit logs, SIEM integration, and least‑privilege tests are part of any rollout. OneLake and Fabric provide the primitives, but tenant admins must enforce them.
- Data lineage and catalog visibility are preserved because OneLake manages the metadata surface. This helps compliance and simplifies traceability for governed AI/GenAI scenarios where provenance is essential.
Use cases that benefit most (and those that don’t)
Best fits
- Real‑time observability and monitoring where teams need fast analytics on large telemetry streams without duplicating data.
- Cybersecurity and fraud detection use cases that require combining historical logs with near‑real‑time event streams and supporting high‑concurrency queries.
- Agent‑facing analytics where AI agents or Copilots need deterministic analytics (counts, aggregates) and semantic search from a single, governed surface.
Poor fits (or cases to evaluate carefully)
- Very heavy full‑table scans against large cold/archival tiers without effective pruning or caching—these can be expensive and slow.
- Workloads that require frequent metadata writes or transactional semantics that aren’t supported yet by OneLake Table APIs (metadata write operations remain a roadmap item).
Vendor messaging vs. realities — a critical look
ClickHouse and Microsoft pitch the story as reducing friction and enabling “zero‑copy” analytics: ClickHouse queries the same governed OneLake tables used by Fabric workloads. The technical plumbing—Iceberg REST Catalog, format virtualization, and Entra‑based auth—exists and is documented. But several caveats matter for realistic planning:
- The OneLake Table APIs and some virtualization features are still in preview (with documented limitations) and therefore may behave differently across regions or tenant tiers. Early adopters should test the exact workspace and region combination they intend to use.
- Many efficiency and cost claims (e.g., headline storage savings from eliminating copies) are workload‑dependent. Real savings require careful FinOps modeling that includes cache behavior, query patterns, and storage tier selection. Treat vendor percentages as hypotheses to be validated in a proof of concept.
- ClickHouse’s initial release is read‑oriented; full write‑back and richer metadata operations are planned for the roadmap. If your architecture requires bidirectional metadata or transactional writes from ClickHouse to OneLake today, you’ll need to plan for a future upgrade.
Practical rollout checklist (recommended pilot approach)
- Define success metrics: latency SLOs, cost per TB scanned, and governance completeness.
- Select a representative dataset and workload that mirrors production query shapes.
- Validate OneLake Table API availability and preview/GA differences for your region and tenancy.
- Configure ClickHouse catalog with Entra ID credentials and run discovery tests (SHOW TABLES, simple counts).
- Measure cold vs warm cache performance, and model egress/IO costs.
- Run security and restore validation: confirm that metadata conversion does not break restore semantics and that RBAC/audit logs meet compliance.
- Iterate and harden runbooks, FinOps rules, and quota/alerting to prevent runaway scans on archival tables.
Roadmap signals and what to watch next
- ClickHouse indicates the integration will move from beta to GA, add write support, and introduce enhanced cloud console integration for OneLake in ClickHouse Cloud. These are the obvious next steps for full parity with two‑way lakehouse workflows.
- Microsoft will continue expanding Table API coverage, writing operations, and format virtualization enhancements. Track OneLake’s preview → GA transitions and the growth of supported API surface for Iceberg/Delta metadata writes.
- Third‑party ecosystem integrations (ETL vendors, streaming platforms, and data movement providers) are rapidly adding first‑class hooks into OneLake; their maturity will reduce operational friction for large organizations moving to this model. Expect more partner announcements and tighter FinOps tooling across Fabric and partner ecosystems.
Final assessment: strengths, risks, and strategic fit
The ClickHouse–OneLake integration is an important interoperability milestone for organizations standardizing on a lakehouse footprint while demanding low‑latency analytics. Its primary strengths are:
- Reduced data movement and centralized governance through OneLake’s catalog and access controls.
- Access to ClickHouse’s real‑time engine for high‑concurrency analytics on governed data.
- Open‑format interoperability (Iceberg, Delta, Parquet) that preserves engine choice and portability.
The principal risks to manage are:
- Preview status and feature gaps—OneLake Table APIs are in preview, and initial ClickHouse support is read‑oriented (beta), so expect staged rollouts and region/workspace nuance.
- Performance and cost tradeoffs when reading cold/archival storage or when scan-heavy queries are executed without careful pruning, caching, or pre‑aggregation.
- Governance and recovery semantics—treat vendor‑stated efficiencies as starting points for careful, tenant‑specific validation, including restore testing and least‑privilege enforcement.
ClickHouse’s technical post, Microsoft’s OneLake Table APIs documentation, and the ClickHouse/Microsoft press releases together paint a coherent picture: the plumbing is in place to let ClickHouse query OneLake Iceberg metadata today, with write support and deeper cloud UI integration planned in follow‑on releases. Organizations should use this opportunity to align their architecture, FinOps, security controls, and runbooks so that they can safely and efficiently exploit a multi‑engine lakehouse model.
Source: HPCwire ClickHouse Announces Microsoft OneLake Integration for Seamless Data Interoperability - BigDATAwire