Embedded Analytics for Data Lakehouses: The 2026 Architecture Guide

December 18, 2025
Rahul Pattamatta
CEO

The Data Lakehouse Revolution Meets Embedded Analytics

If you're building customer-facing analytics in 2026, you're likely facing a critical architectural decision: where should your analytics data live? Traditional data warehouses are expensive and rigid. Data lakes are cheap but lack query performance. The data lakehouse architecture promises the best of both worlds—but can it actually power real-time embedded analytics for your customers?

The answer is yes—but only with the right embedded analytics platform.

This post explores how modern data lakehouses (Databricks, Snowflake, AWS S3, Athena) have become the ideal foundation for embedded analytics, and why platforms like DataBrain are uniquely positioned to unlock their full potential for customer-facing use cases.

What you'll learn:

  • Why data lakehouses are transforming embedded analytics architecture
  • Which lakehouse platforms work best for different use cases
  • How to embed analytics directly from lakehouse data without ETL
  • Real implementation patterns with code examples
  • Performance considerations and optimization strategies
  • When to choose a lakehouse over traditional databases

Related Reading: If you're evaluating embedded analytics platforms, see our Why Self-Hosted Embedded Analytics Matter in 2025 guide and Multi-Tenancy Architecture Deep Dive.

What is a Data Lakehouse (and Why It Matters for Embedded Analytics)

A data lakehouse combines the low-cost, flexible storage of data lakes with the query performance and ACID transactions of data warehouses. Instead of maintaining separate systems for raw data (lake) and structured analytics (warehouse), you get a unified platform.

Traditional Architecture:

[Diagram: Traditional Architecture]

Lakehouse Architecture:

[Diagram: Lakehouse Architecture]

Key Characteristics of Data Lakehouses

  1. Open File Formats — Store data as Parquet, Delta, Iceberg (not proprietary formats)
  2. Metadata Layer — ACID transactions, schema enforcement, time travel
  3. SQL Query Engine — Fast analytics without moving data
  4. Separation of Storage and Compute — Scale independently
  5. Direct Access — Query data where it lives, no ETL required

Why this matters for embedded analytics:

When your product data already lives in a lakehouse, you can embed analytics without the traditional pain points:

  • No ETL pipelines to maintain
  • No data replication to vendor caches
  • No expensive warehouse licenses for analytics-only workloads
  • Query your source of truth directly
  • Scale to unlimited customers without data duplication
  • Control costs with on-demand compute

Data Lakehouse Platforms: Which One for Embedded Analytics?

Not all lakehouse platforms are created equal for embedded analytics. Here's how the major options compare:

1. Databricks (Unity Catalog)

Best for: Organizations already invested in Databricks for ML/AI workloads who want to embed analytics from the same platform

Architecture:

  • Storage: Delta Lake on cloud object storage (S3, ADLS, GCS)
  • Compute: Serverless SQL warehouses
  • Metadata: Unity Catalog with fine-grained access control

Embedded Analytics Advantages:

JavaScript
// DataBrain connects directly to Databricks
const token = await generateGuestToken({
  datasourceName: 'Databricks Production',
  clientId: 'acme_corp',
  // RLS filters applied at query time
  params: {
    rlsSettings: [{
      metricId: 'revenue-metric',
      values: {
        tenant_id: customer.organizationId
      }
    }]
  }
});

  • Unified governance - Same permissions for ML and analytics
  • Delta Lake performance - Optimized Parquet with indexing
  • Real-time streaming - Embed analytics on streaming data

Tradeoff:

  • Cost - Databricks compute can be expensive for high-concurrency analytics

Use Case: Fintech SaaS with ML-driven risk scoring. Embed customer dashboards showing model predictions and historical trends—all from the same Databricks workspace used for model training.

2. Snowflake

Best for: Enterprises needing zero-maintenance scaling and multi-cloud portability

Architecture:

  • Storage: Micro-partitioned tables (proprietary format on cloud storage)
  • Compute: Auto-scaling virtual warehouses
  • Metadata: Integrated catalog with cloning and time travel
  • Open Formats: Native Iceberg table support (announced 2024) for interoperability

Embedded Analytics Advantages:

Python
# Backend token generation (Python/FastAPI)
token = databrain_client.create_guest_token(
    datasource_name="Snowflake DWH",
    client_id=customer.id,
    # Snowflake's query acceleration for dashboards
    warehouse="CUSTOMER_ANALYTICS_WH"
)

  • Zero-ops scaling — Auto-suspend/resume for cost optimization
  • Data sharing — Share subsets with customers securely
  • Multi-region — Deploy close to customer data
  • Iceberg support — Query open-format tables alongside Snowflake native tables

Tradeoff:

  • Storage costs — More expensive than raw S3/ADLS

Use Case: B2B SaaS with 1,000+ customers. Use Snowflake's multi-cluster warehouses to isolate analytics compute per tier (free vs enterprise customers), with DataBrain handling the routing. Leverage Iceberg tables for data portability across cloud providers.

3. AWS S3 + DuckDB

Best for: Cost-conscious startups and analytics on semi-structured data (logs, CSVs, Parquet)

Architecture:

  • Storage: S3 buckets with Parquet/CSV files
  • Compute: DuckDB (embedded analytical database)
  • Metadata: File-based partitioning schemes

How DataBrain Enables This:

DataBrain includes a built-in DuckDB integration that:

  1. Reads Parquet/CSV directly from S3
  2. Builds local indexes for fast queries
  3. Handles incremental syncs as new files arrive

TypeScript
// DataBrain automatically manages DuckDB for S3 sources
// Credentials are encrypted at rest and in transit
await connection.run(`INSTALL httpfs;`);
await connection.run(`LOAD httpfs;`);

// Configure S3 access; values come from DataBrain's encrypted credential store
await connection.run(`SET s3_region = '${region}';`);
await connection.run(`SET s3_access_key_id = '${accessKeyId}';`);
await connection.run(`SET s3_secret_access_key = '${secretAccessKey}';`);

// Query Parquet directly from S3 with tenant isolation
const query = `
  CREATE TABLE sales AS
  SELECT * FROM read_parquet(?)
  WHERE tenant_id = ?;
`;
await connection.run(query, [s3Path, tenantId]);

  • Lowest cost — Pay only for S3 storage (~$0.023/GB/month)
  • Flexible schema — Handle JSON, CSV, Parquet interchangeably
  • No vendor lock-in — Open formats, portable compute

Tradeoff:

  • Concurrency limits — DuckDB is single-process (use sharding for scale)

Use Case: Log analytics SaaS. Ingest customer application logs as Parquet files to S3, partition by customer_id, embed dashboards querying last 30 days of data.
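
To make that use case concrete, here is a hedged sketch in the same style as the snippet above: it reads one tenant's last 30 days of logs from hive-partitioned Parquet on S3. The bucket name, column names, and partition layout are assumptions for illustration.

TypeScript
// Sketch: read one customer's recent logs from hive-partitioned Parquet on S3.
// Assumed layout: s3://<bucket>/logs/customer_id=.../date=.../*.parquet
const logQuery = `
  SELECT date, COUNT(*) AS events
  FROM read_parquet('s3://your-log-bucket/logs/customer_id=*/date=*/*.parquet',
                    hive_partitioning = true)
  WHERE customer_id = ?
    AND date >= CAST(CURRENT_DATE - 30 AS VARCHAR)  -- last 30 daily partitions
  GROUP BY date
  ORDER BY date;
`;

await connection.run(logQuery, [customerId]);

Because customer_id and date are partition columns, DuckDB can skip every file outside the matching customer's recent partitions instead of scanning the whole bucket.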

4. Amazon Athena

Best for: AWS-native stacks with existing S3 data lakes, minimal infrastructure management

Architecture:

  • Storage: S3 data lake with Glue Data Catalog
  • Compute: Serverless Presto engine
  • Metadata: AWS Glue for schema definitions

Embedded Analytics Advantages:

JavaScript
// DataBrain connects to Athena like any SQL database
const token = await createGuestToken({
  datasourceName: 'Athena Data Lake',
  clientId: customer.id,
  // Partition pruning for performance
  params: {
    appFilters: [{
      metricId: 'revenue-metric',
      values: { 
        date_partition: '2025-12', 
        customer_id: customer.id 
      }
    }]
  }
});

  • True serverless — No clusters to manage
  • AWS ecosystem — Integrates with IAM, CloudWatch, QuickSight
  • Cost-effective — Pay per query ($5 per TB scanned)

Tradeoff:

  • Query latency — Cold starts can be 3-5 seconds

Use Case: E-commerce platform with clickstream data in S3. Use Athena to query user behavior across all customers, embed filtered dashboards showing each merchant's traffic patterns.

5. Trino (formerly PrestoSQL)

Best for: Querying multiple data sources (lakehouses, databases, APIs) from a single analytics interface

Architecture:

  • Storage: Federates across Hive, Delta Lake, Iceberg, PostgreSQL, etc.
  • Compute: Distributed SQL engine
  • Metadata: Connectors to various catalog systems

SQL
-- Trino's superpower: join across lakehouse and operational DB
SELECT 
  orders.customer_id,
  SUM(orders.amount) AS revenue,
  customers.plan_tier
FROM delta.production.orders        -- Delta Lake on S3
JOIN postgres.users.customers       -- Live PostgreSQL
  ON orders.customer_id = customers.id
WHERE customers.tenant_id = 'acme_corp'
GROUP BY orders.customer_id, customers.plan_tier;

Use Case: Logistics SaaS with shipment data in Delta Lake and customer profiles in PostgreSQL. Use Trino to join operational and analytical data in a single embedded dashboard.

Platform-Specific Setup Guides: For detailed configuration instructions, check our complete datasource documentation.

Why Embedded Analytics on Lakehouses Was Broken (Until Now)

Traditional embedded analytics tools (Tableau, Looker, Metabase embedded) were built for data warehouse architectures. When you try to use them with lakehouses, three problems emerge:

Problem 1: Query Performance Mismatch

Warehouse-era tools cache aggressively because they assume expensive query costs. But lakehouse queries on Delta/Iceberg with proper partitioning are cheap and fast. Excessive caching adds latency and staleness without benefit.

DataBrain's approach: Query directly with intelligent result streaming, not full result caching.

Problem 2: Multi-Tenant Access Control

Warehouse-era tools use database roles for permissions. But lakehouses use table ACLs (Unity Catalog), IAM policies (AWS), or RLS at the file level (Iceberg). Your analytics tool needs to understand these paradigms.

DataBrain's approach:

  • Generate guest tokens with tenant context
  • Apply row-level security filters dynamically
  • Route queries through appropriate lakehouse access layers

Deep Dive: Read our comprehensive guide on The Multi-Tenancy Problem in Embedded Analytics to understand DataBrain's 4-level tenancy architecture (datasource, database, schema, table) and when to use each level for optimal data isolation.

JavaScript
// Backend: Generate token with lakehouse-aware RLS
const response = await fetch('https://api.usedatabrain.com/api/v2/guest-token/create', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${DATABRAIN_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    clientId: customer.organizationId,
    dataAppName: 'Customer Analytics',
    datasourceName: 'Databricks Unity',
    // RLS filters applied to ALL queries
    params: {
      rlsSettings: [{
        metricId: 'revenue-metric',
        values: {
          tenant_id: customer.organizationId
        }
      }]
    }
  })
});

const { token } = await response.json();

Problem 3: Cost Explosion from Data Replication

Warehouse-era tools replicate your data into their own cache. For lakehouses with petabyte-scale storage, this is economically nonsensical—you're paying twice for the same data.

DataBrain's approach:

  • Never cache business data (only dashboard metadata)
  • Query results are ephemeral
  • Leverage lakehouse's own caching layers (Photon, Snowflake result cache)

Implementation Guide: Embed Lakehouse Analytics in 4 Steps

Let's walk through a real example: embedding Databricks analytics into a React SaaS application.

Step 1: Configure Your Lakehouse in DataBrain

DataBrain Dashboard → Data Sources → Add Datasource

For Databricks:

JavaScript
// Connection details (stored encrypted in DataBrain)
{
  "serverHostname": "dbc-abc123-def456.cloud.databricks.com",
  "httpPath": "/sql/1.0/warehouses/xyz789",
  "token": "dapi***", // Personal access token
  "catalog": "production",  // Unity Catalog
  "schema": "analytics"
}

For AWS S3 + DuckDB:
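
The S3 + DuckDB form asks for the bucket location and credentials. The snippet below is only an illustrative sketch; the field names are assumptions, not the literal DataBrain connection schema.

JavaScript
// Illustrative S3 + DuckDB connection details (field names are assumptions; stored encrypted)
{
  "bucket": "acme-analytics-data",
  "region": "us-east-1",
  "accessKeyId": "AKIA***",
  "secretAccessKey": "***",
  "pathPrefix": "orders/",   // folder containing Parquet/CSV files
  "fileFormat": "parquet"
}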

Need Help Adding Your First Datasource?

If you're new to adding a datasource in DataBrain, our step-by-step setup guides walk you through each connection type. Use them to ensure your datasource is configured correctly for your architecture and security needs.

Step 2: Build Your Dashboard (No-Code or SQL)

Create metrics using SQL against your lakehouse:

SQL
-- Revenue by month metric
SELECT 
  DATE_TRUNC('month', order_date) AS month,
  SUM(order_total) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM analytics.orders
WHERE tenant_id = {{tenant_id}}  -- Dynamic filter
GROUP BY month
ORDER BY month DESC;

DataBrain automatically:

  • Validates SQL against your lakehouse schema
  • Generates chart visualizations (bar, line, pie, etc.)
  • Handles date filters, grouping, aggregations
  • Optimizes queries with pushdown filters

New to dashboard creation? See the step-by-step guides in the DataBrain documentation.

Step 3: Backend Integration — Generate Guest Tokens

Your backend API generates short-lived tokens that:

  • Identify the customer (tenant isolation)
  • Specify which lakehouse datasource to query
  • Apply row-level security filters

Node.js / Express Example:

JavaScript
// routes/analytics.js
const express = require('express');
const axios = require('axios');
const router = express.Router();

router.get('/dashboard-token', authenticateUser, async (req, res) => {
  try {
    // Your auth middleware provides req.user
    const { organizationId, region } = req.user;

    const response = await axios.post(
      'https://api.usedatabrain.com/api/v2/guest-token/create',
      {
        clientId: organizationId,
        dataAppName: 'Customer Analytics',
        datasourceName: 'Databricks Unity',  // Your lakehouse
        expiryTime: 3600000,  // 1 hour
        params: {
          rlsSettings: [{
            metricId: 'all',  // Apply to all metrics
            values: {
              tenant_id: organizationId
            }
          }]
        }
      },
      {
        headers: {
          'Authorization': `Bearer ${process.env.DATABRAIN_API_KEY}`,
          'Content-Type': 'application/json'
        }
      }
    );

    res.json({ token: response.data.token });
  } catch (error) {
    console.error('Token generation failed:', error);
    res.status(500).json({ error: 'Failed to load analytics' });
  }
});

Python / FastAPI Example:

Python
from fastapi import FastAPI, Depends, HTTPException
from fastapi.security import OAuth2PasswordBearer
import httpx
import os

app = FastAPI()
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

@app.get("/api/dashboard-token")
async def get_dashboard_token(token: str = Depends(oauth2_scheme)):
    # Verify user token and get organization context
    user = await verify_user_token(token)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.usedatabrain.com/api/v2/guest-token/create",
            json={
                "clientId": user.organization_id,
                "dataAppName": "Customer Analytics",
                "datasourceName": "Snowflake DWH",
                "expiryTime": 3600000,
                "params": {
                    "rlsSettings": [{
                        "metricId": "all",
                        "values": {
                            "tenant_id": user.organization_id
                        }
                    }]
                }
            },
            headers={
                "Authorization": f"Bearer {os.environ['DATABRAIN_API_KEY']}",
                "Content-Type": "application/json"
            }
        )

        if response.status_code != 200:
            raise HTTPException(
                status_code=500,
                detail="Token generation failed"
            )

        return response.json()

Step 4: Frontend Integration — Embed the Dashboard

React Component:
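
The React version isn't shown above, so here is a minimal sketch. It wraps the same <dbn-dashboard> web component used in the Vue example below and fetches a token from the same /api/dashboard-token endpoint; the JSX type declaration and prop names mirror that example and are assumptions rather than the official React API.

TypeScript
// DashboardEmbed.tsx: illustrative React wrapper around the DataBrain web component
import { useEffect, useState } from 'react';
import '@databrainhq/plugin/web'; // registers <dbn-dashboard>

// Let TypeScript accept the custom element in JSX (illustrative typing)
declare global {
  namespace JSX {
    interface IntrinsicElements {
      'dbn-dashboard': any;
    }
  }
}

export function DashboardEmbed() {
  const [token, setToken] = useState<string | null>(null);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    // Fetch a short-lived guest token from your backend (see Step 3)
    fetch('/api/dashboard-token', { credentials: 'include' })
      .then((res) => {
        if (!res.ok) throw new Error('Failed to load analytics');
        return res.json();
      })
      .then((data) => setToken(data.token))
      .catch((err) => setError(err instanceof Error ? err.message : 'Unknown error'));
  }, []);

  if (error) return <div className="error-message">{error}</div>;
  if (!token) return <div className="loading-spinner">Loading analytics...</div>;

  return (
    <dbn-dashboard
      token={token}
      dashboard-id="customer-analytics"
      theme="light"
    />
  );
}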

Vue 3 Example:

Vue
<template>
  <div class="dashboard-container">
    <div v-if="loading" class="loading-spinner">
      Loading analytics...
    </div>
    <div v-else-if="error" class="error-message">
      {{ error }}
    </div>
    <dbn-dashboard 
      v-else
      :token="token"
      dashboard-id="customer-analytics"
      theme="light"
      enable-download-csv
    />
  </div>
</template>

<script setup lang="ts">
import { ref, onMounted } from 'vue';
import '@databrainhq/plugin/web';

const token = ref<string | null>(null);
const loading = ref(true);
const error = ref<string | null>(null);

onMounted(async () => {
  try {
    const response = await fetch('/api/dashboard-token', {
      credentials: 'include'
    });
    
    if (!response.ok) throw new Error('Failed to load analytics');
    
    const data = await response.json();
    token.value = data.token;
  } catch (err) {
    error.value = err instanceof Error ? err.message : 'Unknown error';
  } finally {
    loading.value = false;
  }
});
</script>

Performance Optimization for Lakehouse Analytics

Embedding analytics on a lakehouse requires understanding lakehouse-specific optimizations:

1. Partition Pruning

Structure your lakehouse data with partition columns:

Text
-- Good: Partitioned by tenant and date
s3://bucket/orders/
  tenant_id=acme/
    year=2025/
      month=12/
        data.parquet

DataBrain automatically includes tenant filters:

SQL
-- Your metric SQL
SELECT SUM(revenue)
FROM orders
WHERE tenant_id = {{tenant_id}}

-- Lakehouse only scans:
-- s3://bucket/orders/tenant_id=acme/
-- Ignores all other partitions → 100x faster

2. Delta Lake / Iceberg Optimizations

Enable Z-ordering on frequently filtered columns:

SQL
-- Databricks Delta Lake
OPTIMIZE orders
ZORDER BY (tenant_id, order_date);

Result: Queries filtering by tenant_id read 10-50x less data.

3. Lakehouse Result Caching

  • Snowflake: Result cache automatically shares query results across users viewing the same dashboard.
  • Databricks: Photon engine caches intermediate results for 24 hours.
  • DataBrain leverages these by generating consistent SQL queries—same metric for different users with different RLS produces cache-friendly queries.

4. Incremental Metrics

For large historical datasets, use incremental aggregation:

SQL
-- Instead of scanning all history every time
SELECT
  DATE(order_date) AS day,
  SUM(revenue)
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY day;

DataBrain's dynamic date filters automatically apply relative date ranges, ensuring queries stay performant as data grows.

Cost Comparison: Lakehouse vs Traditional Warehouse for Embedded Analytics

Scenario: 500 customers, each viewing dashboards 10x/month, 1TB of analytical data

  • Traditional DWH (Snowflake medium): $80 storage + $800 compute (always-on warehouse) = $880/month
  • Lakehouse (S3 + Databricks serverless): $23 storage (S3) + $120 compute (on-demand) = $143/month
  • Savings: 84% reduction

Why the difference:

  • Storage: S3 is ~$0.023/GB vs Snowflake's ~$40/TB (with compression)
  • Compute: Serverless SQL only runs during actual queries
  • No vendor cache: Not paying to store data twice
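
As a quick sanity check on the 84% figure, here is the same arithmetic as a short TypeScript sketch; all numbers come from the comparison above.

TypeScript
// Recompute the monthly totals and savings from the line items above (USD/month)
const warehouse = { storage: 80, compute: 800 }; // Snowflake medium, always-on warehouse
const lakehouse = { storage: 23, compute: 120 }; // S3 + Databricks serverless, on-demand

const warehouseTotal = warehouse.storage + warehouse.compute; // 880
const lakehouseTotal = lakehouse.storage + lakehouse.compute; // 143

const savingsPct = Math.round((1 - lakehouseTotal / warehouseTotal) * 100);
console.log(`Lakehouse saves ~${savingsPct}%`); // ~84%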

When to Choose Lakehouse Over Traditional Database

Choose a lakehouse when:

  • You have >100GB of analytical data
  • Data arrives in batches (hourly/daily) rather than real-time transactions
  • You need to query semi-structured data (JSON, logs, events)
  • Cost optimization is critical (startup or high data volume)
  • You want to avoid vendor lock-in (open formats)

Choose traditional databases when:

  • You need sub-second updates (operational dashboards)
  • Data is small and queries are simple
  • You're already paying for a warehouse and it performs well
  • Team lacks lakehouse expertise

Best of both worlds: Use DataBrain's multi-datasource support to combine:

  • Operational DB (PostgreSQL) for real-time metrics
  • Lakehouse (Databricks) for historical trends
  • Single embedded dashboard joins both

Next Steps: Start Embedding Lakehouse Analytics

Ready to embed analytics from your lakehouse?

Make analytics your competitive advantage

Get in touch with us and see how DataBrain can take your customer-facing analytics to the next level.
