Embedded Analytics for Data Lakehouses: The 2026 Architecture Guide

December 18, 2025
Rahul Pattamatta
CEO

The Data Lakehouse Revolution Meets Embedded Analytics

If you're building customer-facing analytics in 2026, you're likely facing a critical architectural decision: where should your analytics data live? Traditional data warehouses are expensive and rigid. Data lakes are cheap but lack query performance. The data lakehouse architecture promises the best of both worlds—but can it actually power real-time embedded analytics for your customers?

The answer is yes—but only with the right embedded analytics platform.

This post explores how modern data lakehouses (Databricks, Snowflake, AWS S3, Athena) have become the ideal foundation for embedded analytics, and why platforms like DataBrain are uniquely positioned to unlock their full potential for customer-facing use cases.

What you'll learn:

  • Why data lakehouses are transforming embedded analytics architecture
  • Which lakehouse platforms work best for different use cases
  • How to embed analytics directly from lakehouse data without ETL
  • Real implementation patterns with code examples
  • Performance considerations and optimization strategies
  • When to choose a lakehouse over traditional databases

Related Reading: If you're evaluating embedded analytics platforms, see our Why Self-Hosted Embedded Analytics Matter in 2025 guide and Multi-Tenancy Architecture Deep Dive.

What is a Data Lakehouse (and Why It Matters for Embedded Analytics)

A data lakehouse combines the low-cost, flexible storage of data lakes with the query performance and ACID transactions of data warehouses. Instead of maintaining separate systems for raw data (lake) and structured analytics (warehouse), you get a unified platform.

Traditional Architecture:

[Diagram: Traditional Architecture]

Lakehouse Architecture:

[Diagram: Lakehouse Architecture]

Key Characteristics of Data Lakehouses

  1. Open File Formats — Store data as Parquet, Delta, Iceberg (not proprietary formats)
  2. Metadata Layer — ACID transactions, schema enforcement, time travel
  3. SQL Query Engine — Fast analytics without moving data
  4. Separation of Storage and Compute — Scale independently
  5. Direct Access — Query data where it lives, no ETL required

Why this matters for embedded analytics:

When your product data already lives in a lakehouse, you can embed analytics without the traditional pain points:

  • No ETL pipelines to maintain
  • No data replication to vendor caches
  • No expensive warehouse licenses for analytics-only workloads
  • Query your source of truth directly
  • Scale to unlimited customers without data duplication
  • Control costs with on-demand compute

Data Lakehouse Platforms: Which One for Embedded Analytics?

Not all lakehouse platforms are created equal for embedded analytics. Here's how the major options compare:

1. Databricks (Unity Catalog)

Best for: Organizations already invested in Databricks for ML/AI workloads who want to embed analytics from the same platform

Architecture:

  • Storage: Delta Lake on cloud object storage (S3, ADLS, GCS)
  • Compute: Serverless SQL warehouses
  • Metadata: Unity Catalog with fine-grained access control

Embedded Analytics Advantages:

JavaScript
// DataBrain connects directly to Databricks
const token = await generateGuestToken({
  datasourceName: 'Databricks Production',
  clientId: 'acme_corp',
  // RLS filters applied at query time
  params: {
    rlsSettings: [{
      metricId: 'revenue-metric',
      values: {
        tenant_id: customer.organizationId
      }
    }]
  }
});

  • Unified governance - Same permissions for ML and analytics
  • Delta Lake performance - Optimized Parquet with indexing
  • Real-time streaming - Embed analytics on streaming data

Tradeoff:

  • Cost - Databricks compute can be expensive for high-concurrency analytics

Use Case: Fintech SaaS with ML-driven risk scoring. Embed customer dashboards showing model predictions and historical trends—all from the same Databricks workspace used for model training.

2. Snowflake

Best for: Enterprises needing zero-maintenance scaling and multi-cloud portability

Architecture:

  • Storage: Micro-partitioned tables (proprietary format on cloud storage)
  • Compute: Auto-scaling virtual warehouses
  • Metadata: Integrated catalog with cloning and time travel
  • Open Formats: Native Iceberg table support (announced 2024) for interoperability

Embedded Analytics Advantages:

Python
# Backend token generation (Python/FastAPI)
token = databrain_client.create_guest_token(
    datasource_name="Snowflake DWH",
    client_id=customer.id,
    # Snowflake's query acceleration for dashboards
    warehouse="CUSTOMER_ANALYTICS_WH"
)

  • Zero-ops scaling — Auto-suspend/resume for cost optimization
  • Data sharing — Share subsets with customers securely
  • Multi-region — Deploy close to customer data
  • Iceberg support — Query open-format tables alongside Snowflake native tables

Tradeoff:

  • Storage costs — More expensive than raw S3/ADLS

Use Case: B2B SaaS with 1,000+ customers. Use Snowflake's multi-cluster warehouses to isolate analytics compute per tier (free vs enterprise customers), with DataBrain handling the routing. Leverage Iceberg tables for data portability across cloud providers.

3. AWS S3 + DuckDB

Best for: Cost-conscious startups and analytics on semi-structured data (logs, CSVs, Parquet)

Architecture:

  • Storage: S3 buckets with Parquet/CSV files
  • Compute: DuckDB (embedded analytical database)
  • Metadata: File-based partitioning schemes

How DataBrain Enables This:

DataBrain includes a built-in DuckDB integration that:

  1. Reads Parquet/CSV directly from S3
  2. Builds local indexes for fast queries
  3. Handles incremental syncs as new files arrive

TypeScript
// DataBrain automatically manages DuckDB for S3 sources
// Credentials are encrypted at rest and in transit
await connection.run(`INSTALL httpfs;`);
await connection.run(`LOAD httpfs;`);

// Configure S3 access; values come from DataBrain's encrypted credential store
await connection.run(`SET s3_region = '${region}';`);
await connection.run(`SET s3_access_key_id = '${accessKeyId}';`);
await connection.run(`SET s3_secret_access_key = '${secretAccessKey}';`);

// Query Parquet directly from S3 with tenant isolation
const query = `
  CREATE TABLE sales AS
  SELECT * FROM read_parquet(?)
  WHERE tenant_id = ?;
`;
await connection.run(query, [s3Path, tenantId]);

  • Lowest cost — Pay only for S3 storage (~$0.023/GB/month)
  • Flexible schema — Handle JSON, CSV, Parquet interchangeably
  • No vendor lock-in — Open formats, portable compute

Tradeoff:

  • Concurrency limits — DuckDB is single-process (use sharding for scale)

Use Case: Log analytics SaaS. Ingest customer application logs as Parquet files to S3, partition by customer_id, embed dashboards querying last 30 days of data.
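
To make that use case concrete, here is a hedged sketch in the same style as the snippet above: it reads one tenant's last 30 days of logs from hive-partitioned Parquet on S3. The bucket name, column names, and partition layout are assumptions for illustration.

TypeScript
// Sketch: read one customer's recent logs from hive-partitioned Parquet on S3.
// Assumed layout: s3://<bucket>/logs/customer_id=.../date=.../*.parquet
const logQuery = `
  SELECT date, COUNT(*) AS events
  FROM read_parquet('s3://your-log-bucket/logs/customer_id=*/date=*/*.parquet',
                    hive_partitioning = true)
  WHERE customer_id = ?
    AND date >= CAST(CURRENT_DATE - 30 AS VARCHAR)  -- last 30 daily partitions
  GROUP BY date
  ORDER BY date;
`;

await connection.run(logQuery, [customerId]);

Because customer_id and date are partition columns, DuckDB can skip every file outside the matching customer's recent partitions instead of scanning the whole bucket.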

4. Amazon Athena

Best for: AWS-native stacks with existing S3 data lakes, minimal infrastructure management

Architecture:

  • Storage: S3 data lake with Glue Data Catalog
  • Compute: Serverless Presto engine
  • Metadata: AWS Glue for schema definitions

Embedded Analytics Advantages:

JavaScript
// DataBrain connects to Athena like any SQL database
const token = await createGuestToken({
  datasourceName: 'Athena Data Lake',
  clientId: customer.id,
  // Partition pruning for performance
  params: {
    appFilters: [{
      metricId: 'revenue-metric',
      values: { 
        date_partition: '2025-12', 
        customer_id: customer.id 
      }
    }]
  }
});

  • True serverless — No clusters to manage
  • AWS ecosystem — Integrates with IAM, CloudWatch, QuickSight
  • Cost-effective — Pay per query ($5 per TB scanned)

Tradeoff:

  • Query latency — Cold starts can be 3-5 seconds

Use Case: E-commerce platform with clickstream data in S3. Use Athena to query user behavior across all customers, embed filtered dashboards showing each merchant's traffic patterns.

5. Trino (formerly PrestoSQL)

Best for: Querying multiple data sources (lakehouses, databases, APIs) from a single analytics interface

Architecture:

  • Storage: Federates across Hive, Delta Lake, Iceberg, PostgreSQL, etc.
  • Compute: Distributed SQL engine
  • Metadata: Connectors to various catalog systems

SQL
-- Trino's superpower: join across lakehouse and operational DB
SELECT 
  orders.customer_id,
  SUM(orders.amount) AS revenue,
  customers.plan_tier
FROM delta.production.orders        -- Delta Lake on S3
JOIN postgres.users.customers       -- Live PostgreSQL
  ON orders.customer_id = customers.id
WHERE customers.tenant_id = 'acme_corp'
GROUP BY orders.customer_id, customers.plan_tier;

Use Case: Logistics SaaS with shipment data in Delta Lake and customer profiles in PostgreSQL. Use Trino to join operational and analytical data in a single embedded dashboard.

Platform-Specific Setup Guides: For detailed configuration instructions, check our complete datasource documentation.

Why Embedded Analytics on Lakehouses Was Broken (Until Now)

Traditional embedded analytics tools (Tableau, Looker, Metabase embedded) were built for data warehouse architectures. When you try to use them with lakehouses, three problems emerge:

Problem 1: Query Performance Mismatch

Warehouse-era tools cache aggressively because they assume expensive query costs. But lakehouse queries on Delta/Iceberg with proper partitioning are cheap and fast. Excessive caching adds latency and staleness without benefit.

DataBrain's approach: Query directly with intelligent result streaming, not full result caching.

Problem 2: Multi-Tenant Access Control

Warehouse-era tools use database roles for permissions. But lakehouses use table ACLs (Unity Catalog), IAM policies (AWS), or RLS at the file level (Iceberg). Your analytics tool needs to understand these paradigms.

DataBrain's approach:

  • Generate guest tokens with tenant context
  • Apply row-level security filters dynamically
  • Route queries through appropriate lakehouse access layers

Deep Dive: Read our comprehensive guide on The Multi-Tenancy Problem in Embedded Analytics to understand DataBrain's 4-level tenancy architecture (datasource, database, schema, table) and when to use each level for optimal data isolation.

JavaScript
// Backend: Generate token with lakehouse-aware RLS
const response = await fetch('https://api.usedatabrain.com/api/v2/guest-token/create', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${DATABRAIN_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    clientId: customer.organizationId,
    dataAppName: 'Customer Analytics',
    datasourceName: 'Databricks Unity',
    // RLS filters applied to ALL queries
    params: {
      rlsSettings: [{
        metricId: 'revenue-metric',
        values: {
          tenant_id: customer.organizationId
        }
      }]
    }
  })
});

const { token } = await response.json();

Problem 3: Cost Explosion from Data Replication

Warehouse-era tools replicate your data into their own cache. For lakehouses with petabyte-scale storage, this is economically nonsensical—you're paying twice for the same data.

DataBrain's approach:

  • Never cache business data (only dashboard metadata)
  • Query results are ephemeral
  • Leverage lakehouse's own caching layers (Photon, Snowflake result cache)

Implementation Guide: Embed Lakehouse Analytics in 4 Steps

Let's walk through a real example: embedding Databricks analytics into a React SaaS application.

Step 1: Configure Your Lakehouse in DataBrain

DataBrain Dashboard → Data Sources → Add Datasource

For Databricks:

JavaScript
// Connection details (stored encrypted in DataBrain)
{
  "serverHostname": "dbc-abc123-def456.cloud.databricks.com",
  "httpPath": "/sql/1.0/warehouses/xyz789",
  "token": "dapi***", // Personal access token
  "catalog": "production",  // Unity Catalog
  "schema": "analytics"
}

For AWS S3 + DuckDB:
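
The S3 + DuckDB form asks for the bucket location and credentials. The snippet below is only an illustrative sketch; the field names are assumptions, not the literal DataBrain connection schema.

JavaScript
// Illustrative S3 + DuckDB connection details (field names are assumptions; stored encrypted)
{
  "bucket": "acme-analytics-data",
  "region": "us-east-1",
  "accessKeyId": "AKIA***",
  "secretAccessKey": "***",
  "pathPrefix": "orders/",   // folder containing Parquet/CSV files
  "fileFormat": "parquet"
}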

Need Help Adding Your First Datasource?

If you're new to adding a datasource in DataBrain, our step-by-step setup guides walk you through each connection type. Use them to ensure your datasource is configured correctly for your architecture and security needs.

Step 2: Build Your Dashboard (No-Code or SQL)

Create metrics using SQL against your lakehouse:

SQL
-- Revenue by month metric
SELECT 
  DATE_TRUNC('month', order_date) AS month,
  SUM(order_total) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM analytics.orders
WHERE tenant_id = {{tenant_id}}  -- Dynamic filter
GROUP BY month
ORDER BY month DESC;

DataBrain automatically:

  • Validates SQL against your lakehouse schema
  • Generates chart visualizations (bar, line, pie, etc.)
  • Handles date filters, grouping, aggregations
  • Optimizes queries with pushdown filters

New to dashboard creation? See the step-by-step guides in the DataBrain documentation.

Step 3: Backend Integration — Generate Guest Tokens

Your backend API generates short-lived tokens that:

  • Identify the customer (tenant isolation)
  • Specify which lakehouse datasource to query
  • Apply row-level security filters

Node.js / Express Example:

JavaScript
// routes/analytics.js
const express = require('express');
const axios = require('axios');
const router = express.Router();

router.get('/dashboard-token', authenticateUser, async (req, res) => {
  try {
    // Your auth middleware provides req.user
    const { organizationId, region } = req.user;

    const response = await axios.post(
      'https://api.usedatabrain.com/api/v2/guest-token/create',
      {
        clientId: organizationId,
        dataAppName: 'Customer Analytics',
        datasourceName: 'Databricks Unity',  // Your lakehouse
        expiryTime: 3600000,  // 1 hour
        params: {
          rlsSettings: [{
            metricId: 'all',  // Apply to all metrics
            values: {
              tenant_id: organizationId
            }
          }]
        }
      },
      {
        headers: {
          'Authorization': `Bearer ${process.env.DATABRAIN_API_KEY}`,
          'Content-Type': 'application/json'
        }
      }
    );

    res.json({ token: response.data.token });
  } catch (error) {
    console.error('Token generation failed:', error);
    res.status(500).json({ error: 'Failed to load analytics' });
  }
});

Python / FastAPI Example:

Python
from fastapi import FastAPI, Depends, HTTPException
from fastapi.security import OAuth2PasswordBearer
import httpx
import os

app = FastAPI()
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

@app.get("/api/dashboard-token")
async def get_dashboard_token(token: str = Depends(oauth2_scheme)):
    # Verify user token and get organization context
    user = await verify_user_token(token)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.usedatabrain.com/api/v2/guest-token/create",
            json={
                "clientId": user.organization_id,
                "dataAppName": "Customer Analytics",
                "datasourceName": "Snowflake DWH",
                "expiryTime": 3600000,
                "params": {
                    "rlsSettings": [{
                        "metricId": "all",
                        "values": {
                            "tenant_id": user.organization_id
                        }
                    }]
                }
            },
            headers={
                "Authorization": f"Bearer {os.environ['DATABRAIN_API_KEY']}",
                "Content-Type": "application/json"
            }
        )

        if response.status_code != 200:
            raise HTTPException(
                status_code=500,
                detail="Token generation failed"
            )

        return response.json()

Step 4: Frontend Integration — Embed the Dashboard

React Component:
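
The React version isn't shown above, so here is a minimal sketch. It wraps the same <dbn-dashboard> web component used in the Vue example below and fetches a token from the same /api/dashboard-token endpoint; the JSX type declaration and prop names mirror that example and are assumptions rather than the official React API.

TypeScript
// DashboardEmbed.tsx: illustrative React wrapper around the DataBrain web component
import { useEffect, useState } from 'react';
import '@databrainhq/plugin/web'; // registers <dbn-dashboard>

// Let TypeScript accept the custom element in JSX (illustrative typing)
declare global {
  namespace JSX {
    interface IntrinsicElements {
      'dbn-dashboard': any;
    }
  }
}

export function DashboardEmbed() {
  const [token, setToken] = useState<string | null>(null);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    // Fetch a short-lived guest token from your backend (see Step 3)
    fetch('/api/dashboard-token', { credentials: 'include' })
      .then((res) => {
        if (!res.ok) throw new Error('Failed to load analytics');
        return res.json();
      })
      .then((data) => setToken(data.token))
      .catch((err) => setError(err instanceof Error ? err.message : 'Unknown error'));
  }, []);

  if (error) return <div className="error-message">{error}</div>;
  if (!token) return <div className="loading-spinner">Loading analytics...</div>;

  return (
    <dbn-dashboard
      token={token}
      dashboard-id="customer-analytics"
      theme="light"
    />
  );
}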

Vue 3 Example:

Vue
<template>
  <div class="dashboard-container">
    <div v-if="loading" class="loading-spinner">
      Loading analytics...
    </div>
    <div v-else-if="error" class="error-message">
      {{ error }}
    </div>
    <dbn-dashboard 
      v-else
      :token="token"
      dashboard-id="customer-analytics"
      theme="light"
      enable-download-csv
    />
  </div>
</template>

<script setup lang="ts">
import { ref, onMounted } from 'vue';
import '@databrainhq/plugin/web';

const token = ref<string | null>(null);
const loading = ref(true);
const error = ref<string | null>(null);

onMounted(async () => {
  try {
    const response = await fetch('/api/dashboard-token', {
      credentials: 'include'
    });
    
    if (!response.ok) throw new Error('Failed to load analytics');
    
    const data = await response.json();
    token.value = data.token;
  } catch (err) {
    error.value = err instanceof Error ? err.message : 'Unknown error';
  } finally {
    loading.value = false;
  }
});
</script>

Performance Optimization for Lakehouse Analytics

Embedding analytics on a lakehouse requires understanding lakehouse-specific optimizations:

1. Partition Pruning

Structure your lakehouse data with partition columns:

Text
-- Good: Partitioned by tenant and date
s3://bucket/orders/
  tenant_id=acme/
    year=2025/
      month=12/
        data.parquet

DataBrain automatically includes tenant filters:

SQL
-- Your metric SQL
SELECT SUM(revenue)
FROM orders
WHERE tenant_id = {{tenant_id}}

-- Lakehouse only scans:
-- s3://bucket/orders/tenant_id=acme/
-- Ignores all other partitions → 100x faster

2. Delta Lake / Iceberg Optimizations

Enable Z-ordering on frequently filtered columns:

SQL
-- Databricks Delta Lake
OPTIMIZE orders
ZORDER BY (tenant_id, order_date);

Result: Queries filtering by tenant_id read 10-50x less data.

3. Lakehouse Result Caching

  • Snowflake: Result cache automatically shares query results across users viewing the same dashboard.
  • Databricks: Photon engine caches intermediate results for 24 hours.
  • DataBrain leverages these by generating consistent SQL queries—same metric for different users with different RLS produces cache-friendly queries.

4. Incremental Metrics

For large historical datasets, use incremental aggregation:

SQL
-- Instead of scanning all history every time
SELECT
  DATE(order_date) AS day,
  SUM(revenue)
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY day;

DataBrain's dynamic date filters automatically apply relative date ranges, ensuring queries stay performant as data grows.

Cost Comparison: Lakehouse vs Traditional Warehouse for Embedded Analytics

Scenario: 500 customers, each viewing dashboards 10x/month, 1TB of analytical data

  • Traditional DWH (Snowflake medium): $80 storage + $800 compute (always-on warehouse) = $880/month
  • Lakehouse (S3 + Databricks serverless): $23 storage (S3) + $120 compute (on-demand) = $143/month
  • Savings: 84% reduction

Why the difference:

  • Storage: S3 is ~$0.023/GB vs Snowflake's ~$40/TB (with compression)
  • Compute: Serverless SQL only runs during actual queries
  • No vendor cache: Not paying to store data twice
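
As a quick sanity check on the 84% figure, here is the same arithmetic as a short TypeScript sketch; all numbers come from the comparison above.

TypeScript
// Recompute the monthly totals and savings from the line items above (USD/month)
const warehouse = { storage: 80, compute: 800 }; // Snowflake medium, always-on warehouse
const lakehouse = { storage: 23, compute: 120 }; // S3 + Databricks serverless, on-demand

const warehouseTotal = warehouse.storage + warehouse.compute; // 880
const lakehouseTotal = lakehouse.storage + lakehouse.compute; // 143

const savingsPct = Math.round((1 - lakehouseTotal / warehouseTotal) * 100);
console.log(`Lakehouse saves ~${savingsPct}%`); // ~84%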

When to Choose Lakehouse Over Traditional Database

Choose a lakehouse when:

  • You have >100GB of analytical data
  • Data arrives in batches (hourly/daily) rather than real-time transactions
  • You need to query semi-structured data (JSON, logs, events)
  • Cost optimization is critical (startup or high data volume)
  • You want to avoid vendor lock-in (open formats)

Choose traditional databases when:

  • You need sub-second updates (operational dashboards)
  • Data is small and queries are simple
  • You're already paying for a warehouse and it performs well
  • Team lacks lakehouse expertise

Best of both worlds: Use DataBrain's multi-datasource support to combine:

  • Operational DB (PostgreSQL) for real-time metrics
  • Lakehouse (Databricks) for historical trends
  • Single embedded dashboard joins both

Next Steps: Start Embedding Lakehouse Analytics

Ready to embed analytics from your lakehouse?

Make analytics your competitive advantage

Get in touch with us and see how DataBrain can take your customer-facing analytics to the next level.
