LLM-Generated SQL in Production: What Breaks and How to Validate It

December 19, 2025
Rahul Pattamatta
CEO

TL;DR - Executive Summary

What we learned from 50,000+ production queries:

  • Most errors are schema & join issues - Not syntax errors. Production failures are dominated by schema hallucinations (wrong table names, missing columns) and incorrect join paths (missing intermediate tables)
  • Semantic layers dramatically reduce errors - As we showed in our NLQ architecture guide, accuracy jumps from ~55% to 90%+ with semantic context. Our evaluation data reveals exactly which error types this prevents
  • Build cost vs platform: $50-80K first year (engineering time) vs 1-2 weeks integration with a platform
  • Priority order: Week 1: Schema validation → Week 2: Execution testing → Month 2: LLM-as-a-Judge semantic validation

This guide complements our previous posts: the LLM limitations post and the NLQ architecture guide.

Quick decision framework:

  • Start with schema validation (catches most errors with minimal code)
  • Add semantic layer before building complex evaluation
  • Use LLM-as-a-Judge for semantic correctness, not syntax
  • Consider build vs buy based on your scale and timeline

Most guides on SQL evaluation read like textbooks. This one doesn't.

At DataBrain, we've processed over 50,000 NL-to-SQL queries across hundreds of customer databases in production. We've seen every failure mode, tried every validation approach, and learned what actually matters vs what sounds good in theory.

How this guide builds on our previous work:

  • In our LLM limitations post, we covered why LLMs struggle with SQL generation
  • In our NLQ architecture guide, we showed how to build reliable NL-to-SQL systems
  • This post shows you how to validate and measure SQL correctness with code examples, production patterns, and a prioritization framework

The reality? Most SQL validation advice is wrong about priorities.

Teams waste months building complex logical equivalence testing when most of their errors are simple schema hallucinations that could be caught in a few lines of code. They obsess over syntax validation when syntax errors are rare in production. They build evaluation frameworks before building the one thing that actually prevents errors: a semantic layer.

Here's what we wish someone had told us 18 months ago, with the battle scars to prove it.

The Hard Truth About LLM-Generated SQL

If your SQL query runs and returns results, is it correct?

In traditional SQL development: Probably yes.
In LLM-generated SQL: Absolutely not.

In our LLM limitations analysis, we covered why this happens theoretically—LLMs lack true understanding of relational logic. But evaluating 50,000+ real production queries revealed something even more specific: Most broken queries execute successfully and return data.

The problem isn't syntax errors or execution failures. The queries run perfectly. They just answer the wrong question.

This is the heart of the LLM SQL problem: Intent errors that look like working code. And it's why traditional validation approaches (syntax checking, execution testing) fail to catch them.

The 6 Failure Modes That Matter (With Real Examples)

Let me show you exactly what breaks in production, with real queries we've seen.

1. Schema Hallucination

In our LLM limitations analysis, we covered why LLMs fabricate schema elements. But evaluating 50,000+ production queries revealed the specific patterns that break most often.

Production example from our query logs:

User asks: "Show me revenue by product category this month"

LLM generates:

SQL
SELECT 
    product_category,
    SUM(revenue) AS total_revenue
FROM products
WHERE date >= '2025-12-01'
GROUP BY product_category
ORDER BY total_revenue DESC;

Status: ✅ Valid syntax
Status: ✅ Executes successfully

Result: Returns all NULL values

The problem (production-specific):

  • In this customer's schema, revenue actually lives in the order_items table, not products
  • The date column exists as created_at in the orders table, not in products
  • LLM assumed standard e-commerce naming conventions that this schema doesn't follow

Why this is the #1 error type in our 50K+ query dataset: LLMs pattern-match against training data. They see "revenue" and "products" commonly appear together in documentation, so they assume they're in the same table. In production schemas with non-standard naming (legacy systems, acquisitions, domain-specific terminology), this assumption fails constantly.

Real production patterns we've caught:

  • Customer uses order_value but LLM generates order_amount (close, but wrong)
  • Schema has created_date but LLM uses date or timestamp
  • Finance team calls it recognized_revenue, LLM uses revenue
  • Multiple id columns with unclear naming: user_id, customer_id, account_id all pointing to different things

How to catch it programmatically:

Python
import sqlparse

def validate_schema(sql: str, schema: dict) -> dict:
    """Extract tables/columns from SQL and verify they exist"""
    parsed = sqlparse.parse(sql)[0]
    # extract_tables/extract_columns are helper functions that walk the parsed
    # token tree; a fuller, self-contained version appears in Phase 1 below
    tables = extract_tables(parsed)
    columns = extract_columns(parsed)
    
    errors = []
    for table in tables:
        if table not in schema:
            errors.append(f"Table '{table}' doesn't exist")
    for table, column in columns:
        if table in schema and column not in schema[table]:
            errors.append(f"Column '{column}' not in table '{table}'")
    
    return {"valid": len(errors) == 0, "errors": errors}

Better solution (prevention): Use a semantic layer that maps natural language terms like "revenue" to the correct table and column (order_items.order_value) automatically, regardless of your schema's naming conventions.
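
Here's a minimal sketch of that idea, assuming a hand-maintained synonym map (the map entries and resolve_terms helper below are illustrative, not DataBrain's actual API). The point is to resolve business terms to governed columns before the LLM ever sees the question:

Python
# Hypothetical synonym map: business term -> governed table.column
TERM_MAP = {
    "revenue": "order_items.order_value",
    "signup date": "customers.created_date",
    "recognized revenue": "finance.recognized_revenue",
}

def resolve_terms(question: str, term_map: dict) -> dict:
    """Return the governed columns mentioned in a question, to inject into the LLM prompt."""
    lowered = question.lower()
    return {term: column for term, column in term_map.items() if term in lowered}

# Usage
resolve_terms("Show me revenue by product category this month", TERM_MAP)
# {"revenue": "order_items.order_value"} -> included in the prompt as grounding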

2. Join Path Errors

User asks: "Which customers bought product X in the last month?"

LLM generates:

SQL
SELECT DISTINCT c.customer_name
FROM customers c
JOIN products p ON c.id = p.customer_id
WHERE p.product_name = 'Product X'
  AND p.purchase_date >= '2025-11-01';

Status: ❌ Execution fails
Error: column p.customer_id does not exist

The problem: No direct relationship between customers and products. The actual path is:

customers → orders → order_items → products

Why it happens: LLMs see the entities (customers, products) and invent plausible-looking join keys. They don't understand your actual schema relationships.

Impact: This is one of the most expensive error types because:

  • Sometimes it fails execution (you catch it)
  • Sometimes it returns wrong results through a different join path (you don't catch it)

How we fixed it at DataBrain:

Define valid relationships in your semantic layer:

YAML
# Valid join paths
customers: orders        # via customer_id
orders: order_items      # via order_id
order_items: products    # via product_id

Then validate that generated SQL only uses these paths. If LLM tries to join customers directly to products, reject it and suggest the correct path.
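
Here's a minimal sketch of that check, assuming the relationships above are loaded into a set of allowed edges (the names and the breadth-first search are illustrative, not DataBrain's implementation):

Python
from collections import deque

# Allowed relationships from the semantic layer
ALLOWED_JOINS = {("customers", "orders"), ("orders", "order_items"), ("order_items", "products")}

def is_valid_join(a: str, b: str) -> bool:
    """A join is only allowed if it is a known relationship (in either direction)."""
    return (a, b) in ALLOWED_JOINS or (b, a) in ALLOWED_JOINS

def suggest_join_path(start: str, goal: str) -> list:
    """Breadth-first search over allowed relationships to find a valid join path."""
    graph = {}
    for a, b in ALLOWED_JOINS:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []

# Usage: the LLM tried to join customers directly to products
is_valid_join("customers", "products")      # False -> reject the query
suggest_join_path("customers", "products")  # ['customers', 'orders', 'order_items', 'products']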

Result: Significantly reduced join path errors in production.

3. Wrong Aggregation Logic

User asks: "What's the average order value by customer?"

LLM generates:

SQL
SELECT 
    customer_name,
    AVG(product_price) AS avg_order_value
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN order_items oi ON o.id = oi.order_id
GROUP BY customer_name;

Status: ✅ Valid syntax
Status: ✅ Executes successfully
Result: Returns completely wrong numbers

The problem: This calculates average product price, not average order value. It should be:

SQL
SELECT 
    customer_name,
    AVG(order_total) AS avg_order_value
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY customer_name;

Why it happens: "Average order value" could mean:

  • AVG(order total) per customer ← Correct
  • AVG(product price) across customer's orders ← LLM chose this
  • SUM(order totals) / COUNT(orders) ← Also valid

LLMs make educated guesses about aggregation semantics. Sometimes they guess wrong.

How to catch it: This is where LLM-as-a-Judge becomes essential (we'll show implementation later).

4. Missing Required Filters

User asks: "Show me active customers by region"

LLM generates:

SQL
SELECT region, COUNT(*) AS customer_count
FROM customers
GROUP BY region;

Status: ✅ Valid syntax
Status: ✅ Executes successfully


Result: Includes inactive customers, trial accounts, deleted records

The problem: Missing WHERE status = 'active' filter. The business definition of "customers" has implicit filters that aren't in the raw question.

Why it matters: This is how you get CFOs making decisions on wrong metrics.

How we fixed it: Pre-defined metrics in the semantic layer:

YAML
metrics:
  active_customers:
    type: count
    table: customers
    filters:
      - status = 'active'
      - deleted_at IS NULL
      - account_type != 'trial'
    description: "Active paying customers (excludes trials and deleted)"

When someone asks about "customers," the semantic layer automatically applies the business logic.
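
Here's a minimal sketch of how a pre-defined metric might be expanded, assuming the YAML above is loaded into a Python dict (the build_metric_sql helper is illustrative):

Python
# The active_customers metric from the YAML above, loaded as a dict
METRICS = {
    "active_customers": {
        "type": "count",
        "table": "customers",
        "filters": ["status = 'active'", "deleted_at IS NULL", "account_type != 'trial'"],
    }
}

def build_metric_sql(metric_name: str, group_by: str) -> str:
    """Expand a governed metric into SQL so business-logic filters are never forgotten."""
    metric = METRICS[metric_name]
    where = " AND ".join(metric["filters"])
    return (
        f"SELECT {group_by}, COUNT(*) AS {metric_name} "
        f"FROM {metric['table']} WHERE {where} GROUP BY {group_by}"
    )

# Usage: "Show me active customers by region"
build_metric_sql("active_customers", "region")
# SELECT region, COUNT(*) AS active_customers FROM customers
# WHERE status = 'active' AND deleted_at IS NULL AND account_type != 'trial' GROUP BY region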

5. Wrong Table Selection

User asks: "Show me daily revenue"

LLM generates:

SQL
SELECT 
    DATE(created_at) as date,
    SUM(amount) as revenue
FROM payments
GROUP BY date
ORDER BY date;

Seems reasonable, right?

The problem: Your business actually uses the revenue_recognition table (not payments) because:

  • Payments include refunds
  • Revenue is recognized on delivery, not payment
  • Payments can be split across months

Result: Numbers are 15-20% off from finance reports.

Why it happens: LLMs can't infer business logic. They pick the most obvious-sounding table.

Solution: Semantic layer defines the source of truth for each metric.

6. Syntax Errors

Yes, syntax errors happen. But they're the LEAST of your problems because:

  • They're immediately obvious
  • They fail fast
  • They're easy to catch

Don't start here. Start with the semantic errors that pass syntax validation.

If You're Building SQL Evaluation: Here's the Priority Order

Based on production experience, here's the order that gives you the best ROI:

Validation Approaches Comparison

| Approach | Speed | Accuracy on Error Type | Implementation Cost | Best For |
|---|---|---|---|---|
| Schema Validation | Very fast (10ms) | High for schema errors, low for semantic | Low (1–2 days) | First line of defense, catching hallucinated tables/columns |
| Execution Testing | Fast (100ms) | Medium (catches failures, not semantic errors) | Medium (3–5 days) | Validating queries actually run |
| LLM-as-a-Judge | Slow (2–5s) | High for semantic errors | Medium (1–2 weeks) | Understanding if SQL answers the question |
| Multi-Dimensional | Slow (3–8s) | Highest (combines all approaches) | High (2–3 months) | Production-grade comprehensive evaluation |
| Semantic Layer | Fast (minimal overhead) | Highest (prevents errors at source) | High (2–4 weeks) | Preventing errors before they occur |

Key insight: Semantic layer is the most effective approach because it prevents errors during SQL generation, not after.

Phase 1: Schema Validation (Week 1)

What it does:

  • Verify tables exist
  • Verify columns exist
  • Verify join keys are valid

Effort: Low (1-2 days)
Impact: High (catches schema hallucinations and table reference errors)

Core implementation:

Python
import sqlparse

def validate_schema(sql: str, schema: dict) -> dict:
    """Validate SQL against actual database schema"""
    parsed = sqlparse.parse(sql)[0]
    errors = []

    def iter_identifiers(token_list):
        """Walk the parse tree and yield every identifier, including those in lists"""
        for token in token_list.tokens:
            if isinstance(token, sqlparse.sql.IdentifierList):
                yield from token.get_identifiers()
            elif isinstance(token, sqlparse.sql.Identifier):
                yield token
            elif token.is_group:
                yield from iter_identifiers(token)

    # Check every qualified table.column reference against the schema
    for identifier in iter_identifiers(parsed):
        if '.' not in str(identifier):
            continue  # only fully qualified references are checked in this sketch
        table, column = str(identifier).split('.', 1)
        if table not in schema:
            errors.append(f"Table '{table}' doesn't exist")
        elif column not in schema[table] and column != '*':
            errors.append(f"Column '{column}' not in table '{table}'")

    return {"valid": len(errors) == 0, "errors": errors}

# Usage
schema = {
    "customers": ["id", "name", "email"],
    "orders": ["id", "customer_id", "total"]
}

result = validate_schema("SELECT customers.name, customers.revenue FROM customers", schema)
# Returns: {"valid": False, "errors": ["Column 'revenue' not in table 'customers'"]}

Cost: ~10ms per query | ROI: Catches most common errors before execution

Phase 2: Execution Testing (Week 2)

What it does:

  • Actually run the SQL against a test database
  • Validate it executes without errors
  • Check result schema matches expectations

Effort: Medium (3-5 days)
Impact: High (catches execution failures, permission errors, type mismatches)

Core implementation:

Python
from sqlalchemy import create_engine, text

def execute_and_validate(sql: str, db_connection: str) -> dict:
    """Execute SQL and validate results"""
    engine = create_engine(db_connection)
    
    try:
        with engine.connect() as conn:
            result = conn.execute(text(sql))
            columns = list(result.keys())  # capture column names before the connection closes
            rows = result.fetchall()
        
        return {
            "success": True,
            "row_count": len(rows),
            "columns": columns
        }
    except Exception as e:
        return {"success": False, "error": str(e)}

# Usage
result = execute_and_validate(
    "SELECT name, SUM(total) FROM orders GROUP BY name",
    "postgresql://localhost/mydb"
)
# Returns: {"success": True, "row_count": 42, "columns": ["name", "total"]}

Phase 3: Semantic Validation with LLM-as-a-Judge (Month 2)

This is where it gets interesting. Schema validation and execution testing catch obvious errors. But how do you catch "correct SQL that answers the wrong question"?

Enter: LLM-as-a-Judge

The pattern:

  1. Use an LLM to evaluate if generated SQL semantically matches the question
  2. Compare against expected SQL (if you have it)
  3. Get structured feedback on what's wrong

Why it works:

  • LLMs can reason about semantic correctness
  • Catches aggregation mistakes, missing filters, wrong logic
  • Provides actionable explanations

Core implementation:

Python
import json
from openai import OpenAI

def evaluate_sql_semantics(question: str, generated_sql: str, schema: dict) -> dict:
    """Use GPT-4 as a judge to evaluate SQL correctness"""
    
    prompt = f"""You are a SQL expert. Evaluate if this SQL correctly answers the question.

Question: {question}
Generated SQL: {generated_sql}
Schema: {json.dumps(schema)}

Respond in JSON:
{{
    "correct": true/false,
    "score": 0-100,
    "issues": ["list of problems"],
    "reasoning": "explanation"
}}"""
    
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1
    )
    
    return json.loads(response.choices[0].message.content)

# Usage
result = evaluate_sql_semantics(
    question="What's the average order value for active customers?",
    generated_sql="SELECT AVG(total) FROM orders",
    schema={"orders": ["total"], "customers": ["status"]}
)
# Returns structured evaluation JSON

Cost: ~$0.01-0.05 per evaluation | Best for: Benchmark testing, not real-time validation

Cost considerations:

  • ~$0.01-0.05 per evaluation (GPT-4)
  • For 1,000 benchmark queries: ~$10-50
  • Much cheaper than engineer time debugging

When to use it:

  • Benchmark testing (compare generated vs expected SQL)
  • Production monitoring (spot-check queries for semantic drift)
  • Quality gates (require >80 score before execution)

When NOT to use it:

  • Real-time validation for every query (too slow/expensive)
  • As your only validation layer (use rules first, LLM as backup)

Phase 4: Multi-Dimensional Scoring (Month 3) - Continuous Improvement

Once you have the basics, combine them into a comprehensive scoring system:

Scoring formula:

Mixed
overall_score = (
    schema_score * 0.25 +      # Tables/columns exist
    execution_score * 0.25 +   # Query runs
    semantic_score * 0.30 +    # Answers the question
    efficiency_score * 0.20    # Optimized query
)

# Example output:
{
    "overall_score": 73.75,
    "dimension_scores": {
        "schema": {"score": 100, "weight": 0.25},
        "execution": {"score": 100, "weight": 0.25},
        "semantics": {
            "score": 65,
            "weight": 0.30,
            "feedback": "Missing active customer filter"
        },
        "efficiency": {
            "score": 80,
            "weight": 0.20,
            "feedback": "Uses SELECT *"
        }
    }
}

Benefits:

  • Single metric to track over time
  • Breakdown shows which dimension is failing
  • Weighted scores reflect what matters most
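
Here's a minimal sketch of the weighting step, assuming each validator from the earlier phases returns a 0-100 score (the weights mirror the formula above):

Python
WEIGHTS = {"schema": 0.25, "execution": 0.25, "semantics": 0.30, "efficiency": 0.20}

def overall_score(dimension_scores: dict) -> dict:
    """Combine per-dimension scores (0-100) into one weighted metric to track over time."""
    total = sum(dimension_scores[dim] * weight for dim, weight in WEIGHTS.items())
    return {
        "overall_score": round(total, 2),
        "dimension_scores": {
            dim: {"score": dimension_scores[dim], "weight": weight}
            for dim, weight in WEIGHTS.items()
        },
    }

# Usage with the example above
overall_score({"schema": 100, "execution": 100, "semantics": 65, "efficiency": 80})
# {"overall_score": 85.5, "dimension_scores": {...}}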

The Build vs Buy Decision

Let's be honest about costs.

Building SQL Evaluation In-House

  • Timeline: 2-3 weeks (MVP) → 2-3 months (production-grade)
  • Cost: ~$50-80K first year (engineering + infrastructure)
  • Ongoing: Part-time maintenance + $100-500/month (APIs + databases)

Using a Platform (DataBrain)

Setup: 1-2 weeks | Cost: platform subscription (no engineering time) | Includes: schema validation, execution testing, multi-dimensional scoring, semantic layer, historical tracking

When to Build vs Buy

Build if:

  • You have <100 benchmark queries
  • Simple schema (5-10 tables)
  • NL-to-SQL is a side feature, not core product
  • You have spare ML engineering capacity
  • You need 100% custom evaluation logic

Use a platform if:

  • Complex schema (50+ tables)
  • Multiple databases/data sources
  • NL-to-SQL is a core product capability
  • You want to ship in weeks, not months
  • You need semantic layer + evaluation integrated
  • Multi-tenant architecture with row-level security

The real question: Is SQL evaluation your competitive advantage, or is it infrastructure that supports your actual product?

If it's infrastructure, use a platform. If it's your core IP, build it.

Compare pricing options →

The Semantic Layer Advantage: Prevention > Validation

Here's the uncomfortable truth we learned: The best SQL evaluation strategy is not generating bad SQL in the first place.

As we covered in our NLQ architecture guide, semantic layers improve accuracy from ~55% to 90%+ by giving LLMs structured context. But after evaluating 50,000+ queries, we can now show you which specific error types semantic layers prevent—and how to validate that improvement.

The Traditional Approach (What We Built First)

User Question → LLM → SQL Generation → Validation → Fix if broken → Execute

Problem: You're playing defense. Validation catches errors after generation.

The Semantic Layer Approach (What We Should Have Built)

User Question → Semantic Layer Context → LLM → Constrained SQL → Execute

Advantage: The LLM generates against a governed data model, preventing errors at the source.

As we explained in our LLM limitations post, humans can't memorize schema complexity—LLMs struggle even more. Semantic layers exist to solve this exact problem by encoding schema relationships, business logic, and naming conventions upfront.

Real Impact at DataBrain

Semantic layers dramatically improve SQL accuracy: fewer errors, fewer fixes needed, faster corrections. See our NLQ architecture guide for the ~55% → 90%+ accuracy improvement data.

What a Semantic Layer Provides


1. Pre-defined Table Relationships

customers → orders → order_items → products

LLM generates joins using known valid paths. Dramatically reduces join path failures.

2. Metric Definitions

revenue: SUM(order_items.quantity * unit_price)
         WHERE orders.status = 'completed' AND refunded = false

User asks "revenue" → LLM uses pre-defined calculation. Eliminates most aggregation mistakes.

3. Column Synonyms

email_address: [email, contact_email, customer_email]

created_date: [signup_date, registration_date, joined_date]

LLM maps natural language to actual column names. Greatly reduces schema hallucinations.

4. Business Logic Filters

"customers" → automatically includes: status='active' AND deleted_at IS NULL

Default filters applied automatically. Eliminates most missing filter errors.

The Combined Effect: Measurable Through Evaluation

This is where our 50,000+ query evaluation dataset becomes valuable. We can now measure what the NLQ architecture guide theorized: semantic layers don't just improve accuracy—they change the distribution of error types.

What our evaluation framework reveals:

Without semantic layer:

  • Schema hallucinations dominate error logs
  • Join path errors in 1 out of 4 failed queries
  • Heavy reliance on post-generation validation
  • Most fixes require understanding business logic

With semantic layer:

  • Schema errors nearly eliminated (pre-defined relationships)
  • Join path errors rare (validated relationships)
  • Validation catches edge cases, not structural issues
  • Fixes are typically about edge cases, not fundamentals

The evaluation insight: Semantic layers don't just reduce error rate—they change error types. Instead of fixing "wrong table" (requires schema knowledge), you're fixing "missing edge case filter" (requires domain knowledge). The latter is far easier to address systematically.

Battle-Tested Patterns We Learned the Hard Way

Pattern 1: Validate Fast, Then Deep

Don't run expensive LLM evaluation on queries that fail basic schema checks.

Validation pipeline:

Schema validation (~10ms) → Execution test (~100ms) → LLM-as-a-Judge semantic check (2–5s, only if the cheaper checks pass)

Cost savings: By failing fast on schema/execution errors, we avoid expensive LLM calls for queries that are obviously broken.
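
Here's a minimal sketch of that pipeline, chaining the validators from Phases 1-3 and stopping at the first cheap failure (the wiring is illustrative; the function names match the earlier examples):

Python
def validate_pipeline(question: str, sql: str, schema: dict, db_connection: str) -> dict:
    """Run cheap checks first; only pay for an LLM judgment if the query survives them."""
    schema_result = validate_schema(sql, schema)              # ~10ms
    if not schema_result["valid"]:
        return {"stage": "schema", "passed": False, **schema_result}

    exec_result = execute_and_validate(sql, db_connection)    # ~100ms
    if not exec_result["success"]:
        return {"stage": "execution", "passed": False, **exec_result}

    judge = evaluate_sql_semantics(question, sql, schema)     # 2-5s and an API call
    return {"stage": "semantics", "passed": judge.get("correct", False), **judge}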

Pattern 2: Benchmark Tiers, Not Single Pass/Fail

Not all queries are equally important. Tier your benchmarks:

| Tier | Example Queries | Required Score | Test Frequency | Purpose |
|---|---|---|---|---|
| Critical | Revenue, Active Users, Churn | ≥95% | Every commit | Gate deployments |
| Important | Conversion Funnel, Retention | ≥85% | Daily | Track system health |
| Nice-to-Have | Edge cases, Complex analytics | ≥70% | Weekly | Continuous improvement |

This lets you:

  • Gate deployments on critical query accuracy
  • Track overall system health
  • Prioritize what to fix
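
Here's a minimal sketch of tier-based gating, assuming each benchmark result is tagged with its tier and scored 0-100 (the thresholds mirror the table above; the data shape is illustrative):

Python
TIER_THRESHOLDS = {"critical": 95, "important": 85, "nice_to_have": 70}

def gate_deployment(results: list) -> dict:
    """results: [{'tier': 'critical', 'score': 97}, ...] -> pass/fail report per tier."""
    by_tier = {}
    for r in results:
        by_tier.setdefault(r["tier"], []).append(r["score"])

    report = {}
    for tier, threshold in TIER_THRESHOLDS.items():
        scores = by_tier.get(tier, [])
        avg = sum(scores) / len(scores) if scores else None
        report[tier] = {
            "avg_score": avg,
            "required": threshold,
            "passed": avg is not None and avg >= threshold,
        }
    return report

# Usage: block the deploy if the critical tier slips below 95
report = gate_deployment([{"tier": "critical", "score": 97}, {"tier": "critical", "score": 92}])
report["critical"]["passed"]  # avg 94.5 < 95 -> False, gate the deployment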

Pattern 3: Collect Production Failures to Improve Benchmarks

Your benchmark queries should reflect actual usage.

Process:

  1. Log all production queries (question, SQL, success, user feedback)
  2. Weekly: Query for patterns: WHERE success = false OR user_feedback = 'incorrect'
  3. Add common failures (≥3 occurrences) to benchmark suite
  4. Re-test to ensure they're fixed

Result: Benchmark suite evolves from hand-written queries to cover real user patterns
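
Here's a minimal sketch of the weekly mining step, assuming production queries are logged to a query_log table with question, success, and user_feedback columns (the table and column names are illustrative):

Python
from sqlalchemy import create_engine, text

# Repeated failures (>=3 occurrences) become benchmark candidates
FAILURE_PATTERNS_SQL = """
SELECT question, COUNT(*) AS failures
FROM query_log
WHERE success = false OR user_feedback = 'incorrect'
GROUP BY question
HAVING COUNT(*) >= 3
ORDER BY failures DESC
"""

def find_benchmark_candidates(db_connection: str) -> list:
    """Return questions that failed repeatedly in production, ready to add to the benchmark suite."""
    engine = create_engine(db_connection)
    with engine.connect() as conn:
        rows = conn.execute(text(FAILURE_PATTERNS_SQL)).fetchall()
    return [{"question": question, "failures": failures} for question, failures in rows]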

Pattern 4: Monitor Semantic Drift Over Time

Your data changes. Your LLM may change. Track accuracy trends:

What to monitor:

  • Weekly average evaluation scores
  • Score drop >5 points = alert
  • Query volume trends

Catches:

  • Schema changes breaking queries
  • LLM updates changing behavior
  • New query patterns the system can't handle

Implementation: Track AVG(overall_score) by week, alert on significant drops
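
Here's a minimal sketch of that alert, assuming weekly AVG(overall_score) values have already been computed (the 5-point threshold comes from the bullet above; the data shape is illustrative):

Python
DRIFT_THRESHOLD = 5  # alert if the weekly average drops by more than this many points

def check_semantic_drift(weekly_avg_scores: list) -> dict:
    """weekly_avg_scores: chronological list of AVG(overall_score) per week."""
    if len(weekly_avg_scores) < 2:
        return {"alert": False, "reason": "not enough history"}
    previous, current = weekly_avg_scores[-2], weekly_avg_scores[-1]
    drop = previous - current
    return {"alert": drop > DRIFT_THRESHOLD, "previous": previous, "current": current, "drop": round(drop, 2)}

# Usage
check_semantic_drift([88.2, 87.9, 81.4])
# {"alert": True, "previous": 87.9, "current": 81.4, "drop": 6.5} -> investigate schema or LLM changes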

Common Mistakes to Avoid

Mistake 1: Building Logical Equivalence Testing First

What teams do: Spend months building complex query comparison that checks if two queries are semantically equivalent.

Why it's wrong: This catches very few real errors but takes most of the development time.

Do this instead: Start with schema validation. It catches far more errors with much less effort.

Mistake 2: Only Testing Happy Paths

What teams do: Benchmark queries like "What's total revenue?"

Why it's wrong: Real users ask messy questions: "show me customers who bought X but not Y in Q3 excluding refunds."

Do this instead: Include edge cases in your benchmark suite:

  • Multi-hop joins
  • Complex filters with AND/OR
  • Negations ("customers who did NOT...")
  • Time-based aggregations
  • NULL handling

Mistake 3: Treating All Errors Equally

What teams do: PASS/FAIL scoring where 79% = 99% = FAIL

Why it's wrong: A missing ORDER BY is not the same as using the wrong table for revenue.

Do this instead: Severity-based scoring:

| Error Type | Severity | Weight | Impact |
|---|---|---|---|
| Wrong table | Critical | 1.0 | Returns completely wrong data |
| Missing join | Critical | 1.0 | Incomplete or incorrect results |
| Wrong aggregation | Major | 0.8 | Incorrect calculations |
| Missing filter | Major | 0.6 | Includes wrong data subset |
| Inefficient query | Minor | 0.2 | Performance issue only |
| Missing ORDER BY | Minor | 0.1 | Cosmetic issue |
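
Here's a minimal sketch of severity-weighted scoring using the weights above (the issue labels are illustrative): instead of a binary PASS/FAIL, each detected issue subtracts a weighted penalty, so a cosmetic problem barely moves the score while a wrong table tanks it.

Python
SEVERITY_WEIGHTS = {
    "wrong_table": 1.0,
    "missing_join": 1.0,
    "wrong_aggregation": 0.8,
    "missing_filter": 0.6,
    "inefficient_query": 0.2,
    "missing_order_by": 0.1,
}

def severity_score(issues: list) -> float:
    """Start at 100 and subtract a weighted penalty per detected issue (floored at 0)."""
    penalty = sum(SEVERITY_WEIGHTS.get(issue, 0.5) * 100 for issue in issues)
    return max(0.0, 100.0 - penalty)

# Usage
severity_score(["missing_order_by"])  # 90.0 -> cosmetic, still usable
severity_score(["wrong_table"])       # 0.0  -> completely wrong data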

Mistake 4: Not Validating Validation

What teams do: Assume their evaluation system is correct.

Why it's wrong: We've found cases where "failing" queries were actually correct—our validator was wrong.

Do this instead:

  • Manually review random samples of "failures"
  • Track false positive rate
  • Have humans grade a subset, compare to LLM-as-a-Judge scores

What You Should Do Next

If evaluating NL-to-SQL solutions: Ask vendors about validation approach (not just syntax), semantic layer usage, and request benchmark results with your schema complexity.

If building in-house: Start with schema validation + execution testing + 20 benchmark queries. Skip complex logical equivalence and elaborate UIs initially.

If you want to skip the build: DataBrain includes multi-dimensional evaluation, semantic layer, benchmark management, and LLM-as-a-Judge validation built-in. Start free trial →

Frequently Asked Questions (FAQ)

What is the best way to validate LLM-generated SQL?

The best approach combines multiple layers:

  1. Start with schema validation (fast, catches most common errors like hallucinated tables/columns)
  2. Add execution testing (verifies queries actually run)
  3. Use a semantic layer for prevention (stops errors before generation)
  4. Implement LLM-as-a-Judge for semantic correctness (validates the SQL answers the actual question)

Key insight: Don't rely on a single validation method. Schema validation + semantic layer + selective LLM evaluation gives the best results in production.

Should I build or buy SQL evaluation tools?

Build if:

  • You have <100 benchmark queries and simple schema
  • NL-to-SQL is a side feature, not core to your product
  • You have spare ML/data engineering capacity
  • Total cost: ~$50-80K first year (engineering time)

Buy/Use a platform if:

  • Complex schema (50+ tables) or multiple databases
  • NL-to-SQL is core to your product
  • You need to ship in weeks, not months
  • You want semantic layer + evaluation integrated
  • Setup time: 1-2 weeks

The deciding factor isn't capability (you can build anything) but opportunity cost. Is SQL evaluation your competitive advantage, or supporting infrastructure?

How do you catch semantic errors in SQL?

Semantic errors (correct syntax, wrong logic) are the hardest to catch. Here's how:

1. Semantic Layer (Prevention)

  • Define relationships, metrics, and business logic upfront
  • LLM generates against governed data model
  • Prevents most semantic errors before they occur

2. LLM-as-a-Judge (Detection)

  • Use a second LLM to evaluate if generated SQL matches question intent
  • Compares semantic meaning, not just syntax
  • Provides actionable feedback on what's wrong

3. Golden Dataset Testing

  • Maintain benchmark queries with expected outputs
  • Test new SQL generation against known-good examples
  • Track accuracy trends over time

4. Multi-Dimensional Scoring

  • Evaluate across syntax, semantics, execution, efficiency, and schema
  • Weighted scoring shows which dimension is failing
  • Helps prioritize fixes

In production, the semantic layer is your best defense. Validation catches what slips through.

What are the most common LLM SQL generation errors?

Based on our 50,000+ query evaluation dataset, here are the top failure modes ranked by frequency and production impact:

  1. Schema hallucination - LLM invents tables or columns that sound right but don't exist
    • Production examples: revenue in products table (actually in order_items), date column that's actually called created_at, financial metrics like recognized_revenue that LLM shortens to revenue
    • Why it happens: Non-standard naming conventions, legacy schemas, domain-specific terminology
  2. Incorrect join paths - Wrong relationships between tables or missing intermediate joins
    • Production examples: Direct join from customers to products (missing orders and order_items intermediate tables)
  3. Wrong aggregation logic - AVG vs SUM, incorrect grouping, mixing aggregated and raw data
  4. Missing required filters - Business logic filters (like "active customers only") not applied
  5. Wrong table selection - Choosing denormalized table when granular fact table is needed

Critical insight from evaluation data: Syntax errors are rare. The vast majority of failures are semantic issues that pass syntax validation but return wrong results. This is why traditional testing approaches (syntax checking, execution testing) catch <10% of real production errors.

How much does it cost to build SQL evaluation in-house?

DIY Timeline:

  • Phase 1 (Minimum viable): 2-3 weeks
  • Phase 2 (Production-grade): 2-3 months
  • Ongoing: Part-time engineering effort

Costs:

  • First year: ~$50-80K (engineering time)
  • Ongoing: $100-500/month (LLM API + test DB)

Platform alternative: 1-2 weeks setup, no engineering investment

Compare options →

Can I use the code examples from this guide?

Yes! All code examples are MIT licensed and production-tested. Install with: pip install sqlparse sqlalchemy openai

Conclusion

After extensive production experience with SQL evaluation, here's what matters:

  1. Schema validation catches the most common errors with minimal code. Start here.

  2. Semantic layer prevents most errors at the source. Build this before complex validation.

  3. LLM-as-a-Judge is the right tool for semantic correctness, but use it strategically (not for every query).

  4. Multi-dimensional scoring beats binary PASS/FAIL. Track trends over time.

  5. Validation is infrastructure, not competitive advantage. Consider whether you should build it.

The goal isn't perfect SQL generation—that's impossible. The goal is catching errors before they reach users, and continuously improving accuracy over time.

If you're building NL-to-SQL, you need evaluation. The question is: will you spend 3 months building it, or 3 days integrating it?

Ready to see it in action? Explore DataBrain's embedded analytics platform →

Requirements & Code Dependencies

Core dependencies for examples in this guide:

pip install sqlparse sqlalchemy openai

Note: All code examples are production-tested patterns and MIT licensed for your use.

Questions? Want to discuss your SQL evaluation strategy?

📧 support@usedatabrain.com

📖 Read the docs

Accessibility Note

All code examples in this article include descriptive context. For screen reader users:

  • Code blocks are preceded by explanatory text describing their purpose
  • Python examples demonstrate schema validation, execution testing, and LLM-based evaluation patterns
  • SQL examples show both incorrect (with issues) and correct (fixed) query versions

Make analytics your competitive advantage

Get in touch with us and see how Databrain can take your customer-facing analytics to the next level.
