Building Agentic Text-to-SQL: Why RAG Fails on Enterprise Data Lakes

I spent this weekend building a text-to-SQL system that actually works. Not a demo that falls apart when you ask it something tricky — a real system that explores databases, fixes its own mistakes, and costs 85% less than the standard approach.

The spark was a recent paper on “Agentic NL2SQL” that claimed you could cut LLM token usage by 87% while keeping accuracy. The researchers tested on simulated enterprise databases. I wanted to see if that holds up on real cloud infrastructure with real constraints.

So I picked Snowflake (a data lake platform in practice), loaded the Spider 2.0 dataset with 100+ databases and thousands of tables, and started hacking. The catch? I was on Snowflake’s free tier with queued access, and the dataset had zero documentation. If the agentic approach works here, it works anywhere.

Image generated by author using AI

The Problem Everyone’s Ignoring

Most production text-to-SQL systems still follow the “Direct Solver” approach: dump the entire schema into a prompt and pray the LLM figures it out.

This works for small databases (10–20 tables). But beyond that, you hit a performance cliff:

LLM reasoning degrades as prompt size grows.
Beyond ~1,000 tokens, tabular reasoning starts to collapse.
Enterprise lakes routinely require 50,000+ tokens just for metadata.

In the Datalake Agent paper, Direct Solver performance cracked at 159 tables and collapsed at 319. Flood the model with irrelevant metadata, and your signal gets drowned.

Why I Tried RAG First (And Why It Failed)

Problem 1: You’re Still Drowning in Context

Even if RAG retrieves the top 20 tables, you’re still stuffing 1,000+ columns into the prompt. You shrank the haystack — you didn’t solve it.

Problem 2: Schemas Don’t Embed Well

Table names like dim_cust_geo_v2 are meaningless until you inspect columns and sample data. Embeddings can’t recover structure without peeking.

Problem 3: Ambiguity Kills

A column called city could be a string, JSON, foreign key, or array. RAG makes the model guess the shape — and guesses fail.

Problem 4: No Descriptions = No Semantics

In real data lakes, tables are cryptic and undocumented. Semantic search becomes a coin flip.

What Actually Works: Hierarchical Discovery

Instead of dumping everything at once, an agent explores the lake like a human analyst:

List databases
Pick the most relevant one
List schemas and tables
Deep dive into 3–5 candidates
Peek at sample rows before writing SQL

Traditional: Dump 3,000 tables → 50,000 tokens → Generate SQL → Cross fingers

Agentic: Discover hierarchically → ~650 tokens total → Generate SQL with evidence

Building the Snowflake Agent

Snowflake is a great stress test: case sensitivity, JSON columns, multi-schema layouts, and I was running on free tier queueing. If an agent survives this, it survives enterprise reality.

The Discovery Flow

Step 1: Database discovery
Step 2: Schema exploration
Step 3: Table filtering
Step 4: Deep dive with schema + sample rows

Agent explores databases in strict hierarchy

Demo (Real World)

Real world Demo

In the demo, I first asked for the number of databases and top 20 database details. Then I asked:

“Show me the list of airports from the city Moscow.”

The agent correctly navigated to the airlines database, made tool calls (list schema → list tables → get details with parallel peek → execute → self-heal), then returned the final answer.

The Critical Innovation: The Peek

For each table the agent selects, it fetches:

Schema (columns + types)
One sample row

That single row reveals the real structure. Example:

Example Data


{"en": "St. Petersburg", "ru": "Санкт-Петербург"}

Without the peek, the model writes a naive filter and fails on JSON. With it, it uses PARSE_JSON(city):en::string and succeeds.

The Latency Problem (And How I Fixed It)

Querying INFORMATION_SCHEMA sequentially took 2+ minutes on free tier. So I parallelized schema+peek per table.

Python


from concurrent.futures import ThreadPoolExecutor, as_completed

def get_table_details(database, schema, table_names):
    def describe_and_peek(table_name):
        results = []

        desc_query = f"DESCRIBE TABLE {database}.{schema}.{table_name}"
        desc_df = execute_query(desc_query)
        results.append(f"=== {table_name} Schema ===")
        results.append(desc_df.to_string())

        try:
            sample_query = f"SELECT * FROM {database}.{schema}.{table_name} LIMIT 1"
            sample_df = execute_query(sample_query)
            results.append("=== Sample Row ===")
            results.append(sample_df.to_string())
        except Exception as e:
            results.append(f"Couldn't fetch sample: {str(e)}")

        return "\n".join(results)

    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(describe_and_peek, t) for t in table_names]
        return "\n\n".join([f.result() for f in as_completed(futures)])

Discovery time dropped from 120+ seconds to under 3 seconds.

Self‑Healing SQL

SQL is unforgiving. So instead of failing loudly, the agent retries with error-aware corrections:

Python


def execute_with_retry(sql, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return snowflake_client.execute_query(sql)
        except Exception as error:
            if attempt == max_attempts - 1:
                return f"Failed after {max_attempts} attempts: {error}"

            correction_prompt = f'''
Your SQL failed with this error:
{error}

Original SQL:
{sql}

Analyze the error and generate corrected SQL.
'''
            sql = agent.generate_sql(correction_prompt)

What I Learned

Performance Numbers

Average query time: 8–15 seconds (discovery + execution)
Token reduction: ~85% vs full-schema prompting
First-try success: 78%
After self-correction: 94%
Parallel speedup: ~40× metadata fetching

Why This Approach Wins

The future isn’t better retrieval — it’s better reasoning. Agentic exploration beats passive RAG for structured lakes at scale.

Pros & Cons

✓ Pros

• Massive token savings in large lakes
• Handles undocumented / cryptic schemas
• “Peek” removes column ambiguity
• Self-healing improves reliability
• Parallel discovery makes it usable

✗ Cons

• Requires tool access to metadata + samples
• More moving parts than simple RAG bots
• Needs guardrails to avoid exploration loops

Try It Yourself

Full source code and examples available on GitHub

View on GitHub →

Need Help Implementing This?

I offer consulting services for Text-to-SQL systems. Let's discuss your requirements.

Schedule a Consultation

References

“Agentic NL2SQL to Reduce Computational Costs” — Jehle et al., NeurIPS 2025 (arXiv:2510.14808)