Image Generated by Author Using AI
You ask your database: "Show me customers who spent more than average last quarter."
The AI generates SQL. Executes it. Returns an error.
Tries again. Same error, different phrasing.
Third attempt. Still broken.
This happens constantly with Text-to-SQL systems. Even GPT-4 fails on complex joins, ambiguous columns, and aggregation logic. Most systems retry blindly, making the same mistakes in slightly different ways.
A recent paper from Max Planck Institute and AWS GenAI introduced SQL-of-Thought, a multi-agent framework that categorizes 31 types of SQL errors and fixes them systematically. It hits 91.59% accuracy on the Spider benchmark.
Instead of just explaining the research, I built a working demo to see how this approach handles real errors.
Here's what I learned.
The Problem with Single-Agent Text-to-SQL
Traditional approaches use one LLM call to generate SQL directly from natural language.
When the query fails, they retry. Maybe adjust the prompt. Try again.
No systematic understanding of what went wrong. No targeted fix based on error type.
Common failures include:
- Schema mismatches (wrong table or column names)
- Ambiguous column references (same name in multiple tables)
- Missing joins between related tables
- Incorrect aggregation functions
- Wrong filter conditions
The paper's insight: categorize errors into 31 types across 9 categories, then apply targeted corrections based on error classification.
Understanding the Paper's Architecture
Before diving into the demo, here's how SQL-of-Thought actually works.
Figure 1 from Research Paper - Multi-Agent Pipeline
The Multi-Agent Pipeline
The framework orchestrates six specialized agents in sequence:
- Schema Linking Agent receives the natural language question and database schema, then identifies relevant tables, columns, primary keys, and foreign key relationships.
- Subproblem Agent takes the schema-linked output and breaks the query into clause-level components: WHERE conditions, JOIN requirements, GROUP BY aggregations, ORDER BY sorting, HAVING filters, LIMIT clauses.
- Query Plan Agent uses chain-of-thought reasoning to create a step-by-step execution plan. This agent thinks through the problem procedurally before writing any SQL.
- SQL Agent converts the query plan into executable SQL, handling syntax and formatting.
- DB Execution Engine runs the query against the database.
- Correction Loop activates on failure:
  - Correction Plan Agent analyzes the error using the taxonomy, identifies the root cause, and generates a fix strategy
  - Correction SQL Agent regenerates the query based on the correction plan
  - The loop repeats until success or the maximum number of attempts is reached
The key innovation: when errors occur, the system doesn't just retry. It diagnoses the error type and applies a targeted fix.
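To make the flow concrete, here's a minimal orchestration sketch in Node.js. The run* agent functions and executeSql are hypothetical stand-ins for the LLM calls and database access described above; the paper doesn't prescribe this interface, so treat it as one way to wire the loop together.

// Sketch of the SQL-of-Thought pipeline. The run* functions and
// executeSql are illustrative stand-ins, not the paper's API.
async function textToSql(question, schema, maxAttempts = 3) {
  const linked = await runSchemaLinking(question, schema);
  const subproblems = await runSubproblem(question, linked);
  const plan = await runQueryPlan(question, linked, subproblems);
  let sql = await runSqlAgent(plan, linked);

  for (let attempt = 0; attempt <= maxAttempts; attempt++) {
    try {
      return await executeSql(sql); // success: return the result rows
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // Correction loop: diagnose via the taxonomy, then rewrite the SQL.
      const correction = await runCorrectionPlan(err.message, sql, linked);
      sql = await runCorrectionSql(correction, sql, linked);
    }
  }
}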
The Error Taxonomy
Figure 2 from Research Paper - Error Taxonomy
The correction system categorizes SQL failures into 9 major categories with 31 specific error types.
Table 1 from Research Paper - Error Taxonomy
Why This Taxonomy Matters
Traditional systems see only execution errors: "column not found" or "ambiguous reference." They retry with slight variations, often repeating the same mistake.
SQL-of-Thought categorizes the error type and applies the known solution pattern. If it's ambiguous_col, add table prefixes. If it's join_missing, identify the join path through foreign keys. If it's agg_no_groupby, add GROUP BY for non-aggregated columns.
The taxonomy transforms debugging from guesswork into systematic diagnosis.
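To illustrate, the category-to-fix mapping can be as simple as a lookup table the correction agent consults. The keys below are adapted from the paper's taxonomy labels; the strategy strings are my paraphrase, not the paper's wording.

// Illustrative subset of the taxonomy-to-fix mapping.
const FIX_STRATEGIES = {
  "schema_link.ambiguous_col":
    "Prefix every column reference with its table alias.",
  "join.join_missing":
    "Trace the foreign-key path between the tables and add the missing JOIN.",
  "agg.agg_no_groupby":
    "Add a GROUP BY listing every non-aggregated SELECT column.",
};

function fixStrategyFor(errorType) {
  // Fall back to a generic instruction for unmapped error types.
  return FIX_STRATEGIES[errorType] ?? "Re-derive the query plan from the question.";
}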
Building the Demo
I wanted to see this work with real data and real errors.
Database Choice
I used Chinook, a SQLite database modeling a music store. It has 11 tables: Artist, Album, Track, Invoice, Customer, Employee, and more. Realistic complexity with foreign keys and multiple join paths.
Tech Stack
- Backend: Node.js + Express + DuckDB (SQLite compatibility)
- Frontend: Vite + vanilla JavaScript
- LLM: OpenAI GPT-4o-mini (fast and cost-effective)
- Real-time updates: Server-Sent Events to stream each agent step
The 6 Specialized Agents
Agent Architecture - Image by Author
Following the paper's architecture, I built six specialized agents in Node.js:
1. Schema Linking Agent
Identifies relevant tables and columns from the database schema.
Takes the natural language question and full schema, and returns only the subset needed for the query.
For "top selling tracks," it identifies Track, InvoiceLine, Album, and Artist tables with their relationships.
2. Subproblem Agent
Breaks the question into SQL clause components.
Takes the question and linked schema, outputs structured breakdown of SELECT, FROM, JOIN, WHERE, GROUP BY, ORDER BY, LIMIT clauses.
Example breakdown:
{
  "SELECT": "Track.Name, SUM(revenue)",
  "FROM": "Track, InvoiceLine",
  "JOIN": "Track.TrackId = InvoiceLine.TrackId",
  "GROUP BY": "Track.Name",
  "ORDER BY": "revenue DESC",
  "LIMIT": "5"
}
3. Query Plan Agent (Chain-of-Thought)
Creates step-by-step execution plan with reasoning for each action.
Takes question, schema, and subproblems, outputs numbered steps explaining the logic:
- Step 1: Start with Track table (contains track names)
- Step 2: JOIN InvoiceLine (links tracks to revenue via TrackId)
- Step 3: Calculate SUM(UnitPrice × Quantity) for total revenue
- Step 4: GROUP BY Track.Name to aggregate per track
- Step 5: ORDER BY revenue DESC, LIMIT 5 for top results
This reasoning step prevents logical errors before SQL generation.
4. SQL Generation Agent
Converts the plan into executable SQL.
Takes the query plan and schema, outputs syntactically correct SQL with proper table aliases and join conditions.
5. SQL Execution
Runs the generated SQL against DuckDB.
Success returns results. Failure captures the error message and triggers correction.
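In the demo this is a thin wrapper around the duckdb Node.js package, roughly like the sketch below (the database file name is illustrative):

import duckdb from "duckdb";

const db = new duckdb.Database("chinook.duckdb");

// Promise wrapper around DuckDB's callback API. On failure, the
// rejected error is what feeds the correction loop.
function executeSql(sql) {
  return new Promise((resolve, reject) => {
    db.all(sql, (err, rows) => (err ? reject(err) : resolve(rows)));
  });
}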
6. Correction Agents (The Critical Innovation)
When execution fails, two agents collaborate:
- Correction Plan Agent runs DESCRIBE queries on problematic tables to inspect the actual schema. It classifies the error using the taxonomy (schema_link.ambiguous_col, join.wrong_condition, etc.) and identifies root cause with fix strategy.
- Correction SQL Agent applies the fix by rewriting SQL based on the correction plan. It uses actual column names from DESCRIBE output rather than guessing. Maximum 3 correction attempts with different strategies.
This correction loop is what makes the system robust. No blind retry; diagnostic analysis and targeted repair.
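Here's a sketch of the correction plan step, reusing the executeSql helper above; classifyError is a hypothetical name for the LLM call that maps the failure onto the taxonomy.

// Inspect real schemas, then classify the failure against the taxonomy.
async function buildCorrectionPlan(dbError, failedSql, tables) {
  const actualSchemas = {};
  for (const table of tables) {
    // DESCRIBE returns the table's actual columns and types in DuckDB.
    actualSchemas[table] = await executeSql(`DESCRIBE ${table}`);
  }
  // classifyError is a stand-in LLM call returning
  // { errorType, rootCause, fixStrategy }.
  return classifyError({ error: dbError.message, sql: failedSql, actualSchemas });
}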
The UI shows each agent's output in real-time. You watch schema linking, see the query plan being built, view generated SQL, and observe error correction when needed.
Demo 1: Simple Query Success
Let's start with a straightforward query.
Query: "What are the top 5 best-selling tracks by total revenue?"
Demo 1 - Simple Query Success - Image by Author
Here's what happens behind the scenes:
- Schema Linking identifies the tables needed: Track (for names), InvoiceLine (for sales), and the relationship TrackId connecting them.
- Subproblem decomposition breaks it into: SELECT track name and revenue sum, JOIN Track with InvoiceLine, GROUP BY track name, ORDER BY revenue descending, LIMIT to 5.
- Query Plan reasons through the logic: start with Track table, join to InvoiceLine via TrackId, calculate revenue as UnitPrice × Quantity, aggregate per track, sort by total, take top 5.
- SQL Generation produces:
SELECT t.Name AS TrackName,
       SUM(il.UnitPrice * il.Quantity) AS TotalRevenue
FROM Track t
JOIN InvoiceLine il ON t.TrackId = il.TrackId
GROUP BY t.Name
ORDER BY TotalRevenue DESC
LIMIT 5;
Result: ✅ Success on the first attempt. 5 rows returned in 2.8 seconds.
This shows the pipeline working smoothly when no errors occur.
Demo 2: Ambiguous Column Error with Guided Correction
Now the interesting part.
Query: "Get all invoice items with their unit price and track unit price"
Demo 2 - Ambiguous Column Error - Image Generated by Author
The problem: Both InvoiceLine and Track tables have a UnitPrice column. The initial SQL failed with "ambiguous column reference."
How the correction worked:
- Correction Plan Agent classified the error as schema_link.ambiguous_col
- It ran DESCRIBE InvoiceLine and DESCRIBE Track to see the actual schemas
- It identified that both tables contain UnitPrice
- It generated a fix strategy: add table prefixes to disambiguate
- Correction SQL Agent regenerated the SQL with proper table aliases
Result: ✅ Success on the second attempt after targeted correction.
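The regenerated query looked roughly like this (column list abridged; the aliases resolve the ambiguity):

SELECT il.InvoiceLineId,
       il.UnitPrice AS InvoiceUnitPrice,
       t.UnitPrice AS TrackUnitPrice
FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId;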
Error Taxonomy in Action
Here's how the taxonomy guides corrections for different error types:
Error Categories and Solutions - Image by Author
Each category has a known solution pattern. The correction agent identifies the category and applies the appropriate fix.
Why This Approach Works
Systematic Error Handling
Instead of random retry, each error type has a diagnostic procedure and fix strategy. The agent knows what to check and how to repair it.
Dynamic Schema Inspection
The correction agent runs DESCRIBE queries to see actual table structures. No hardcoded schema knowledge; it adapts to any database.
Multi-Agent Collaboration
Each agent specializes:
- Schema linking focuses on identifying relevant tables
- Query planning focuses on logical flow
- SQL generation focuses on syntax
- Correction focuses on debugging
This separation reduces cognitive load on each LLM call.
Transparency
Users see every step. When errors occur, they understand what went wrong and how it was fixed. This builds trust in the system.
Limitations and Tradeoffs
1. No Value Retrieval Agent
The paper's agents don't retrieve sample data from tables to inform filtering decisions.
Example limitation: For "show me customers in California," the system must guess whether the database uses "California," "CA," "calif," or another variant.
A value retrieval agent could run:
SELECT DISTINCT State FROM Customer LIMIT 20;
This would show actual values: ["CA", "NY", "TX"…] and inform the WHERE clause construction (see the sketch after the list below).
Without this, queries fail when:
- State abbreviations vs. full names differ from the user's phrasing
- Date formats are ambiguous (MM/DD/YYYY vs YYYY-MM-DD)
- Category names use internal codes ("cat_001" vs "Electronics")
- Case sensitivity matters ("active" vs "Active")
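A minimal version of such an agent is easy to sketch. This is my extension, not part of the paper; it reuses the executeSql helper from earlier.

// Hypothetical value retrieval agent: sample distinct values so the
// SQL agent can match the user's phrasing to what's actually stored.
async function sampleColumnValues(table, column, limit = 20) {
  const rows = await executeSql(
    `SELECT DISTINCT ${column} FROM ${table} LIMIT ${limit}`
  );
  return rows.map((row) => row[column]);
}

// Usage: inject real state codes into the prompt before filtering.
// const states = await sampleColumnValues("Customer", "State");
// prompt += `\nKnown State values: ${JSON.stringify(states)}`;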
2. Token Costs
Multi-agent architecture uses more tokens than single-call approaches. The demo averaged:
- Simple query: ~2,000 tokens ($0.003)
- Complex query with correction: ~5,500 tokens ($0.008)
Production systems would need cost optimization through hybrid models (reasoning models for planning, cheaper models for execution).
3. Latency
Six agent calls take longer than one call. Average end-to-end time:
- Single-agent baseline: 1.8 seconds
- SQL-of-Thought: 4.2 seconds
The accuracy gain justifies the latency for most use cases.
Future Improvements
Several directions could enhance the demo:
- Fine-tuned models for specific agent roles. A small model trained on schema linking could reduce costs while maintaining accuracy.
- Query execution plan visualization. Show users how the database actually executes the generated SQL, not just the SQL itself.
- Support for more databases. PostgreSQL, MySQL, SQL Server each have dialect-specific quirks that need custom correction strategies.
- Interactive correction guidance. Let users provide hints when automatic correction fails, teaching the system new error patterns.
- Performance optimization tracking. Not just correctness; also measure query execution time and suggest index improvements.
Conclusion
Image Generated by Author Using AI
The SQL-of-Thought paper introduces a practical framework for reliable Text-to-SQL translation.
The key insight: categorize errors systematically and apply targeted fixes based on error type. This transforms SQL generation from a trial-and-error process into a diagnostic system.
Building the demo confirmed that multi-agent architecture with taxonomy-guided correction works on real data. The approach handles ambiguous columns, missing joins, wrong aggregations, and other common failures that break single-agent systems.
The 91.59% accuracy on the Spider benchmark isn't just a number. It represents a systematic approach to error handling that makes Text-to-SQL reliable enough for production use.
Try It Yourself
Full source code and working demo available on GitHub
Need Help Implementing Multi-Agent Text-to-SQL?
I offer consulting services for advanced Text-to-SQL systems. Let's discuss your requirements.
Schedule a Consultation