
Database Schema as Context

AI agents that interact with databases (writing queries, reviewing migrations, generating code) perform better when they have access to structured schema context. Without it, agents guess table names, invent columns, miss constraints, and ignore relationships.

This guide covers how companies solve this problem and how Atlas organizes schema files to provide context for AI coding assistants like Claude Code, Codex, and Cursor. It closes with a benchmark comparing query accuracy with and without schema context.

How Companies Approach This

Several companies have published approaches to improving text-to-SQL accuracy through schema context:

  • Pinterest uses a multi-stage pipeline: schema linking identifies relevant tables, retrieval finds similar past queries as examples, then SQL is generated with both DDL and examples in context.
  • AWS recommends augmenting raw DDL with column descriptions and business rules as SQL comments.
  • Tiger Data embeds descriptions in PostgreSQL using COMMENT ON statements. Their benchmark showed accuracy improving from 58% to 86% with semantic descriptions versus raw DDL.
  • Google Cloud uses vector search to retrieve only the relevant subset of a large schema per query.

The common pattern: give the AI more structured context about the database. The differences are in how that context is organized and delivered.
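The AWS and Tiger Data approaches, for example, attach descriptions directly to the database catalog. A minimal PostgreSQL sketch (the table and wording here are hypothetical, not taken from either benchmark):

```sql
-- Hypothetical example of embedding semantic descriptions in the catalog.
COMMENT ON TABLE orders IS
  'One row per customer order; soft-deleted rows have deleted_at set.';
COMMENT ON COLUMN orders.status IS
  'Lifecycle state: pending -> paid -> shipped. Cancelled orders keep their rows.';
```

Tools that introspect the database can then surface these descriptions alongside the raw DDL.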

Types of Database Context

AI agents need three kinds of context to work with databases:

| Context Type | What It Contains | Example |
| --- | --- | --- |
| Schema context | Tables, columns, types, constraints, dependencies, views | tasks.assignee_id is a FK to users.id |
| Business context | Domain rules, naming conventions, data semantics | quality_status = 'rework' means the batch failed QA but can be salvaged |
| Usage context | Common query patterns, existing views | There's a daily_production_summary materialized view, so don't recompute it |

Most tools provide none of these. The agent has to discover everything on its own: which table stores production data, what valid status values exist, or whether there's already a view that answers the question.
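Without any provided context, an agent typically falls back on catalog introspection. In PostgreSQL, that discovery loop looks something like this (standard information_schema and pg_catalog queries, shown for illustration):

```sql
-- What tables exist?
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema NOT IN ('pg_catalog', 'information_schema');

-- What columns does a candidate table have?
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'manufacturing' AND table_name = 'production_runs';

-- What values does an enum allow?
SELECT t.typname, e.enumlabel
FROM pg_type t JOIN pg_enum e ON e.enumtypid = t.oid;
```

Every one of these round trips costs tokens and time, and none of them reveals business rules like "exclude in-progress runs from yield calculations."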

Schema Code as Context

With a single atlas schema inspect command, Atlas exports any database into organized, version-controlled SQL files. Atlas supports PostgreSQL, MySQL and MariaDB, SQL Server, SQLite, ClickHouse, Redshift, CockroachDB, Oracle, Snowflake, Spanner, and more.

Instead of a single dump (like pg_dump or mysqldump), the output is modular SQL files with clear dependencies that AI agents can read and work with like any other code in the repository.

To demonstrate, we ran a benchmark on an example schema with 43 tables, 47 enum types, 64 functions, and 13 views across two schemas. After running atlas schema inspect, the output looks like this:

sql/
├── main.sql                                  # Import manifest
└── schemas/
    ├── manufacturing/
    │   ├── manufacturing.sql                 # Schema definition
    │   ├── tables/
    │   │   ├── equipment.sql
    │   │   ├── production_lines.sql
    │   │   ├── production_runs.sql
    │   │   └── maintenance_work_orders.sql
    │   ├── types/
    │   │   ├── enum_equipment_status.sql
    │   │   └── enum_quality_status.sql
    │   ├── functions/
    │   │   ├── calculate_oee.sql
    │   │   └── schedule_preventive_maintenance.sql
    │   └── views/
    │       ├── daily_production_summary.sql
    │       └── equipment_health_dashboard.sql
    └── public/
        ├── tables/
        │   ├── users.sql
        │   ├── projects.sql
        │   └── tasks.sql
        ├── types/
        │   ├── enum_task_status.sql
        │   └── enum_priority_level.sql
        ├── functions/
        │   └── calculate_project_progress.sql
        └── views/
            └── project_portfolio_dashboard.sql

Self-Describing Files

Each SQL file contains the complete DDL for one object: columns, types, constraints, indexes, and checks:

sql/schemas/manufacturing/tables/production_runs.sql
-- atlas:import ../manufacturing.sql
-- atlas:import production_lines.sql
-- atlas:import ../types/enum_quality_status.sql
-- atlas:import ../../public/tables/projects.sql
-- atlas:import ../../public/tables/users.sql

-- A production run tracks a single batch on one production line.
-- Runs with end_time = NULL are still in progress and should be
-- excluded from yield calculations.
CREATE TABLE "manufacturing"."production_runs" (
  "id" serial NOT NULL,
  "production_line_id" integer NOT NULL,
  "project_id" integer NULL,
  "assigned_user_id" integer NULL,
  "product_code" character varying(50) NOT NULL,
  "batch_number" character varying(100) NOT NULL,
  "planned_quantity" integer NOT NULL,
  "actual_quantity" integer NULL,
  "quality_status" "manufacturing"."quality_status" NOT NULL DEFAULT 'pending',
  "start_time" timestamptz NULL,
  "end_time" timestamptz NULL,
  "yield_percentage" numeric(5,2) NULL,
  PRIMARY KEY ("id"),
  CONSTRAINT "production_runs_production_line_id_fkey"
    FOREIGN KEY ("production_line_id")
    REFERENCES "manufacturing"."production_lines" ("id"),
  CONSTRAINT "production_runs_project_id_fkey"
    FOREIGN KEY ("project_id")
    REFERENCES "public"."projects" ("id"),
  CONSTRAINT "production_runs_planned_quantity_check"
    CHECK (planned_quantity > 0),
  CONSTRAINT "production_runs_time_order"
    CHECK ((end_time IS NULL) OR (start_time IS NULL)
      OR (end_time > start_time))
);

From this single file, an AI agent learns the exact column names and types, that production_line_id references manufacturing.production_lines, that project_id crosses into the public schema, that quantity must be positive, and that end_time being NULL means a run is still in progress.

When the agent sees a custom type like "manufacturing"."quality_status", the import directive -- atlas:import ../types/enum_quality_status.sql tells it exactly where the valid values are defined.
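Following that import leads to a file like the sketch below. The exact set of values is illustrative: 'pending' appears as the column default above and 'rework' in the business-context example earlier, while the other labels are assumptions:

```sql
-- sql/schemas/manufacturing/types/enum_quality_status.sql (illustrative)
-- atlas:import ../manufacturing.sql
CREATE TYPE "manufacturing"."quality_status" AS ENUM (
  'pending',
  'passed',
  'failed',
  'rework'
);
```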

Also, since these are regular source files (not database metadata), you can add code comments to provide business context. Unlike COMMENT ON statements, these don't require a migration to update and won't clutter your database catalog.

Import Directives as Dependency Graphs

The -- atlas:import lines at the top of each file declare dependencies:

-- atlas:import ../types/enum_quality_status.sql   -- needs the quality_status enum
-- atlas:import ../../public/tables/projects.sql   -- cross-schema FK to projects
-- atlas:import ../../public/tables/users.sql      -- cross-schema FK to users

This provides the relationship graph without requiring view, function body, trigger, or stored procedure parsing. When asked "what data is related to production runs?", the imports immediately show: production lines, quality status enum, projects, and users.

info

For more on organizing schema files with imports, see the SQL Schema documentation.

Benchmark

We ran three AI agents (all running Claude Opus 4.6) against the same PostgreSQL database with the same 10 questions. Each agent had a different level of schema context.

Setup

  • Database: PostgreSQL 16, 43 tables across 2 schemas (public and manufacturing), 47 enum types, 64 functions, 13 views
  • Agent A (no context): given only the postgres:// connection string
  • Agent B (single file): given the full schema DDL (3,615 lines) as a single SQL file exported by atlas schema inspect
  • Agent C (schema directory): given the organized one-file-per-object directory with code comments explaining business rules
  • Questions: 10 analytical queries ranging from simple lookups to cross-schema joins with business logic

Results

| Metric | Agent A (No Context) | Agent B (Single File) | Agent C (Schema Directory) |
| --- | --- | --- | --- |
| Correct answers | 7/10 | 9/10 | 10/10 |
| Silent wrong answers | 2 | 0 | 0 |
| Schema discovery queries | 10 | 0 | 0 |
| Schema files read | 0 | 1 (3,615 lines) | 20 of ~150 files |
| Total tokens consumed | 40,585 | 71,754 | 45,184 |
| Total tool calls | 31 | 20 | 37 |
| Duration | 189s | 174s | 149s |

Agent A ran 10 introspection queries before writing a single answer: listing tables, enumerating 226 enum values, mapping columns across schemas, and tracing 67 foreign key relationships.

Agent B loaded the entire 3,615-line DDL into context, consuming the most tokens of all three. Agent C read only the 20 files relevant to its questions and used 37% fewer tokens.

Where No-Context Went Wrong

All three agents produced valid SQL that executed without errors. The differences were semantic: queries that return plausible but incorrect data. Below are examples where the agents with less context got wrong results while Agent C answered correctly.

Question 2: Production yield for WDG-100

One production run is still in progress (actual_quantity is NULL, quality_status is 'pending'). It should be excluded from yield calculations.

-- Agent A (no context): included incomplete run → 77.14% (WRONG)
SELECT ROUND(100.0 * SUM(actual_quantity) / SUM(planned_quantity), 2)
FROM manufacturing.production_runs
WHERE product_code = 'WDG-100';

-- Agent C (schema directory): filtered to completed runs → 93.55% (CORRECT)
-- Read the comment: "Runs with end_time = NULL are still in progress
-- and should be excluded from yield and efficiency calculations."
SELECT ROUND(AVG(yield_percentage), 2)
FROM manufacturing.production_runs
WHERE product_code = 'WDG-100'
AND end_time IS NOT NULL;

A difference of more than 16 percentage points is large enough to trigger false alarms on a production dashboard. Agent C read the code comment in production_runs.sql and applied the filter. Agent A had no way to know this convention.

Question 3: Supplier tier classification

All three agents found the same 2 at-risk suppliers, but Agent A guessed the tier thresholds:

-- Agent A: guessed thresholds
WHEN quality_rating >= 4.0 AND delivery_rating >= 4.0 THEN 'preferred'
WHEN quality_rating >= 3.0 AND delivery_rating >= 3.0 THEN 'approved'

-- Agents B and C: extracted from supplier_performance_scorecard view
WHEN quality_rating >= 4.5 AND delivery_rating >= 4.5 THEN 'preferred'
WHEN quality_rating >= 3.5 AND delivery_rating >= 3.5 THEN 'approved'

With different data, Agent A's thresholds would classify suppliers into the wrong tiers.

Question 4: Production line efficiency

Agent A counted 6 runs for Widget Line A with a 66.67% pass rate. Agents B and C counted 5 runs with 80.00%. Same root cause: Agent A included a run where end_time is NULL (still running).
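A sketch of the corrected query, using the tables from this guide's example schema (the 'passed' status value and the name column on production_lines are assumptions):

```sql
SELECT pl.name,
       COUNT(*) AS completed_runs,
       ROUND(100.0 * COUNT(*) FILTER (WHERE pr.quality_status = 'passed')
             / COUNT(*), 2) AS pass_rate
FROM manufacturing.production_runs pr
JOIN manufacturing.production_lines pl
  ON pl.id = pr.production_line_id
WHERE pr.end_time IS NOT NULL  -- exclude in-progress runs, per the schema comment
GROUP BY pl.name;
```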

Why the Directory Approach Wins

The schema directory agent scored 10/10 while using 37% fewer tokens and finishing faster than the single-file agent. Context was necessary for correct answers, but the right structure also sped up context retrieval:

  1. Code comments as business rules. A two-line comment in production_runs.sql prevented the errors that tripped up the no-context agent. Unlike COMMENT ON statements, these comments live in source files and don't require migrations to update.

  2. Selective file reading. Instead of loading 3,615 lines into context, the agent read only the 20 files relevant to its 10 questions. For larger schemas that exceed context limits, this selective reading is the only option that works.

  3. Enums as vocabulary. Enum types tell the agent which values are valid. Without the equipment_status enum definition, the agent guesses (WHERE status = 'broken') or has to query pg_enum and pg_type at runtime. With the enum file, it uses the correct values immediately.

  4. Views as pre-answered questions. The daily_production_summary materialized view answers "what happened on the production floor today?" with yields, quality counts, and run durations already computed. An agent with this view in context queries it directly instead of writing a complex aggregation from scratch.

  5. Functions as business logic. The equipment_health_dashboard view references calculate_oee, a function that computes Overall Equipment Effectiveness (availability × performance × quality). Without its definition, the agent guesses the formula. With it, the function body and comments explain the classification thresholds. Atlas also provides a testing framework for database logic, giving agents context on expected behavior and edge cases without executing anything.

The no-context agent reported all answers as correct. Every query ran without errors and returned plausible results. But two of them were semantic errors: valid SQL that silently returns wrong data.

Providing Schema Context

1. Store Schema as Code

AI tools can query the database schema at runtime using atlas schema inspect with the --include flag to retrieve specific parts. But storing the schema as version-controlled files means agents can find them using the same tools they use for any code: file search (grep, glob), vector embeddings (Cursor's codebase indexing), or agentic exploration (Claude Code's read-only sub-agents). No live database connection needed, and you can add code comments and test files alongside the DDL.

To get started, export your existing database into a structured, context-rich schema directory using atlas schema inspect:

atlas schema inspect -u "postgres://localhost:5432/mydb" --format '{{ sql . | split | write "sql" }}'

See the Export Database to Code guide for a full walkthrough.

2. Use One File Per Object

A single migration file with thousands of lines is poor context. Organized files are rich context:

sql/schemas/manufacturing/tables/equipment.sql
sql/schemas/manufacturing/types/enum_equipment_status.sql
sql/schemas/manufacturing/functions/calculate_oee.sql

An agent working on a manufacturing query can read just the relevant files instead of parsing the entire schema.

3. Use Import Directives

Atlas -- atlas:import directives declare dependencies between files:

-- atlas:import ../types/enum_quality_status.sql
-- atlas:import ../../public/tables/projects.sql

These are automatically generated when you run atlas schema inspect. If you maintain your schema files manually, you can add import directives yourself to keep the dependency graph accurate. Either way, an AI agent following imports can discover the full context chain for any object without scanning the entire schema.

4. Include Functions and Views

Functions encode business logic. Views encode common query patterns. Both prevent the agent from reinventing complex calculations:

-- Instead of writing a 40-line OEE calculation:
SELECT manufacturing.calculate_oee(3, '2026-03-01');

-- Instead of a complex production aggregation:
SELECT * FROM manufacturing.daily_production_summary
WHERE production_date = CURRENT_DATE;

Use Atlas's testing framework to document expected behavior. Tests tell agents what a function does more clearly than comments:

schema.test.hcl
test "schema" "oee_world_class" {
  exec {
    sql    = "SELECT (manufacturing.calculate_oee(1, '2025-03-01'))->>'classification'"
    output = "world_class"
  }
}

5. Use Self-Explanatory Constraint Names

Named CHECK constraints are machine-readable business rules:

CONSTRAINT "projects_budget_spent_valid"
  CHECK (budget_spent >= 0
    AND (budget_allocated IS NULL
         OR budget_spent <= budget_allocated * 1.1))

This tells the agent that budget overspend up to 10% is allowed, without requiring a comment.

Getting Started

Export an existing database to organized schema files:

# Install Atlas
curl -sSf https://atlasgo.sh | sh

# Export to a single SQL file
atlas schema inspect -u $DATABASE_URL --format '{{ sql . }}' > schema.sql

# Export to organized one-file-per-object structure
atlas schema inspect -u $DATABASE_URL --format '{{ sql . | split | write "sql" }}'

The second command generates the file structure described in this guide: one file per table, enum, function, and view, with -- atlas:import directives declaring dependencies between them.

Once the files are in your repository, every AI tool that reads your codebase has access to the full schema context.

Further reading