
The Missing Chapter in the Platform Engineering Playbook

· 11 min read
Rotem Tamir
Building Atlas

Prepared for SREDay London 2025

Introduction

Platform engineering is rapidly emerging as a discipline aimed at reducing cognitive load for developers, enabling self-service infrastructure, and establishing best practices for building and operating software at scale. While much of the conversation focuses on CI/CD, Kubernetes, and internal developer platforms, one crucial aspect often remains overlooked: database schema management.

Despite being at the heart of nearly every application, schema changes are still a major source of friction, outages, and bottlenecks. In this post, we'll explore why database schema management deserves a dedicated chapter in the platform engineering playbook and how organizations can integrate it into their platform strategies.

The prompt that nuked the database

Let me tell you a not-so-fictional story about a developer named Alice. Alice is a backend engineer at a fast-growing startup. One day, her manager asked her to make a small change to the database. The data engineering team was complaining that they were seeing duplicate emails in the user table, and they suspected that the email column did not have a unique constraint.

Busy as she was working on the next release, Alice asked her trusty GitHub Copilot to assist her:

Alice: Hey Copilot, can you add a unique constraint to the email column in the users table?

Alice's team is using Sequelize, a popular ORM for Node.js, to manage their database schema. Copilot zipped through and added the constraint to the model:

  email: {
    allowNull: false,
    type: DataTypes.STRING,
    validate: {
      isEmail: true
    },
+   unique: true,
  },

After adding the constraint, Alice knew she had to keep her database schema in sync with the application code. To do this, she needed to add a schema migration that would add the constraint to the database. Confident in her co-pilot's coding skills, she asked it to generate the migration for her:

Alice: Hey Copilot, can you please add the migration for this change?

Copilot generated:

'use strict';

module.exports = {
  async up(queryInterface, Sequelize) {
    await queryInterface.changeColumn('users', 'email', {
      type: Sequelize.STRING,
      allowNull: false,
      unique: true // 👉 This adds the UNIQUE constraint
    });
  },

  async down(queryInterface, Sequelize) {
    // redacted
  }
};

Alice ran the migration on her local machine, and everything seemed fine. She tested the code on the staging environment, ran the migration, and everything seemed fine there as well. Alice's team reviewed the code, which seemed simple and straightforward, and approved the PR.

Confidently, she merges the code to the main branch and deploys it to production.

And then... the API starts timing out. User signups and logins grind to a halt. CPU usage on the database spikes. Even unrelated queries on the users table start hanging.

The Problem: Unique Constraint Validation & Table Locking

Why did this happen?

Before enforcing the constraint, PostgreSQL must verify that no duplicate emails exist in the table. If the table is large, this validation can take minutes or even hours. During this time, PostgreSQL cannot allow queries to read from or write to the table.

To do this, it acquires an ACCESS EXCLUSIVE lock on users while validating. This means no reads or writes can happen on users until validation is complete. To make things worse, if existing transactions are already holding locks on the table, your migration waits indefinitely, blocking everything else behind it.

In other words, to guarantee the integrity of the unique constraint, PostgreSQL must "stop the world" for our users table and perform a full table scan, which can take quite a while. If this is a busy table (like the users table usually is), this means that requests will start piling up, draining resources from your database and servers and making things even worse.
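On PostgreSQL, a common way to avoid this lock is to build the unique index concurrently and only then attach the constraint to it. The statements below are a minimal sketch of that pattern (the index name is illustrative), not the migration Alice's team actually ran:

-- Build the index without blocking reads or writes.
-- Note: this cannot run inside a transaction, and if duplicate emails already
-- exist it fails and leaves an INVALID index that must be dropped.
CREATE UNIQUE INDEX CONCURRENTLY users_email_uq ON users (email);

-- Attach the constraint using the pre-built index; this needs only a brief lock.
ALTER TABLE users ADD CONSTRAINT users_email_uq UNIQUE USING INDEX users_email_uq;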

Post-mortem: Why did this really happen?

When something goes wrong, it's easy to blame the developer. But the truth is, Alice did everything right. She followed the best practices, tested her code, and reviewed it with her team. If the problem wasn't with Alice, where was it?

One technique for getting to the root cause of such a severe incident is the "5 Whys" technique. Let's apply it here:

  1. Why did the API start timing out? Because the migration locked the users table.
  2. Why did the migration lock the users table? Because our developers shipped broken code.
  3. Why did our developers ship broken code? Because it was approved by our quality assurance process.
  4. Why was it approved by our quality assurance process? Because we rely on manual code reviews and the reviewer didn't know about the issue.
  5. Why do we rely on manual code reviews? ... ?

Why do Platforms ignore Schema Management?

The incident with Alice is not unique. Schema changes are a common source of outages, and yet, they are often overlooked in platform engineering discussions. Why?

I've interviewed dozens of platform engineers over the past few years, and the answer is oftentimes: "Schema migrations are an application concern, not a platform concern."

But is this really true?

Migrations are risky business

Even with the best intentions, errors can still arise during the development and review of migrations, leading to downtime and data loss. Some of the failure modes that should be considered include:

  • Destructive changes - The SQL standard includes DROP commands that can remove tables, columns, and other database objects. While powerful, these commands can lead to data loss if not used carefully. Since most databases do not feature a built-in "undo" button, it's crucial to ensure that these commands are executed with caution.
  • Breaking changes - Schema changes can introduce breaking changes to the application code. For example, if a column is renamed or its type is changed, the application code may need to be updated to reflect these changes. This can lead to runtime errors if not handled properly.
  • Constraint violations - Changing constraints on a table can lead to constraint violations if the existing data does not meet the new constraints. For example, adding a NOT NULL constraint to a column that already contains null values will cause the migration to fail (see the sketch right after this list). This is especially problematic since these errors are typically not detected in development and staging environments (where datasets are often limited and less diverse), but only in production.
  • Table locks - Some changes, as we saw in Alice's case, require the database to acquire exclusive locks on the table, preventing any reads or writes during the migration. This can lead to downtime and degraded performance, especially for large tables.
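
A minimal sketch of the NOT NULL failure mode described above, using a hypothetical phone column (the column name and the backfill value are assumptions, not part of the incident story):

-- Fails immediately if any existing row has a NULL phone:
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;

-- A safer sequence: backfill first, then add the constraint.
UPDATE users SET phone = '' WHERE phone IS NULL;
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;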

For these reasons, schema changes should be treated with the same level of care and attention as application code changes. They should be tested, reviewed, and validated automatically before being deployed to production.

It's about to get worse: What GenAI means for databases

The rise of Generative AI (GenAI) is changing the way we write code and interact with databases. With tools like GitHub Copilot and ChatGPT, developers are increasingly relying on AI to assist them in writing code, including database queries and schema changes. Let's reflect on some of the implications of this shift:

  • A lot more code, with less attention - As developers increasingly rely on AI to generate code, the amount of code being written is skyrocketing. LLMs are very adept at generating text that looks sensible but isn't always.

    On one extreme, this leads to hallucinations - where the AI generates code that is actually nonsensical. On the other extreme, it leads to unintended consequences - where the AI generates code that is syntactically correct, but semantically incorrect or has unintended side effects, as we saw with Alice's migration post-mortem.

    Since the code that LLMs produce will often look right, it can be particularly tricky to catch issues with normal, manual code reviews. This is especially true for database migrations, where the LLM may generate functionally correct code that does not account for the safety implications of the change.

  • Skill-level needed to build apps is going down - As GenAI tools become more prevalent, we can expect to see a wider range of people writing code, including non-engineers. With the advent of "vibe coding" - the practice of using natural language prompts to iteratively build applications - the bar for writing code is dropping.

    Whether it's internal tools or just use-case specific SQL queries, the barrier to entry for writing code is dropping quickly. If you can write a prompt, you can write code. This is great for democratizing access to building software, but it also means we will soon have people making changes who are not just non-experts in databases, but who lack the technical background to understand the potential implications of their changes.

  • Agents are creating new pressure - As organizations start to provide users with means to create agentic workflows that interact with the company's data assets and APIs, we can expect to see quickly changing access patterns to databases. Agents are not bound by the same constraints as humans, and they can generate and execute queries at a much faster rate. This can lead to increased contention, performance degradation, and even outages if the database is not able to keep up with the increased load.

    An interesting implication of this is that even legacy applications that have not been touched in years can suddenly find themselves on the "hot path" of someone's agentic workflow.

    Contrary to "normal" application development, where queries are planned and discussed by architects and developers, agentic workflows by definition exhibit emergent behavior. Organizations will need to be able to quickly identify and adapt to these changes in order to maintain performance and reliability.

Case study: Unico's Database CI/CD Pipeline

As we consider whether schema management is a platform concern, let's look at a real-world example. Unico, a leading digital identity provider in Brazil, was dealing with similar challenges. Having suffered multiple outages that stemmed from database schema changes, they decided to take action.

Unico's Platform Engineering team, led by Luiz Casali, recognized that while they had streamlined CI/CD workflows for application code and infrastructure, database schema changes were still a major gap. These changes were often performed manually, leading to increased risk, unreliable deployments, and compliance headaches. Some teams were using tools that provided some automation, but they were complex and required specialized knowledge.

To address these challenges, the team sought a solution that would automate database schema management while integrating seamlessly into their existing workflows. After evaluating multiple tools, they chose Atlas, a schema-as-code solution that brought the declarative mindset of Terraform to databases.

Atlas provided Unico with multiple benefits:

  • Automatic migration planning - Atlas automatically plans the migration for developers, calculating the diff between their code and the database schema.
  • Automated verification - Atlas comes with a migration analysis tool that verifies the migration plan before applying it to the database, catching issues before they reach production. This step was integrated with Unico's standard CI/CD pipeline, ensuring that every migration was verified before deployment.
  • Automated deployment - Atlas integrates with Unico's deployment tooling, including complex cases like automated rollbacks, completely removing the need for manual intervention.
  • Compliance and governance - Atlas automatically generates documentation for every migration, providing a clear audit trail for compliance and governance purposes. Additionally, Atlas provides schema monitoring capabilities, allowing Unico to track changes to their database schema over time and make sure that no unexpected changes are introduced.

Recognizing the importance of schema management, Unico's Platform Engineering team made it a core part of their platform strategy. As a result of this change, Unico saw a significant reduction in outages, improved developer productivity, and a more reliable platform.

This success reinforced the decision to adopt Atlas:

"Month over month, we see smaller and smaller incidents.

โ€” Luiz Casali, Senior Engineering Manager

Unico's experience highlights the clear benefits of treating schema management as a platform concern:

✅ Automatic Migration Planning — No DB expertise needed
✅ Automated Code Review — No surprises in production
✅ Automated Deployment — No more human errors
✅ Everything is Audited — Easy compliance

Why Schema Management is a Platform Concern

Schema management should be a core platform responsibility, yet it is often overlooked. Let's break it down:

  • Reducing Cognitive Load – Developers shouldn't have to master database internals. Platforms should provide declarative workflows and automation, letting engineers focus on features, not migrations.
  • Increasing Reliability – Schema changes cause outages when done manually. Automated validation in CI/CD catches issues early, ensuring safe, predictable deployments.
  • Boosting Velocity – When schema changes are easier and safer, teams ship more of them. Instead of fearing database updates, engineers iterate confidently, reducing bottlenecks.

Let's stop treating schema management as if it's optional and treat it as a first-class citizen in our platform strategies.

Conclusion

Database schema management can no longer be just an "application concern" — it's becoming a platform responsibility. As engineering teams scale, schema changes must be automated, validated, and integrated into CI/CD workflows to reduce risk and improve developer experience. With GenAI accelerating change and reshaping who writes code and how, the need for structured, platform-driven schema management has never been greater.

Atlas v0.32: Ask AI, SQL Imports, and More

· 11 min read
Rotem Tamir
Building Atlas

Hey everyone!

It's been a few weeks since our last release, and we're excited to share today everything that's new in Atlas v0.32. This release is packed with new features, improvements and bug fixes that will make your experience with Atlas even better.

Here are the highlights of this release:

  • Ask AI - Since its modest beginning, Atlas has come a long way. What started as a simple CLI tool for declarative schema management is now a full-blown platform. We know that the learning curve for new users can be steep, which is why we are introducing new AI-powered features to help you get started with Atlas.
  • SQL Importing - As projects grow, teams often want to split their schema definition across multiple files. Because SQL definitions are imperative and rely on the order of statements, splitting them can be challenging. With the new importing feature it's easy to break large SQL schema definitions into smaller parts while keeping them correct and ordered.
  • Improved Ent Loader - Users of the popular Ent ORM can use the ent:// URL scheme to load their schema into Atlas. We have added support for multi-schema migrations, composite schemas, and Ent's globalid feature.
  • SQL Server Improvements - We have made several improvements to the SQL Server support in Atlas, including support for Full Text Search Index and Temporal Tables.
  • PostgreSQL Improvements - We have added support for defining Foreign Servers and Unlogged Tables in PostgreSQL.

To download and install the latest release of the Atlas CLI, simply run the following in your terminal:

curl -sSf https://atlasgo.sh | sh

Ask AI

Atlas has always been about making schema management easier for developers. We know that the learning curve for new users can be steep, which is why we are introducing new AI-powered features to help you get started with Atlas.

Ask the Docs

As you may have noticed, we recently added an "Ask AI" button to the top navigation bar. When clicking on it, users can use a familiar chat-based interface to ask questions about Atlas. The answers are generated by an AI assistant that has access to the entire Atlas documentation.

The feature is still in its early stages, so we would appreciate it if you could leave us feedback about any improvements we could make via the built-in feedback buttons or on our Discord server.

Ask the CLI

To help you stay in the flow while using the CLI, we have added a new command: atlas ask. Use this command whenever you encounter an error in the CLI. The AI assistant will analyze the error message and attempt to provide a solution.

For example, suppose you encounter an error like this:

You can run atlas ask and the AI assistant will provide you with a solution:

  You attempted to generate a migration using the  atlas migrate diff  command
with --dev-url pointing to a PostgreSQL database. However, the process failed
with the error:

sql/sqlimport: taking database snapshot before computing module state:
sql/migrate: connected database is not clean: found table "docs" in schema
"public"

### Why This Happens

The atlas migrate diff command requires the dev-database (specified by --dev-url)
to be a temporary, empty database. Atlas uses this dev-database to compute schema differences.
Your error suggests that the database you specified with --dev-url is not empty but contains
the table docs, which is causing Atlas to stop execution.

### How to Fix It

< redacted ... >

SQL Imports

As projects grow, teams often want to split their schema definition across multiple files. Because SQL definitions are imperative and rely on the order of statements, splitting them can be challenging. To enable teams to logically split their schema definitions into smaller parts while keeping them correct and ordered, we have added a new importing feature.

Suppose you are maintaining an ecommerce platform where your database schema is split into two logical schemas: one for tracking customer and order data named crm and another for tracking back-office operations named backoffice. The schemas are largely separate; however, they share some common logic, such as domain types.

Using the new importing feature, you can create a project structure like this:

.
├── backoffice
│   └── tables.sql
├── common.sql
├── crm
│   └── tables.sql
└── main.sql

Common objects can reside in the common.sql file:

common.sql
CREATE SCHEMA common;

CREATE DOMAIN common.person_id AS VARCHAR
CHECK (VALUE ~* '^[A-Za-z0-9_]{6,20}$');

This file defines the person_id domain type, used for usernames, enforcing a basic format of letters, numbers, and underscores with a length of 6-20 characters.

Then, for each schema, you can define the schema-specific objects in separate files, but note how we use atlas:import to tell Atlas that they depend on the common.sql file:

crm/tables.sql
-- atlas:import ../common.sql

CREATE SCHEMA crm;

CREATE TABLE crm.customers
(
id serial PRIMARY KEY,
full_name varchar NOT NULL,
username common.person_id UNIQUE NOT NULL
);

This file defines the customers table for tracking customer data and uses the person_id domain, which is defined in common.sql. Next, we define the employees table for tracking employee data in the backoffice schema:

backoffice/tables.sql
-- atlas:import ../common.sql

CREATE SCHEMA backoffice;

CREATE TABLE backoffice.employees
(
id serial PRIMARY KEY,
full_name text NOT NULL,
position text NOT NULL,
username common.person_id UNIQUE NOT NULL
);

Finally, we create a main.sql that stitches the schemas together:

main.sql
-- atlas:import backoffice/
-- atlas:import crm/

We can now use the main.sql file as the entry point for our schema definition in our Atlas project:

atlas.hcl
env {
  src  = "file://main.sql"
  dev  = "docker://postgres/17/dev"
  name = atlas.env
}

If we apply this schema to a PostgreSQL database, Atlas will properly order the schema definitions using topological sort to ensure semantic correctness:

atlas schema apply --env local

Atlas computes the plan and asks for confirmation before applying the changes:

Planning migration statements (6 in total):

-- add new schema named "backoffice":
-> CREATE SCHEMA "backoffice";
-- add new schema named "common":
-> CREATE SCHEMA "common";
-- add new schema named "crm":
-> CREATE SCHEMA "crm";
-- create domain type "person_id":
-> CREATE DOMAIN "common"."person_id" AS character varying CONSTRAINT "person_id_check" CHECK ((VALUE)::text ~* '^[A-Za-z0-9_]{6,20}$'::text);
-- create "employees" table:
-> CREATE TABLE "backoffice"."employees" (
"id" serial NOT NULL,
"full_name" text NOT NULL,
"position" text NOT NULL,
"username" "common"."person_id" NOT NULL,
PRIMARY KEY ("id"),
CONSTRAINT "employees_username_key" UNIQUE ("username")
);
-- create "customers" table:
-> CREATE TABLE "crm"."customers" (
"id" serial NOT NULL,
"full_name" character varying NOT NULL,
"username" "common"."person_id" NOT NULL,
PRIMARY KEY ("id"),
CONSTRAINT "customers_username_key" UNIQUE ("username")
);

-------------------------------------------

Analyzing planned statements (6 in total):

-- no diagnostics found

-------------------------
-- 96.561ms
-- 6 schema changes

-------------------------------------------

? Approve or abort the plan:
  ▸ Approve and apply
    Abort

Notice a few interesting things:

  1. The common.sql file is imported only once, even though it is imported by both crm/tables.sql and backoffice/tables.sql.
  2. The schema definitions are ordered correctly, with the common schema being created first, followed by the crm and backoffice schemas.
  3. We did not need to think about the order of the statements in our project, as Atlas took care of that for us.

We hope this new feature will make it easier for you to manage large schema definitions in Atlas.

Ent Loader

Up until this release, the ent:// URL scheme, used to load Ent projects into Atlas, could not be used within a composite schema or multi-schema setup if globally unique IDs were enabled.

The Ent project recently added the ent schema command, which means that Ent now fully supports the Atlas External Schema spec, and thus the ent:// URL scheme can now be used without limitations. For example:

data "composite_schema" "ent_with_triggers" {
schema "ent" {
url = "ent://entschema?globalid=static"
}
# Some triggers to be used with ent.
schema "ent" {
url = "file://triggers.my.hcl"
}
}

env {
name = atlas.env
src = data.composite_schema.ent_with_triggers.url
dev = "docker://mysql/8/ent"
migration {
dir = "file://ent/migrate/migrations"
}
}

With the recent changes, you can now use the ent:// URL scheme in a composite schema or multi-schema setup, even with globally unique IDs enabled.

SQL Server

We have made several improvements to the SQL Server support in Atlas, including support for Full-Text Search Indexes and Temporal Tables. Full-Text Search in SQL Server allows efficient querying of large text-based data using an indexed, tokenized search mechanism. To define a Full-Text Index in Atlas HCL, you can use the following syntax:

table "t1" {
schema = schema.dbo
column "c1" {
null = false
type = int
}
column "c2" {
null = true
type = varbinary(-1)
}
column "c3" {
null = true
type = varchar(3)
}
column "c4" {
null = true
type = image
}
column "c5" {
null = true
type = varchar(3)
}
primary_key {
columns = [column.c1]
}
index "idx" {
unique = true
columns = [column.c1]
nonclustered = true
}
fulltext {
unique_key = index.idx
filegroup = "PRIMARY"
catalog = "FT_CD"
on {
column = column.c2
type = column.c3
language = "English"
}
on {
column = column.c4
type = column.c5
language = "English"
}
}
}
schema "dbo" {
}

In this example, the t1 table is defined with a fulltext block that specifies the columns to be indexed and the language to use.

For more examples, see the Atlas Docs.

Temporal Tables

SQL Server Temporal Tables automatically track historical changes by maintaining a system-versioned table (current data) and a history table (previous versions with timestamps). This allows for time travel queries (FOR SYSTEM_TIME), making it useful for auditing, data recovery, and point-in-time analysis. Here's an example of how to define a Temporal Table using Atlas HCL:

schema "dbo" {}
table "t1" {
schema = schema.dbo
column "c1" {
type = int
null = false
}
column "c2" {
type = money
null = false
}
column "c3" {
type = datetime2(7)
null = false
generated_always {
as = ROW_START
}
}
column "c4" {
type = datetime2(7)
null = false
generated_always {
as = ROW_END
}
}
primary_key {
on {
column = column.c1
desc = true
}
}
period "system_time" {
type = SYSTEM_TIME
start = column.c3
end = column.c4
}
system_versioned {
history_table = "dbo.t1_History"
retention = 3
retention_unit = MONTH
}
}

In this example, the t1 table is defined with a system_time period and a system_versioned history table. The system_versioned block specifies the history table name, retention period, and retention unit.
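For context, once a table is system-versioned, SQL Server lets you query historical states directly with FOR SYSTEM_TIME. The query below is a small illustrative sketch against the t1 table defined above (the timestamp is made up):

-- Read the row versions that were current at a specific point in time.
SELECT c1, c2
FROM dbo.t1
FOR SYSTEM_TIME AS OF '2025-01-01T00:00:00'
WHERE c1 = 1;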

PostgreSQL

Following requests from the Atlas community, we have added support for a few lesser-known PostgreSQL features. Here they are:

Foreign Servers

Foreign Servers in PostgreSQL allow you to access data from other databases or servers. They are used in conjunction with Foreign Data Wrappers (FDWs) to provide a unified view of data from multiple sources. Here's an example of how to define a Foreign Server using Atlas HCL:

extension "postgres_fdw"  {
schema = schema.public
}
server "test_server" {
fdw = extension.postgres_fdw
comment = "test server"
options = {
dbname = "postgres"
host = "localhost"
port = "5432"
}
}

This example defines a Foreign Server named test_server that connects to a PostgreSQL database running on localhost on port 5432.
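Defining the server is only half the story: to query remote data you typically also create a user mapping and a foreign table. The statements below are a minimal, hypothetical sketch of how the test_server above would be used (the credentials and the remote users table are assumptions):

-- Map the local role to credentials on the remote server.
CREATE USER MAPPING FOR CURRENT_USER
  SERVER test_server
  OPTIONS (user 'postgres', password 'secret');

-- Expose a remote table locally through the foreign server.
CREATE FOREIGN TABLE remote_users (
  id    int,
  email text
)
  SERVER test_server
  OPTIONS (schema_name 'public', table_name 'users');

-- Query it like any local table.
SELECT count(*) FROM remote_users;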

Unlogged Tables

In PostgreSQL, unlogged tables are a special type of table that does not write data to the WAL (Write-Ahead Log), making them faster but less durable. To use unlogged tables in Atlas, you can define them like this:

table "t1" {
schema = schema.public
unlogged = true
column "a" {
null = false
type = integer
}
primary_key {
columns = [column.a]
}
}
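
For reference, the definition above boils down to a single plain-SQL statement; this is a rough sketch of the equivalent DDL rather than Atlas's exact output:

-- Unlogged tables skip WAL writes: faster, but truncated after a crash
-- and not replicated to streaming replicas.
CREATE UNLOGGED TABLE t1 (
  a integer NOT NULL,
  PRIMARY KEY (a)
);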

WITH NO DATA Diff Policy

When creating a materialized view in PostgreSQL, you can use the WITH NO DATA clause to create the view without populating it with data. This can be useful when you want to create the view first and populate it later. To use this feature in Atlas, you can utilize the with_no_data option in the diff block like this:

diff {
  materialized {
    with_no_data = var.with_no_data
  }
}
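
In plain SQL terms, this toggles whether the generated statement ends with WITH NO DATA. A minimal sketch of the pattern, with illustrative object names:

-- Create the view definition without running the (potentially expensive) query.
CREATE MATERIALIZED VIEW daily_stats AS
  SELECT day, count(*) AS events
  FROM raw_events
  GROUP BY day
WITH NO DATA;

-- Populate it later, e.g. during off-peak hours.
REFRESH MATERIALIZED VIEW daily_stats;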

Wrapping Up

We hope you enjoy the new features and improvements. As always, we would love to hear your feedback and suggestions on our Discord server.

Schema monitoring for ClickHouse using Atlas

· 5 min read
Rotem Tamir
Building Atlas

Automatic ER Diagrams and Docs for ClickHouse

When working with a relational database like ClickHouse, understanding the database schema becomes essential for many functions in the organization. Who cares about the schema? Almost everyone who interacts with your data:

  • Software engineers and architects use knowledge about the schema to make design decisions when building software.
  • Data engineers need to have an accurate understanding of schemas to build correct and efficient data pipelines.
  • Data analysts rely on familiarity with the schema to write accurate queries and derive meaningful insights.
  • DevOps, SREs, and Production Engineers use schema information (especially recent changes to it) to triage database-related production issues.

Having clear, centralized documentation of your database's schema and its changes can be a valuable asset to foster efficient work and collaboration. Knowing this, many teams have developed some form of strategy to provide this kind of documentation:

  • Diagramming tools. Teams use generic diagramming tools like Miro or Draw.io to maintain ER (Entity-Relation) Diagrams representing their database schema. While this is easy to set up, it requires manually updating the documents whenever something changes, often causing documents to go stale and become obsolete.
  • Data modeling tools. Alternatively, teams use database modeling software like DataGrip or DBeaver. While these tools automatically inspect your database, understand its schema, and provide interactive diagrams, they have two main downsides: 1) Since they run locally, they require a direct connection and credentials, introducing a potential security risk; 2) They do not enable any collaboration, discussion, or sharing of information.
  • Enterprise Data Catalogs. Tools like Atlan or Alation provide extensive schema documentation and monitoring; however, they can be quite pricey and difficult to set up.

Enter: Atlas Schema Monitoring

Atlas offers an automated, secure, and cost-effective solution for monitoring and documenting your ClickHouse schema.

With Atlas, you can:

  • Generate ER Diagrams: Visualize your database schema with up-to-date, easy-to-read diagrams.
  • Create Searchable Code Docs: Enable your team to quickly find schema details and usage examples.
  • Track Schema Changes: Keep a detailed changelog to understand what's changed and why.
  • Receive Alerts: Get notified about unexpected or breaking changes to your schema.

All without granting everyone on your team direct access to your production database.

Getting Started

Let's see how to set up Schema Monitoring for your ClickHouse database with Atlas. In this guide, we demonstrate how to run schema monitoring using a GitHub Action, but this can easily be achieved from other CI platforms (such as BitBucket or GitLab).

Prerequisites

  1. A ClickHouse Cloud account with a running service.
  2. An Atlas account (start your 30-day free trial here).
  3. A GitHub repository with permissions to configure GitHub Actions workflows.

1. Create a bot token in Atlas Cloud

Head over to your Atlas Cloud account and click on the top-level Monitoring navigation entry. Choose the GitHub Action card, and click on the Generate Token button. Copy the token.

Next, go to your GitHub repository, open Settings -> Secrets, and add a new secret called ATLAS_CLOUD_TOKEN with the value of the token you just copied.

2. Create a new GitHub Actions Workflow for schema monitoring

Store your database URL as a repository secret named DB_URL.

ClickHouse Cloud URL

This guide assumes your monitored database instance is reachable from your GitHub Actions runner, which is the case (by default) for ClickHouse Cloud-hosted databases.

Atlas uses URLs to define database connection strings (see docs). To connect to your ClickHouse Cloud instance, use this format:

clickhouse://default:<PASS>@<instance ID>.eu-west-2.aws.clickhouse.cloud:9440/default?secure=true

Be sure to use the connection string details specific to your hosted ClickHouse service.

For more options, see the Schema Monitoring Docs.

Next, save the workflow file below as .github/workflows/monitor-schema.yaml in your repository.

Replace the slug with the name you want to give to your database. The slug is used to uniquely identify the database in Atlas Cloud, even when the database URL changes.


name: Atlas Schema Monitoring
on:
  workflow_dispatch:
  schedule:
    - cron: '0 */4 * * *' # every 4 hours
jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - uses: ariga/setup-atlas@v0
      - uses: ariga/atlas-action/monitor/schema@v1
        with:
          cloud-token: ${{ secrets.ATLAS_CLOUD_TOKEN }}
          url: ${{ secrets.DB_URL }}
          slug: my-database

Then, commit and push the changes to your repository.

3. Run the GitHub Action

Once committed, let's run the workflow:

  1. Go to the Actions tab in your repository
  2. Choose the Atlas Schema Monitoring workflow
  3. Click on Run Workflow on the top right corner.

After the workflow finishes, you should see a link to Atlas Cloud where you can view the schema of your database:

4. View the schema in the Atlas UI

Click on the link provided in the logs to view the schema in the Atlas UI.

Amazing! We have set up continuous schema monitoring for our ClickHouse database using Atlas and GitHub Actions. The GitHub Action will run every 4 hours, ensuring that the schema documentation is always up-to-date. You can adjust the schedule to fit your needs or run the workflow manually on demand.

Wrapping Up

We hope you find this new integration useful! As always, we would love to hear your feedback and suggestions on our Discord server.

Additional Reading

The Hidden Bias of Alembic and Django Migrations (and when to consider alternatives)

· 9 min read
Rotem Tamir
Building Atlas

Python has been a top programming language for the past decade, known for its simplicity and rich ecosystem. Many companies use it to build web apps and server software, thanks to frameworks like Django and SQLAlchemy.

One of the most common (and often loathed) tasks when building backend applications is managing the database schema. As the app's data model evolves, developers need to modify the database schema to reflect those changes. This is where database schema migration tools come into play.

Why devs ❤️ Django Migrations and Alembic

As far as migration tools go, SQLAlchemy and Django have both built out robust solutions for managing database schema through Alembic and Django Migrations, which stand out as some of the best in the field. They have both been around for a while, becoming popular due to their code-first approach:

  1. First, developers define their database schema as code through Python classes, which are also used at runtime to interact with the database.
  2. Next, the migration tool automatically generates the necessary SQL to apply those changes to the database.

For most projects, these tools work well and are a great fit, but there are some cases where you should consider looking at a specialized schema management tool. In this article, we'll explore some of the limitations of ORM-based migration tools and present Atlas, a database schema-as-code tool that integrates natively with both Django and SQLAlchemy.

The bias of ORM-based migrations

ORMs are commonly distributed with schema management tools. Without a way to set up the database schema, the ORM cannot function, so each ORM must include something that provides a viable developer experience.

The main purpose of ORMs is to abstract the database layer and deliver a roughly uniform experience across different systems (e.g., PostgreSQL, MySQL, SQL Server, etc.). As an abstraction layer, they tend to concentrate on the shared database features (such as tables, indexes, and columns) rather than on more advanced, database-specific capabilities.

Being ORM maintainers ourselves (the team behind Atlas maintains Ent), we can attest that in our capacity as ORM authors, migrations are seen as a "necessary evil": something we have to ship, but really just an annoying requirement. ORMs exist to bridge code and the database - they are a runtime effort, not a CI/CD or resource management concern.

As such, ORM migration tools tend to be basic, suitable just for the common cases of reading and writing data from tables. In projects that require a more involved schema management process, you might want to consider using a specialized tool like Atlas.

Advanced database features

ORMs scratch the tip of the iceberg of database features

In many systems, the database is viewed simply as a persistence layer, but databases are capable of much more than just storing data. In recent years, there has been a growing trend of utilizing databases for more than just CRUD operations. For example, you might want to use your database for:

  • Stored Procedures, Functions, and Triggers: Logic can be encapsulated in stored procedures or triggers that automatically execute on certain events (e.g., inserts, updates, or deletes). This ensures consistent data validation, auditing, or transformation at the database level.
  • Views and Materialized Views: Views are virtual tables generated from a SELECT query, while materialized views store the query results in a physical table that can be indexed. Materialized views can boost performance for frequently accessed or computationally expensive queries.
  • Custom Data Types: Some databases (e.g., PostgreSQL) allow defining custom data types for domain-specific constraints or storing complex structures that exceed standard built-in types.
  • Extensions: Many databases support extensions that add specialized capabilities. For example, PostGIS (an extension for PostgreSQL) provides advanced geospatial data handling and queries.
  • Row-Level Security (RLS): RLS lets you define policies to control which rows a user can see or modify. This is particularly useful for multi-tenant systems or sensitive data scenarios where granular, row-level permissions are required (a minimal sketch follows this list).
  • Sharding: Sharding distributes data across multiple database nodes (or clusters). This approach can enhance performance, fault tolerance, and scalability, especially in high-traffic, large-volume applications.
  • Enumerated Types (ENUM): ENUM types allow you to define a constrained list of valid values for a column (e.g., "small", "medium", "large"). This can help enforce data consistency and prevent invalid entries.
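
To make the row-level security item above concrete, here is a minimal, hypothetical PostgreSQL sketch (table, column, and setting names are made up) of the kind of object that usually lives outside an ORM's model layer:

-- Restrict each tenant to its own rows; ORM models typically cannot express this.
CREATE TABLE orders (
  id        serial PRIMARY KEY,
  tenant_id text NOT NULL,
  total     numeric NOT NULL
);

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.current_tenant'));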

Where Atlas comes in

ORMs typically do not provide a way to manage these advanced database features.

Using Atlas, ORM users can keep using their favorite ORM (e.g., SQLAlchemy) but also extend their data model with advanced database features. This is done by utilizing composite_schema, a feature that allows you to define your schema in multiple parts, each part using a different schema source. For example:

data "external_schema" "sqlalchemy" {
program = [
"atlas-provider-sqlalchemy",
"--path", "./models",
"--dialect", "postgresql"
]
}

data "composite_schema" "example" {
// First, load the schema with the SQLAlchemy provider
schema "public" {
url = data.external_schema.sqlalchemy.url
}
// Next, load the additional schema objects from a SQL file
schema "public" {
url = "file://extra_resources.sql"
}
}

env "local" {
src = data.composite_schema.example.url
// ... other configurations
}

In this example, we define a composite schema that combines the SQLAlchemy schema with additional schema objects loaded from a SQL file. This allows you to use the full power of your database while still benefiting from the convenience of ORMs.

Using composite schemas, we can use SQLAlchemy to define a base table and then use a SQL file to define a materialized view that aggregates data from it for faster querying. For instance, let's define a SQLAlchemy model for a user account:

class User(Base):
    __tablename__ = "user_account"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(30))
    team_name: Mapped[Optional[str]] = mapped_column(String(30))
    points: Mapped[int] = mapped_column(Integer)

Then use plain SQL to define a materialized view that aggregates the total points per team:

CREATE MATERIALIZED VIEW team_points AS
SELECT team_name, SUM(points) AS total_points
FROM user_account
GROUP BY team_name;

Atlas will read both the SQLAlchemy model and the SQL file and generate any necessary SQL migration scripts to apply the changes to the database.

CI/CD Pipelines

Although it happens more frequently than you might hope, database schema migrations should not be executed from a developer's workstation. Running migrations in such a way is error-prone and can lead to inconsistencies between your codebase and the database schema.

Instead, migrations should be applied as part of your CI/CD pipeline. This ensures that only code that was reviewed, approved and merged is deployed to production. Additionally, it reduces the need to grant developers direct access to the production database, which can be a security and compliance risk.

Django and SQLAlchemy are unopinionated about how you run migrations in your CI/CD pipeline. They provide the basic building blocks (e.g., manage.py migrate for Django) and leave it up to you to integrate them into your pipeline.

For simple use-cases, this is fine. But as your project grows, you might find yourself needing more control over the migration process. For example, you might want to:

  1. Automate code review. Automatically verify that migrations are safe to apply before running them. Integrating automatic checks into your CI/CD pipeline can help catch issues early and prevent bad migrations from being applied.

  2. Integrate with CD systems. As systems evolve, organizations often adopt more complex deployment strategies that require advanced tooling (e.g., GitOps, Infrastructure as Code). Integrating migrations natively into these systems requires a substantial engineering effort (e.g., writing a Kubernetes Operator or Terraform provider).

  3. Monitor schema drift. As much as we'd like to believe that production environments are air-tight, and never touched by human hands, the reality is that things happen. Monitoring schema drift can help you catch unexpected changes to your database schema and take corrective action before they cause issues.

Atlas ships with native integrations for popular CI/CD systems like GitHub Actions, GitLab CI, BitBucket Pipelines, Kubernetes, Terraform, and more. This allows you to easily integrate schema management into your existing CI/CD pipelines without having to write brittle custom scripts or plugins.

One migration tool to rule them all

If your company's tech stack is uniform and everyone is using the same ORM and database system, you might not be worried about the need to standardize on a single migration tool, but as companies grow, the tech stack can become more diverse.

This is especially true when adopting a microservices architecture, as different teams might be using different ORMs, languages, or database systems. While this is great for flexibility, it can make it very difficult for platform, security, and compliance functions to ensure that all teams are following the same best practices. This is where choosing a single migration tool can help.

Atlas is designed to be a universal schema management tool that can be used across different ORMs, languages, and database systems. It provides a consistent experience for managing database schema, regardless of the underlying technology stack.

By providing a plugin system with bindings for different ORMs and database systems, Atlas can serve as the common denominator for schema management across your organization.

Conclusion

Django Migrations and Alembic are great tools that have served Python developers well. They make schema changes simple and work seamlessly with their respective ORMs. But ORMs focus on abstracting databases, not managing them.

For teams that need more advanced database features, better CI/CD integration, or consistency across multiple stacks — a dedicated schema management tool like Atlas can help. It works alongside ORMs, letting developers define schema as code while keeping full control over the database.

If you're running into the limits of ORM-based migrations, give Atlas a try!

Atlas v0.31: Custom schema rules, native pgvector support and more

· 7 min read
Rotem Tamir
Building Atlas

Hey everyone!

Welcome to the second Atlas release of 2025, v0.31! We're excited to share the latest updates and improvements with you. In this release you will find:

  • Custom schema rules: You can now define custom rules for your database schema and have Atlas enforce them for you during CI.
  • pgvector support: We've added support for managing schemas for projects that use the pgvector extension, widely used in LLM-based applications.
  • Drift detection: It is now simpler to set up drift detection checks to alert you when a target database isn't in the state it's supposed to be in.
  • Multi-project ER Diagrams: You can now create composite ER diagrams that stitch together schema objects from multiple Atlas projects.

Enforce custom schema rules

The schema linting rules language is currently in beta and available for enterprise accounts and paid projects only. To start using this feature, run:

atlas login

Atlas now supports the definition of custom schema rules that can be enforced during local development or CI/CD pipelines. This feature is particularly useful for teams that want to ensure that their database schema adheres to specific conventions or best practices, such as:

  • "Require columns to be not null or have a default value"
  • "Require foreign keys to reference primary keys of the target table and use ON DELETE CASCADE"
  • "Require an index on columns used by foreign keys"
  • ... and many more

The linted schema can be defined in any supported Atlas schema format, such as HCL, SQL, ORM, a URL to a database, or a composition of multiple schema sources.

Editor Support

Schema rule files use the .rule.hcl extension and are supported by the Atlas Editor Plugins.

Here's a simple example of a schema rule file:

schema.rule.hcl
# A predicate that checks if a column is not null or has a default value.
predicate "column" "not_null_or_have_default" {
  or {
    default {
      ne = null
    }
    null {
      eq = false
    }
  }
}

rule "schema" "disallow-null-columns" {
  description = "require columns to be not null or have a default value"
  table {
    column {
      assert {
        predicate = predicate.column.not_null_or_have_default
        message   = "column ${self.name} must be not null or have a default value"
      }
    }
  }
}

The example above defines a rule that checks if all columns in a schema are either not null or have a default value.

To use it, you would add the rule file to your atlas.hcl configuration:

atlas.hcl
lint {
  rule "hcl" "table-policies" {
    src = ["schema.rule.hcl"]
  }
}

env "local" {
  src = "file://schema.lt.hcl"
  dev = "sqlite://?mode=memory"
}

Suppose we used the following schema:

schema.lt.hcl
schema "main" {
}

table "hello" {
  schema = schema.main
  column "id" {
    type = int
    null = true
  }
}

Next, Atlas offers two ways to invoke the linting process:

  1. Using the migrate lint command in the Atlas CLI. This lints only new changes that are introduced in the linted changeset - typically the new migration files in the current PR.
  2. Using the newly added schema lint command in the Atlas CLI. This lints the entire schema, including all tables and columns, and can be used to enforce schema rules during CI/CD pipelines.

Let's use the schema lint command to lint the schema:

atlas schema lint --env local -u env://src

The command will output the following error:

Analyzing schema objects (3 objects in total):

rule "disallow-null-columns":
-- schema.lt.hcl:7: column id must be not null or have a default value

-------------------------
-- 27.791µs
-- 1 diagnostic

Great! Atlas has detected that the id column in the hello table is nullable and has raised an error. You can now update the schema to fix the error and rerun the linting process.

For more information on defining custom schema rules, check the docs for custom schema rules.

Manage schemas for LLM-based projects with pgvector

pgvector has become the de-facto standard for storing high-dimensional vectors in PostgreSQL. It is widely used in LLM-based projects that require fast similarity search for large datasets. Use cases such as recommendation systems and RAG (Retrieval-Augmented Generation) pipelines rely on pgvector to store embeddings and perform similarity searches.

In this release, we've added support for managing schemas for projects that use pgvector. You can now use Atlas to automate schema migrations for your pgvector projects.

Suppose you are building a RAG pipeline that uses OpenAI embeddings to generate responses to user queries. Your schema might look like this:

schema "public" {
}

extension "vector" {
schema = schema.public
}

table "content" {
schema = schema.public
column "id" {
type = int
}
column "text" {
type = text
}
column "embedding" {
type = sql("vector(1536)")
}
primary_key {
columns = [ column.id ]
}
index "idx_emb" {
type = "IVFFlat"
on {
column = column.embedding
ops = "vector_l2_ops"
}
storage_params {
lists = 100
}
}
}

Notice three key elements in the schema definition:

  1. The extension block defines the vector extension, enabling pgvector support for the schema.
  2. The column block defines a column named embedding with the vector(1536) type, which stores 1536-dimensional vectors.
  3. The index block defines an index on the embedding column, which uses the IVFFlat method for similarity search.

For more information on defining pgvector powered indexes, check the docs for index storage parameters.
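Once this schema is in place, similarity search is an ordinary SQL query. The example below is a small sketch using pgvector's <-> L2 distance operator, which matches the vector_l2_ops operator class above; $1 stands for a 1536-dimensional query embedding supplied by the application:

-- Return the five stored chunks closest to the query embedding.
SELECT id, text
FROM content
ORDER BY embedding <-> $1
LIMIT 5;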

Simpler setup for Drift Detection

Schema drift, which refers to a state where there is a difference between the intended database schema and its actual state, can lead to significant issues such as application errors, deployment failures, and even service outages.

In this release, we've simplified the process of setting up drift checks using the Atlas Cloud UI.

If you already use Atlas to run migrations, enable drift detection to stay on top of any manual schema changes that may cause a painful outage.

Watch the demo:

Multi-project ER Diagrams

Many teams using Atlas adopt microservices-oriented architectures to build their applications. This approach allows them to develop, scale, and deploy services independently, but it also introduces a unique challenge: database schemas are often spread across multiple logical (and sometimes physical) databases.

While these architectures excel at modularity (by decoupling teams from one another) they make it harder to visualize, manage, and reason about the overall data model. Traditional Entity-Relationship (ER) diagrams are designed for monolithic databases, where all relationships and constraints exist within a single system. However, in a microservices setup, relationships between entities often span across services, sometimes without explicit foreign keys or enforced constraints.

To help teams overcome this challenge, we are happy to announce today the availability of composite, multi-project ER diagrams in Atlas Cloud. Using this feature, teams can create diagrams that are composed of database objects from multiple Atlas projects.

See it in action:

Over years of development, database schemas tend to become complex and may grow to encompass hundreds or even thousands of objects. Using the Atlas data modeling tool, users can curate smaller subgraphs of the schema, named "saved filters", which can serve as documentation for a certain aspect of the system. To streamline this process, our team has added a new context menu that makes it easier to include connected objects (e.g., tables that are connected via a Foreign Key) in saved filters.

Atlas ER Diagram Related Object Selection

Watch the example:

New Trust Center

At Ariga, the company behind Atlas, we are committed to meeting the evolving security and compliance needs of our customers. As part of this commitment, we've consolidated our legal and compliance documentation into a single page in our new Trust Center.

Wrapping Up

We hope you enjoy the new features and improvements. As always, we would love to hear your feedback and suggestions on our Discord server.

Simplified Schema Monitoring, Drizzle support, Bitbucket, and more

· 6 min read
Rotem Tamir
Building Atlas

Happy new year everyone, and welcome to our first release of 2025, Atlas v0.30! We have some exciting new features and improvements to share with you.

In this release you will find:

  1. Simplified Schema Monitoring: Previously you needed to install a long-running agent on your database VPC to monitor your schema. Schema monitoring is now even simpler with the introduction of a new agentless monitoring mode.
  2. Drizzle Support: We now support Drizzle, a popular ORM for Node.js. You can now use Atlas to automate schema migrations for your Drizzle projects.
  3. Bitbucket Pipelines: We have added support for Bitbucket Pipelines, making it easier to integrate Atlas into your Bitbucket CI/CD workflows.
  4. Custom Kubernetes Configurations: Atlas Pro users can now provide custom atlas.hcl configurations for their Kubernetes Custom Resource Definitions (CRDs) using the Atlas Operator.
  5. txtar Editor Support: The Atlas JetBrains plugin now supports editing txtar files, used by the Atlas CLI to define pre-migration checks but also useful for other purposes.

Simplified Schema Monitoring

We released Atlas Schema Monitoring a few months ago to enable teams to track changes to the schema of any database. Previously, this workflow required installing a long-running agent process on your infrastructure.

To simplify things further, you can now run schema monitoring workflows directly from your GitHub Actions pipelines.

Learn how in this quickstart guide, or watch the video:

Drizzle Support

Following popular demand from the Drizzle community, we are excited to announce our official guide on using Atlas to manage database schemas for Drizzle projects.

Drizzle already provides a powerful migration tool as part of the drizzle-kit CLI, which handles many schema management needs seamlessly. However, because it is deeply integrated with the Drizzle ORM, there are scenarios where a standalone schema management solution like Atlas might be a better fit.

In collaboration with the Drizzle team, we're thrilled to highlight a brand-new feature introduced in drizzle-kit: the drizzle-kit export command. This feature, developed in partnership with our team, allows you to easily export your existing schema for use with Atlas.

To learn more, check out our Drizzle guide.

Bitbucket Pipes

Atlas is designed to integrate seamlessly with your CI/CD workflows. Today we are excited to announce that we natively support Bitbucket Pipes, making it easier to automate your database schema management tasks with Atlas.

Using the new Bitbucket integration, users can easily perform schema management related tasks, such as running migrations, verifying schema changes, and monitoring schema drift, directly from their Bitbucket Pipelines.

For example, you can use the arigaio/atlas-action pipe to run migrations on your target database:

image: atlassian/default-image:3
pipelines:
  branches:
    master:
      - step:
          name: "Applies a migration directory to a target database"
          script:
            - name: "Migrate Apply"
              pipe: docker://arigaio/atlas-action:master
              variables:
                ATLAS_ACTION: "migrate/apply" # Required
                ATLAS_INPUT_URL: ${DATABASE_URL}
                ATLAS_INPUT_DIR: "file://migrations"
            - source .atlas-action/outputs.sh

To learn more, check out our Bitbucket Pipes guide.

Custom Kubernetes Configurations

The Atlas Operator now supports custom configurations for your Kubernetes CRDs. This feature is available to Atlas Pro users and enables use cases like loading credentials from an external secret store or defining custom linting rules for verifying the safety of your schema changes.

To enable this feature, install the latest version of the Atlas Operator Helm chart using the allowCustomConfig flag. Then, provide your custom atlas.hcl configuration file in the spec.config field of your Atlas CRD:

atlas-schema.yaml
apiVersion: db.atlasgo.io/v1alpha1
kind: AtlasSchema
metadata:
  name: sample
spec:
  envName: "test"
  schema:
    sql: |
      create table users (
        -- redacted for brevity
      );
  cloud:
    repo: atlas-operator
    tokenFrom:
      secretKeyRef:
        name: atlas-token
        key: ATLAS_TOKEN
  vars:
    - key: "db_url"
      valueFrom:
        secretKeyRef:
          name: schema-db-creds
          key: url
    - key: "dev_db_url"
      valueFrom:
        configMapKeyRef:
          name: schema-db-dev-creds
          key: url
  config: |
    variable "db_url" {
      type = string
    }
    variable "dev_db_url" {
      type = string
    }
    env "test" {
      url = var.db_url
      dev = var.dev_db_url
    }

For more information, check out the Atlas Operator documentation.

txtar Editor Support

A few months ago, we shared our journey of building a robust testing framework for the Atlas CLI, in a blog post titled How Go Tests go test (originally created for GopherCon Israel 2024).

The post got a lot of attention in many Go communities (including Golang Weekly #552) as it demonstrated how to utilize the Go team's internal tool for testing the Go toolchain itself.

Even more exciting, it seems that the post has stirred some interest in the Go community, with some fairly large projects adopting the same approach for their testing frameworks (see acceptance testing for the GitHub CLI and Cilium).

Over the past few years using this method for testing the Atlas CLI, our team has grown very fond of writing tests in txtar format. We found it to be a very expressive and concise way to define test cases, and it has become a core part of our testing strategy. Our appreciation for txtar has led us to adopt it in other parts of our tooling, including making it the way to define pre-migration checks in Atlas.

To make it easier for our users (and for other teams using txtar for their own purposes) to work with these files, we have added txtar editing support to the Atlas JetBrains plugin. This feature is available in the latest version of the plugin, which you can download from the JetBrains plugin repository.

In addition to txtar support, this release also includes support for HCL code formatting, making it easier than ever to write and maintain your Atlas schemas. To see both features in action, check out the video below.

More News and Updatesโ€‹

  • SOC2 Compliance. We have recently completed our SOC2 re-certification for the third year in a row. This certification demonstrates our commitment to providing a secure and reliable infrastructure for our users and customers. You can read more about this in our recent blog post.
  • Podcast Appearances. Catch Atlas on some recent podcast episodes: Kube.fm, The IaC Podcast, and Amazic.

Wrapping Upโ€‹

We hope you enjoy the new features and improvements. As always, we would love to hear your feedback and suggestions on our Discord server.

Atlas is now SOC2 Certified for 2024

ยท 3 min read
Rotem Tamir
Building Atlas

Today we are happy to announce that Atlas has achieved SOC2 compliance for the third year in a row. This is an important milestone for us, demonstrating our commitment to providing a solid infrastructure for our users and customers.

[Image: soc2-atlas-ariga-compliance]

As a company that is trusted by its customers to handle mission-critical databases, we are committed to ensuring the highest standards of security, availability, and confidentiality. Achieving SOC 2 compliance demonstrates our dedication to safeguarding customer data, maintaining trust, and adhering to industry best practices.

Our commitment to process only metadataโ€‹

[Image: Control #70 from Ariga's full SOC2 audit report, stating our commitment to not process records from your database.]

As anyone in the compliance domain knows, audits are about setting a high bar and then building controls to ensure they are met throughout our day-to-day operations. While these audits often address external requirements, such as regulatory mandates, they also serve as an opportunity to build trust with customers by addressing critical areas of concern within the company.

This year, we chose to use our audit process to address a common question from customers regarding how we handle their data. As a schema management tool, Atlas interacts with critical and sensitive data assets of our customers. This naturally raises concerns for compliance and security teams, as they are entrusted with protecting data on behalf of their own customers.

Atlas, and Atlas Cloud, our SaaS offering, are designed with a foundational principle: we do not store, send, or process user data, only metadata. We have consistently communicated this commitment to compliance teams, and after thorough discussions and reviews, they have been satisfied with this approach. However, this year we decided to formalize this commitment within our compliance framework.

As part of our SOC 2 audit, we introduced Control #70 which states: "The company does not process or store records from the customer's managed databases, but only handles information schema and metadata related to them."

By incorporating this control, we have established a clear, auditable process that reinforces our promise to our customers and ensures that this principle remains at the core of how we operate moving forward.

To summarize, achieving SOC 2 compliance for the third year reflects our core belief as engineers: security, privacy, and automation should drive auditable processes. SOC 2 provides the framework to solidify these principles into a trusted, transparent process for our customers.

If your compliance team needs access to our report or other documents, drop us a line!

Case Study: How Unico's Platform Engineering Team Closed the DevOps/Databases Gap Using Atlas

ยท 6 min read
Rotem Tamir
Building Atlas

"Month over month, we see smaller and smaller incidents.", Luiz Casali, Sr Engineering Manager

Company Backgroundโ€‹

Unico is a leading digital identity technology provider in Brazil, developing secure and efficient digital identity solutions for businesses and individuals. Their platform helps organizations streamline identity verification processes, delivering a seamless user experience while enhancing security and reducing fraud.

The Missing Layer: Tackling Complexity, Outages, and Risks in Database Schema Managementโ€‹

At Unico, the Platform Engineering team, led by Luiz Casali, is focused on improving developer productivity. "Reducing complexity for developers is one of our top priorities," Luiz explained.

Unico's Platform team had previously built solutions to automate CI/CD workflows for code (using Bazel and GitHub Actions) and for infrastructure (using Terraform and Atlantis), but it was missing a standardized solution for managing database schema changes.

[Image: cicd-database-gap]

This gap introduced several pressing issues:

  1. Risky Manual Processes: Database schema changes (migrations) were performed manually, increasing the chance of human error.
  2. Unreliable Deployments: Unplanned, database-related outages were common, emphasizing the need for a safer way to handle database changes.
  3. Compliance Requirements: The team needed to document and review schema changes to maintain governance standards, but the lack of automation made this challenging.

Determined to bridge this gap and establish a safer, more efficient solution for developers, Unico's Platform Engineering team began researching the best database migration tools. Thiago da Silva Conceição, a Site Reliability Engineer (SRE) on the team, took the lead on this technical evaluation.

The Challenge of Managing Database Schema Migrationsโ€‹

Traditional schema migration tools posed significant challenges for Unico. As Thiago noted, "Automation with our previous tool was never easy to maintain. It followed a specific pattern, and only a few team members were familiar with it." The team faced limitations that affected usability, integration, and overall productivity:

  • Limited Usability and Adoption: The tool required extensive knowledge, and documentation was limited, making it difficult to adopt across the team.
  • Lack of Automation Support: Automated migrations and reliable error-handling were lacking, leading to inconsistent deployments and a need for additional manual oversight.
  • Compliance Difficulties: The absence of automated documentation and governance features made it challenging to maintain and provide records for audit and compliance requirements.

With these challenges in mind, Unico needed a solution that could offer usability, integration with existing tools, and comprehensive metrics to continuously monitor and improve database migrations.

Evaluating Alternatives and Choosing Atlasโ€‹

"In the end, choosing Atlas was easy. It is a simple, yet powerful tool, offering a significant impact with many ready-made features that would require customization in other tools."
Thiago Silva Conceição, SRE, Unico

During the search for a new solution, Unico's engineering team prioritized several criteria:

  1. Ease of Use: The tool needed to be straightforward and accessible for all developers, not just a few specialized team members.
  2. Integration and Compatibility: It had to fit naturally with Unico's technology stack, particularly with Terraform, which was already in heavy use.
  3. Metrics and Alerts: Real-time insights and alerts were essential to monitor migrations effectively.

Thiago compared a few traditional solutions before selecting Atlas. Atlas's declarative schema-as-code approach, along with its HCL compatibility and robust cloud management, aligned well with Unico's needs. It allowed the team to automate migrations, eliminate manual errors, and centralize schema management, creating a unified experience across their projects. "Atlas allowed us to keep the DDL in HCL while still supporting SQL scripts for specific use cases through its versioning model," Thiago shared.

One Migration Tool to Rule Them Allโ€‹

[Image: schema-migration-tool]

Another key priority for Unico's Platform Engineering team was standardization. With multiple teams working across diverse programming languages and databases, they needed a unified migration tool that would cover a wide array of use cases without sacrificing ease of use or reliability. To simplify the developer experience and streamline internal operations, they aimed to find a single solution that could support all teams consistently and seamlessly.

Atlas emerged as the ideal fit by providing plugin support for various databases, ORMs and integrations, making it a flexible tool for Unico's entire tech stack. The ability to standardize migration management with Atlas allowed Unico's Platform Engineering team to enforce consistent practices across all projects. Atlas became the single source of truth for schema management, offering a centralized framework for building policies, integrating into CI/CD pipelines, and supporting developers.

By implementing Atlas as a standard, the Platform Engineering team eliminated the need to train on or maintain multiple tools, reducing complexity and operational overhead. Now, Unico's developers enjoy a unified experience, and the platform team has only one tool to integrate, support, and scale as the company grows.

Implementation: A Smooth Migration Processโ€‹

The migration to Atlas was seamless, with no need to recreate migration files or impose rigid formats. "We simply imported the schemas from the production database, keeping the process straightforward and efficient," Thiago said. The team was able to quickly onboard Atlas and start seeing results, with pre-built actions in Atlas Cloud providing essential metrics, notifications, and dashboards for tracking progress.

This success reinforced the decision to adopt Atlas:

"Month over month, we see smaller and smaller incidents.

โ€” Luiz Casali, Senior Engineering Manager

Outcome: Faster Development Cycles, Increased Reliability, and Enhanced Complianceโ€‹

With Atlas in place, Unico's Platform Engineering team has achieved several key outcomes:

  • Accelerated Development Cycles: Automation of database migrations streamlined the development process, enabling faster iterations and more rapid deployments.
  • Increased Reliability: Atlas's linting and testing tools reduced migration errors and enhanced deployment stability, contributing to Unico's goal of reducing incidents.
  • Enhanced Compliance: Atlas's automated documentation ensures that each migration step is recorded, simplifying compliance by providing a clear, auditable record of all schema changes.

By automating these processes, the team has successfully reduced manual work and achieved a more predictable migration workflow. Now, as Unico grows, they are confident that their migration practices will scale smoothly, keeping operational costs in check without sacrificing speed or reliability.

[Image: ci-cd-database-gap-close]

Getting Startedโ€‹

Atlas brings the declarative mindset of infrastructure-as-code to database schema management, similar to Terraform, but focused on databases. Using its unique schema-as-code approach, teams can quickly inspect existing databases and get started with minimal setup.

Like Unico, we recommend that anyone looking for a schema migration solution get started with Atlas by trying it out on one or two small projects. Dive into the documentation, join our Discord community for support, and start managing your schemas as code with ease.

New Release: Approval flows for Kubernetes, Prisma support, and more!

ยท 5 min read
Rotem Tamir
Building Atlas

Hey everyone!

We are excited to announce the release of Atlas v0.29, which continues our journey to make working with databases easier, safer, and more reliable. This release includes several significant updates that we are happy to share with you:

  • Approval flows for the Kubernetes Operator: Moving to a declarative way of managing database schemas has plenty of advantages, but many teams want to see and approve changes before they are applied. Doing this from the CLI is straightforward, but until recently it was not easy to provide this experience in Kubernetes-based workflows.

    With the new approval flows, you can now review and approve schema migrations seamlessly, ensuring database changes are well-governed while maintaining deployment velocity.

  • Prisma support: Following our integrations with some of the most popular ORMs in our industry, we are happy to announce our official guide on using Atlas to manage database schemas for Prisma projects.

  • GitLab CI/CD Components: Integrating GitLab CI with Atlas just got much easier with the new GitLab CI/CD components.

  • IntelliJ Plugin: Our IntelliJ plugin has been upgraded with code folding, inline SQL syntax highlighting and suggestions, and syntax highlighting within heredoc clauses.

  • Timeseries Engine support for ClickHouse: ClickHouse users can now explore beta support for timeseries data in Atlas.

  • Constraint Triggers support for PostgreSQL: PostgreSQL users can now manage constraint triggers with Atlas.

Approval Flows for the Kubernetes Operatorโ€‹

Moving to a declarative way of managing database schemas has plenty of advantages, but many teams want to see and approve changes before they are applied.

Providing flows for keeping a human-in-the-loop from the CLI is straightforward, but until recently it was not easy to provide this experience in Kubernetes-based workflows.

Following our recent KubeCon session, the Atlas Operator now includes approval flows for declarative schema migrations, making database changes in Kubernetes safer:

  1. Pre-approvals - with pre-approvals, teams enhance their CI pipelines to detect schema changes and integrate their planning and approval into the code review process. The approved plans are then applied to the database by the Operator.
  2. Ad-hoc Approvals - with ad-hoc approvals, the operator pauses the migration process and waits for human approval before applying the schema change. This is useful for schema changes that were not approved in advance or for projects that do not have a strict pre-approval policy.

Prisma Supportโ€‹

Following popular demand from the Atlas community, we are excited to announce our official guide on using Atlas to manage database schemas for Prisma projects.

Prisma already has an excellent migration tool called prisma migrate, so why would you want to use Atlas with Prisma? In many cases, Prisma's migrate indeed checks all the boxes for managing your database schema. However, because it is tightly coupled with the Prisma ORM, some use cases may call for a dedicated schema management tool that can be used across different ORMs and frameworks.

This guide shows how to load your existing prisma.schema file into Atlas, manage your schema changes, and apply them to your database using the Atlas CLI.
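
As a rough sketch of the kind of setup the guide describes (the exact configuration in the guide may differ), Atlas can treat the Prisma data model as an external schema by asking the Prisma CLI to render it as SQL:

data "external_schema" "prisma" {
  # Ask Prisma to emit the DDL for the data model as a SQL script.
  program = [
    "npx", "prisma", "migrate", "diff",
    "--from-empty",
    "--to-schema-datamodel", "prisma/schema.prisma",
    "--script"
  ]
}

env "prisma" {
  src = data.external_schema.prisma.url
  # Atlas uses a dev database to normalize the schema and plan changes.
  dev = "docker://postgres/16/dev?search_path=public"
}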

Interested in learning more? Read the guide!

GitLab CI/CD Componentsโ€‹

Integrating GitLab CI with Atlas just got much easier with our new GitLab CI/CD components.

GitLab CI/CD components are reusable YAML templates that you can use in your GitLab CI/CD pipelines to automate workflows within your GitLab project. Our newly published components are designed to simplify the process of integrating Atlas with GitLab CI/CD pipelines, enabling you to automate database schema management tasks with ease.
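
Conceptually, using a component boils down to a short include in your .gitlab-ci.yml. The component path and input names below are placeholders shown only to illustrate the mechanism; the tutorial lists the real ones:

include:
  # Hypothetical component path and inputs, for illustration only.
  - component: gitlab.com/<group>/<atlas-components-project>/migrate-apply@<version>
    inputs:
      dir: "file://migrations"
      url: $DATABASE_URL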

Want to learn more? Read the tutorial.

IntelliJ Pluginโ€‹

Our IntelliJ plugin has been upgraded with code folding, inline SQL syntax highlighting and suggestions, and syntax highlighting within heredoc clauses. Our goal with these efforts is to make writing real world database applications with Atlas easier and more enjoyable.

If you use JetBrains editors, be sure to download the most recent version.

Timeseries Data Support for ClickHouseโ€‹

ClickHouse recently added support for an experimental TimeSeries engine, which is designed to optimize storage and query performance for time-series data.

Atlas now supports this experimental feature, enabling ClickHouse users to manage schemas for their time-series tables with ease.

You can simply define a TimeSeries table in your Atlas schema:

table "example" {
schema = schema.public
engine = TimeSeries
}

PostgreSQL Constraint Triggersโ€‹

A CONSTRAINT TRIGGER is a PostgreSQL extension to the SQL standard that works like a regular trigger but allows its execution time to be dynamically controlled using the SET CONSTRAINTS command.

Starting with this version, users can define constraint triggers, and Atlas will manage their lifecycles. Their definitions are also supported in the Atlas HCL syntax:

trigger "users_insert" {
on = table.users
constraint = true
before {
insert = true
}
// ...
}

trigger "groups_insert" {
on = table.users
constraint = true
deferrable = INITIALLY_DEFERRED
before {
insert = true
}
// ...
}

Read more about constraint triggers in the Atlas documentation.

Wrapping Upโ€‹

We hope you enjoy the new features and improvements. As always, we would love to hear your feedback and suggestions on our Discord server.

The Hard Truth about GitOps and Database Rollbacks

ยท 16 min read
Rotem Tamir
Building Atlas

Prepared for KubeCon North America 2024

Introductionโ€‹

For two decades now, the common practice for handling rollbacks of database schema migrations has been pre-planned "down migration scripts". A closer examination of this widely accepted truth reveals critical gaps that result in teams relying on risky, manual operations to roll back schema migrations in times of crisis.

In this post, we show why our existing tools and practices cannot deliver on the GitOps promise of "declarative" and "continuously reconciled" workflows and how we can use the Operator Pattern to build a new solution for robust and safe schema rollbacks.

The Undo Buttonโ€‹

One of the most liberating aspects of working on digital products is the ability to roll back changes. The Undo Button, I would argue, is one of the most transformative features of modern computing.

Correcting mistakes on typewriters was arduous. You would roll the carriage back and type over any errors, leaving messy, visible corrections. For bigger changes, entire pages had to be retyped. Correction fluid like whiteout offered a temporary fix, but it was slow and imprecise, requiring careful application and drying time.

Digital tools changed everything. The Undo Button turned corrections into a simple keystroke, freeing creators to experiment without fear of permanent mistakes. This shift replaced the stress of perfection with the freedom to try, fail, and refine ideas.

Rollbacks and Software Deliveryโ€‹

When it comes to software delivery, having an Undo Button is essential as well. The ability to roll back changes to a previous state is a critical safety net for teams deploying new features, updates, or bug fixes. Specifically, rollbacks impact one of the key metrics of software delivery: Mean Time to Recovery (MTTR).

MTTR is a measure of how quickly a system can recover from failures. When a deployment fails, or a bug is discovered in production, teams generally have two options: triage and fix the issue (roll forward), or roll back to a previous known stable state.

When the fix to an issue is not immediately clear, or the issue is severe, rolling back is often the fastest way to restore service. This is why having a reliable rollback mechanism is crucial for reducing MTTR and ensuring high availability of services.

How are rollbacks even possible?โ€‹

Undoing a change in a local environment like a word processor is straightforward. There are multiple ways to implement an Undo Button, but they all rely on the same basic principle: the system keeps track of every change made and can revert to a previous state.

In a distributed system like modern, cloud-native applications, things are not so simple. Changes are made across multiple components with complex dependencies and configurations.

The key capability that enables rolling back changes is described in the seminal book, "Accelerate: The Science of Lean Software and DevOps". The authors identify "Comprehensive Configuration Management" as one of the key technical practices that enables high performance in software delivery:

"It should be possible to provision our environments and build, test, and deploy our software in a fully automated fashion purely from information stored in version control.โ€ 1

In theory, this means that if we can store everything there is to know about our system in version control, and have an automated way to apply these changes, we should be able to roll back to a previous state by simply applying a previous commit.

GitOps and Rollbacksโ€‹

The principle of "Comprehensive Configuration Management" evolved over the years into ideas like "Infrastructure as Code" and "GitOps". These practices advocate for storing all configuration and infrastructure definitions in version control in a declarative format, and using automation to apply these changes to the system.

Projects like ArgoCD and Flux have popularized the GitOps approach to managing Kubernetes clusters. By providing a structured way to define the desired state of your system in Git (e.g., Kubernetes manifests) and automatically reconciling the actual state with it, GitOps tools offer a standardized way to satisfy this principle.

On paper, GitOps has finally brought us a working solution for rollbacks. Revert the commit that introduced the issue, and all of your problems are gone!

Problem solved. End of talk. Right?

The Hard Truthโ€‹

Teams that have tried to fully commit to the GitOps philosophy usually find that the promise of "declarative" and "continuously reconciled" workflows is not as straightforward as it seems. Let's consider why.

Declarative resource management works exceptionally well for stateless resources like containers. The way Kubernetes handles deployments, services, and other resources is a perfect fit for GitOps. Consider how a typical deployment rollout works:

  1. A new replica set is created with the new version of the application.
  2. Health checks are performed to ensure the new version is healthy.
  3. Traffic is gradually shifted to healthy instances of the new version.
  4. As the new version proves stable, the old version is scaled down and eventually removed.

But will this work for stateful resources like databases? Suppose we want to change the schema of a database. Could we apply the same principles to roll back a schema migration? Here's what it would look like:

  1. A new database is spun up with the new schema.
  2. Health checks are performed to ensure the new schema is healthy.
  3. Traffic is gradually shifted to the new database instance.
  4. The old database is removed.

This would get the job done... but you would probably find yourself out of a job soon after.

Stateless resources are easy to manage because we can always throw out whatever isn't working and start fresh. Databases are different. They are stateful, consisting not only of a software component (the database engine) and its configuration (server parameters and schema), but also of the data itself. And the data, by definition, cannot be provisioned from version control.

Stateful resources like databases require a different approach to manage changes.

Up and Down Migrationsโ€‹

The common practice for managing schema changes in databases is to use "up" and "down" migration scripts in tandem with a migration tool (like Flyway or Liquibase). The idea is simple: when you want to make a change to the schema, you write a script that describes how to apply the change (the "up" migration). Additionally, you write a script that describes how to undo the change (the "down" migration).

For example, suppose you wanted to add a column named "short_bio" to a table named "users". Your up migration script might look like this:

ALTER TABLE users
ADD COLUMN short_bio TEXT;

And your down migration script might look like this:

ALTER TABLE users DROP COLUMN short_bio;

In theory, this concept is sound and satisfies the requirements of "Comprehensive Configuration Management". All information needed to apply and roll back the change is stored in version control.

Theory, once again, is quite different from practice.

The Myth of Down Migrationsโ€‹

After interviewing hundreds of engineers on this topic, we found that despite being a widely accepted concept, down migrations are rarely used in practice. Why?

Naive assumptionsโ€‹

When you write a down migration, you are essentially writing a script that will be executed in the future to revert the changes you are about to make. By definition, this script is written before the "up" changes are applied. This means that the down migration is based on the assumption that the changes will be applied correctly.

But what if they are not?

Suppose the "up" migration was supposed to add two columns; the down file would be written to remove both. But what if the migration was only partially applied and just one column was added? Running the down file would fail, leaving us stuck in an unknown state.
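
To make this concrete, here is a sketch with hypothetical column names:

-- up migration: intended to add two columns
ALTER TABLE users ADD COLUMN short_bio TEXT;
ALTER TABLE users ADD COLUMN avatar_url TEXT;  -- suppose this statement fails

-- down migration: written in advance, assumes both columns were added
ALTER TABLE users DROP COLUMN avatar_url;      -- fails: the column was never created
ALTER TABLE users DROP COLUMN short_bio;       -- never reached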

Yes, some databases like PostgreSQL support transactional DDL, which means that if the migration fails, the changes are rolled back and you end up with a state that is consistent with a specific revision. But even in PostgreSQL, some operations cannot be run inside a transaction, and the database can end up in an inconsistent state.

For MySQL, which does not support transactional DDL, the situation is even worse. If a migration fails halfway through, you are left with only a partially applied migration and no way to roll back.

Data lossโ€‹

When you are working on a local database, without real traffic, having the up/down mechanism for migrations might feel like hitting Undo and Redo in your favorite text editor. But in a real environment with real traffic, that is not the case.

If you successfully rolled out a migration that added a column to a table, and then decided to revert it, its inverse operation (DROP COLUMN) does not merely remove the column. It deletes all the data in that column. Re-applying the migration would not bring back the data, as it was lost when the column was dropped.

For this reason, teams that want to temporarily deploy a previous version of the application usually do not revert the database changes, because doing so would cause data loss for their users. Instead, they need to assess the situation on the ground and find some other way to handle it.

Incompatibility with modern deployment practicesโ€‹

Many modern deployment practices like Continuous Delivery (CD) and GitOps advocate for the software delivery process to be automated and repeatable. This means that the deployment process should be deterministic and should not require manual intervention. A common way of doing this is to have a pipeline that receives a commit, and then automatically deploys the build artifacts from that commit to the target environment.

As it is very rare to encounter a project with a 0% change failure rate, rolling back a deployment is something everyone needs to be prepared for.

In theory, rolling back a deployment should be as simple as deploying the previous version of the application. When it comes to versions of our application code, this works perfectly. We pull and deploy the container image corresponding to the previous version.

This strategy does not work for the database, for two reasons:

  1. For most migration tools, down or rollback is a separate command that must be invoked explicitly. This means that the deployment machinery needs to know the current version of the target database in order to decide whether to migrate up or down.
  2. When we pull artifacts from a previous version, they do not contain the down files needed to revert the database to the required schema - those files were only created in a later commit!

These gaps mean that teams are left with two options: either they need to manually intervene to roll back the database changes, or they need to develop a custom solution that can handle the rollback in an automated way.

Down Migrations and GitOpsโ€‹

Going back to our main theme of exploring whether database rollbacks and GitOps can be compatible, let's expand on this last point.

The ArgoCD documentation suggests that the way to integrate schema migrations is to use a Kubernetes Job that executes your migration tool of choice, and to annotate the Job as a PreSync hook:

The Job runs an image that is typically built as part of your CI/CD pipeline and contains the migration tool and the migration scripts for the relevant commit or release:

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: your-migration-image:{{ .Values.image.tag }} # Example using Helm values
      restartPolicy: Never

When ArgoCD detects a new commit in the Git repository, it will create a new Job that runs the migration tool. If the migration is successful, the Job will complete successfully, and the new version of the application will be deployed.

This will work for the up migration. But what happens when you need to roll back?

Teams commonly hit the two issues we mentioned above:

  1. The deployment machinery does not know what the current version of the target database is, and therefore cannot decide whether to migrate up or down.

    Unless a team has carefully thought about this and implemented a mechanism inside the image to decide what to do, the deployment machinery will always migrate up.

  2. The image that is pulled for the rollback does not contain the down files that are needed to revert the database changes back to the necessary schema. Most migration tools will silently keep the database in the current state.

What are the implications?

  1. The database is no longer in sync with the current Git commit, violating all GitOps principles.

  2. Teams that do need to roll back the database changes are left with a manual process that requires intervention and coordination.

Operators: The GitOps Wayโ€‹

The Operator Pattern is a Kubernetes-native way to extend the Kubernetes API to manage additional resources. Operators typically ship two main components: a Custom Resource Definition (CRD) that defines the new resource type, and a controller that watches for changes to these resources and takes action accordingly.

The Operator Pattern is a perfect fit for managing stateful resources like databases. By extending the Kubernetes API with a new resource type that represents a database schema, we can manage schema changes in a GitOps-friendly way. A specialized controller can watch for changes to these resources and apply the necessary changes to the database in a way that a naive Job cannot.

The Atlas Operatorโ€‹

The Atlas Operator is a Kubernetes Operator that enables you to manage your database schemas natively from your Kubernetes cluster. Built on Atlas, a database schema-as-code tool (sometimes described as "Terraform for databases"), the Atlas Operator extends the Kubernetes API to support database schema management.

Atlas has two core capabilities that are helpful to building a GitOps-friendly schema management solution:

  1. A sophisticated migration planner that generates migrations by diffing the desired state of the schema against the current state of the database.
  2. A migration analyzer that can analyze a migration, determine whether it is safe to apply, and surface risks before it is applied.

Declarative vs. Versioned Flowsโ€‹

Atlas supports two kinds of flows for managing database schema changes: declarative and versioned. They are reflected in the two main resources that the Atlas Operator manages:

Declarative: AtlasSchemaโ€‹

The first resource type is AtlasSchema which is used to employ the declarative flow. With AtlasSchema, you define the desired state of the database schema in a declarative way, and the connection string to the target database.

The Operator is then responsible for generating the necessary migrations to bring the database schema to the desired state, and applying them to the database. Here is an example of an AtlasSchema resource:

apiVersion: db.atlasgo.io/v1alpha1
kind: AtlasSchema
metadata:
  name: myapp
spec:
  url: mysql://root:pass@mysql:3306/example
  schema:
    sql: |
      create table users (
        id int not null auto_increment,
        name varchar(255) not null,
        email varchar(255) unique not null,
        short_bio varchar(255) not null,
        primary key (id)
      );

When the AtlasSchema resource is applied to the cluster, the Atlas Operator will calculate the diff between the database at url and the desired schema, and generate the necessary migrations to bring the database to the desired state.

Whenever the AtlasSchema resource is updated, the Operator will recalculate the diff and apply the necessary changes to the database.

Versioned: AtlasMigrationโ€‹

The second resource type is AtlasMigration which is used to employ the versioned flow. With AtlasMigration, you define the exact migration that you want to apply to the database. The Operator is then responsible for applying any necessary migrations to bring the database schema to the desired state.

Here is an example of an AtlasMigration resource:

apiVersion: db.atlasgo.io/v1alpha1
kind: AtlasMigration
metadata:
  name: atlasmigration-sample
spec:
  url: mysql://root:pass@mysql:3306/example
  dir:
    configMapRef:
      name: "migration-dir" # Ref to a ConfigMap containing the migration files

When the AtlasMigration resource is applied to the cluster, the Atlas Operator will apply the migrations in the directory specified in the dir field to the database at url. Similarly to classic migration tools, Atlas uses a metadata table on the target database to track which migrations have been applied.

Rollbacks with the Atlas Operatorโ€‹

The Atlas Operator is designed to handle rollbacks in a GitOps-friendly way. This is where the power of the Operator Pattern really shines as it can make nuanced and intelligent decisions about how to handle changes to the managed resources.

To roll back a schema change in an ArgoCD-managed environment, you would simply revert the AtlasSchema or AtlasMigration resource to a previous version. The Atlas Operator would then analyze the changes and generate the necessary migrations to bring the database schema back to the desired state.
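
In practice, the rollback itself is the familiar GitOps flow; a sketch, assuming ArgoCD is configured to sync the repository automatically:

# Revert the commit that changed the AtlasSchema / AtlasMigration resource
git revert <sha-of-the-schema-change-commit>
git push origin main

# ArgoCD picks up the new commit and syncs the reverted resource; the Atlas
# Operator then plans and applies whatever migration is needed to bring the
# database back to the older desired schema.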

Advantages of the Operator Patternโ€‹

In the discussion above we kept talking about edge cases that arise when rolling back database schema changes, and concluded that they require manual consideration and intervention. What if we could automate this process?

The Operator Pattern is all about codifying operational knowledge into software. Let's consider how the Operator Pattern can be used to address the challenges we discussed:

  1. Understanding intent. The Operator can discern between up and down migrations. By comparing the current state of the database with the desired version, the Operator decides whether to go up or down.

  2. Having access to the necessary information. Contrary to a Job that only has access to the image it was built with, the Operator stores metadata about the last execution as a ConfigMap via the Kubernetes API. This metadata enables the Operator to migrate down even when the image of the rolled-back version does not contain information about the newer migrations.

  3. Intelligent Diffing. Because the Operator is built on top of Atlas's Schema-as-Code engine, it can calculate correct migrations even if the database is in an inconsistent state.

  4. Safety checks. The Operator can analyze the migration and determine whether it is safe to apply. This is a critical feature that can prevent risky migrations from being applied. Depending on your policy, it can even require manual approval for specific types of changes!

Conclusionโ€‹

In this talk, we explored the challenges of rolling back database schema changes in a GitOps environment. We discussed the limitations of the traditional up/down migration approach, and how the Operator Pattern can be used to build a more robust and automated solution.

If you have any questions or would like to learn more, please don't hesitate to reach out to us on our Discord server.


Footnotesโ€‹

  1. Forsgren, Nicole, Jez Humble, and Gene Kim. Accelerate: The Science of Lean Software and DevOps. IT Revolution Press, 2018.