The Missing Chapter in the Platform Engineering Playbook
Prepared for SREDay London 2025
Introduction
Platform engineering is rapidly emerging as a discipline aimed at reducing cognitive load for developers, enabling self-service infrastructure, and establishing best practices for building and operating software at scale. While much of the conversation focuses on CI/CD, Kubernetes, and internal developer platforms, one crucial aspect often remains overlooked: database schema management.
Despite being at the heart of nearly every application, schema changes are still a major source of friction, outages, and bottlenecks. In this post, we'll explore why database schema management deserves a dedicated chapter in the platform engineering playbook and how organizations can integrate it into their platform strategies.
The prompt that nuked the database
Let me tell you a not-so-fictional story about a developer named Alice. Alice is a backend engineer at a fast-growing startup. One day, her manager asked her to make a small change to the database. The data engineering team was complaining that they were seeing duplicate emails in the users table, and they suspected that the email column did not have a unique constraint.
Busy as she was working on the next release, Alice asked her trusty GitHub Copilot to assist her:
Alice: Hey Copilot, can you add a unique constraint to the users table?
Alice's team is using Sequelize, a popular ORM for Node.js, to manage their database schema. Copilot zipped through and added the constraint to the model:
email: {
allowNull: false,
type: DataTypes.STRING,
validate: {
isEmail: true
},
+ unique: true,
},
After adding the constraint, Alice knew she had to keep her database schema in sync with the application code. To do this, she needed to add a schema migration that would add the constraint to the database. Confident in her Copilot's coding skills, she asked it to generate the migration for her:
Alice: Hey Copilot, can you please add the migration for this change?
Copilot generated:
'use strict';
module.exports = {
async up(queryInterface, Sequelize) {
await queryInterface.changeColumn('users', 'email', {
type: Sequelize.STRING,
allowNull: false,
unique: true // this adds the UNIQUE constraint
});
},
async down(queryInterface, Sequelize) {
// redacted
}
};
Alice ran the migration on her local machine, and everything seemed fine. She tested the code on the staging environment, ran the migration, and everything seemed fine there as well. Alice's team reviewed the code, which seemed simple and straightforward, and approved the PR.
Confidently, she merges the code to the main branch and deploys it to production.
And then... the API starts timing out. User signups and logins grind to a halt. CPU usage on the database spikes.
Even unrelated queries on the users table start hanging.
The Problem: Unique Constraint Validation & Table Locking
Why did this happen?
Before enforcing the constraint, PostgreSQL must verify that the table contains no duplicate emails. If the table is large, this validation scan can take minutes or even hours.
To do this, it acquires an ACCESS EXCLUSIVE lock on users while validating, meaning no reads or writes can happen on users until validation is complete. To make things worse, if existing transactions are holding locks, your migration waits indefinitely, blocking everything else.
In other words, to guarantee the integrity of the unique constraint, PostgreSQL must "stop the world" for our users table and perform a full table scan, which can take quite a while. If this is a busy table (like the users table usually is), this means that requests will start piling up, draining resources from your database and servers and making things even worse.
Post-mortem: Why did this really happen?
When something goes wrong, it's easy to blame the developer. But the truth is, Alice did everything right. She followed the best practices, tested her code, and reviewed it with her team. If the problem wasn't with Alice, where was it?
One technique for getting to the root cause of such a severe incident is the "5 Whys" technique. Let's apply it here:
- Why did the API start timing out? Because the migration locked the users table.
- Why did the migration lock the users table? Because our developers shipped broken code.
- Why did our developers ship broken code? Because it was approved by our quality assurance process.
- Why was it approved by our quality assurance process? Because we rely on manual code reviews and the reviewer didn't know about the issue.
- Why do we rely on manual code reviews? ... ?
Why do Platforms ignore Schema Management?
The incident with Alice is not unique. Schema changes are a common source of outages, and yet, they are often overlooked in platform engineering discussions. Why?
I've interviewed dozens of platform engineers over the past few years, and the answer is oftentimes: "Schema migrations are an application concern, not a platform concern."
But is this really true?
Migrations are risky business
Even with the best intentions, errors can still arise during the development and review of migrations, leading to downtime and data loss. Some of the failure modes that should be considered include:
- Destructive changes - The SQL standard includes DROP commands that can remove tables, columns, and other database objects. While powerful, these commands can lead to data loss if not used carefully. Since most databases do not feature a built-in "undo" button, it's crucial to ensure that these commands are executed with caution.
- Breaking changes - Schema changes can introduce breaking changes to the application code. For example, if a column is renamed or its type is changed, the application code may need to be updated to reflect these changes. This can lead to runtime errors if not handled properly.
- Constraint violations - Changing constraints on a table can lead to constraint violations if the existing data does not meet the new constraints. For example, adding a NOT NULL constraint to a column that already contains null values will cause the migration to fail. This is especially problematic since these errors are typically not detected in development and staging environments (where datasets are often limited and less diverse), but only in production.
- Table locks - Some changes, as we saw in Alice's case, require the database to acquire exclusive locks on the table, preventing any reads or writes during the migration. This can lead to downtime and degraded performance, especially for large tables.
For these reasons, schema changes should be treated with the same level of care and attention as application code changes. They should be tested, reviewed, and validated automatically before being deployed to production.
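As a toy illustration of what "validated automatically" can mean (real analyzers go much deeper), a CI step could scan each planned migration for known-risky SQL patterns corresponding to the failure modes above. The rule names and regular expressions here are invented for the example:

```javascript
// A toy migration "linter": flag risky SQL patterns in CI rather than
// discovering them in production. Illustrative rules only.
const RULES = [
  { name: 'destructive change', pattern: /\bDROP\s+(TABLE|COLUMN)\b/i },
  // Index builds without CONCURRENTLY take a blocking lock on PostgreSQL.
  { name: 'blocking index build', pattern: /\bCREATE\s+(UNIQUE\s+)?INDEX\b(?!.*\bCONCURRENTLY\b)/i },
  // Adding NOT NULL can fail against existing data.
  { name: 'constraint backfill risk', pattern: /\bSET\s+NOT\s+NULL\b/i },
];

function lintMigration(sql) {
  return RULES.filter(({ pattern }) => pattern.test(sql)).map(({ name }) => name);
}

console.log(lintMigration('CREATE UNIQUE INDEX idx ON users (email)'));
// flags a blocking index build
console.log(lintMigration('CREATE UNIQUE INDEX CONCURRENTLY idx ON users (email)'));
// no findings
```

A check like this is deliberately conservative: it cannot prove a migration is safe, but it cheaply blocks the patterns that are known to hurt, which is precisely the kind of guardrail a platform can provide once instead of every team rediscovering it.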
It's about to get worse: What GenAI means for databases
The rise of Generative AI (GenAI) is changing the way we write code and interact with databases. With tools like GitHub Copilot and ChatGPT, developers are increasingly relying on AI to assist them in writing code, including database queries and schema changes. Let's reflect on some of the implications of this shift:
- A lot more code, with less attention - As developers increasingly rely on AI to generate code, the amount of code being written is skyrocketing. LLMs are very adept at generating text that looks sensible but is not always correct.
On one extreme, this leads to hallucinations - where the AI generates code that is actually nonsensical. On the other extreme, it leads to unintended consequences - where the AI generates code that is syntactically correct, but semantically incorrect or has unintended side effects, as we saw with Alice's migration post-mortem.
Since the code that LLMs produce will often look right, it can be particularly tricky to catch issues with normal, manual code reviews. This is especially true for database migrations, where the LLM may generate correct code that does not consider the safety implications of the change.
- Skill-level needed to build apps is going down - As GenAI tools become more prevalent, we can expect to see a wider range of people writing code, including non-engineers. With the advent of "vibe coding" - the practice of using natural language prompts to iteratively build applications - the barrier to entry is dropping quickly: if you can write a prompt, you can write code, whether it's internal tools or use-case-specific SQL queries. This is great for democratizing access to building software, but it also means we will soon have people who are not just non-experts in databases, but who lack the technical knowledge to understand the potential implications of their changes.
- Agents are creating new pressure - As organizations start to provide users with means to create agentic workflows that interact with the company's data assets and APIs, we can expect to see quickly changing access patterns to databases. Agents are not bound by the same constraints as humans, and they can generate and execute queries at a much faster rate. This can lead to increased contention, performance degradation, and even outages if the database is not able to keep up with the increased load.
An interesting implication of this is that even legacy applications that have not been touched in years can suddenly find themselves on the "hot path" of someone's agentic workflow.
Contrary to "normal" application development, where queries are planned and discussed by architects and developers, agentic workflows by definition exhibit emergent behavior. Organizations will need to be able to quickly identify and adapt to these changes in order to maintain performance and reliability.
Case study: Unico's Database CI/CD Pipeline
As we consider whether schema management is a platform concern, let's look at a real-world example. Unico, a leading digital identity provider in Brazil, was dealing with similar challenges. Having suffered multiple outages that stemmed from database schema changes, they decided to take action.
Unico's Platform Engineering team, led by Luiz Casali, recognized that while they had streamlined CI/CD workflows for application code and infrastructure, database schema changes were still a major gap. These changes were often performed manually, leading to increased risk, unreliable deployments, and compliance headaches. Some teams were using tools that provided some automation, but they were complex and required specialized knowledge.
To address these challenges, the team sought a solution that would automate database schema management while integrating seamlessly into their existing workflows. After evaluating multiple tools, they chose Atlas, a schema-as-code solution that brought the declarative mindset of Terraform to databases.
Atlas provided Unico with multiple benefits:
- Automatic migration planning - Atlas automatically plans the migration for developers, calculating the diff between their code and the database schema.
- Automated verification - Atlas comes with a migration analysis tool that verifies the migration plan before applying it to the database, catching issues before they reach production. This step was integrated with Unico's standard CI/CD pipeline, ensuring that every migration was verified before deployment.
- Automated deployment - Atlas integrates with Unico's deployment tooling, including complex cases like automated rollbacks, completely removing the need for manual intervention.
- Compliance and governance - Atlas automatically generates documentation for every migration, providing a clear audit trail for compliance and governance purposes. Additionally, Atlas provides schema monitoring capabilities, allowing Unico to track changes to their database schema over time and make sure that no unexpected changes are introduced.
Recognizing the importance of schema management, Unico's Platform Engineering team made it a core part of their platform strategy. As a result, Unico saw a significant reduction in outages, improved developer productivity, and a more reliable platform. This success reinforced the decision to adopt Atlas:
"Month over month, we see smaller and smaller incidents."
- Luiz Casali, Senior Engineering Manager
Unico's experience highlights the clear benefits of treating schema management as a platform concern:
- Automatic Migration Planning → No DB expertise needed
- Automated Code Review → No surprises in production
- Automated Deployment → No more human errors
- Everything is Audited → Easy compliance
Why Schema Management is a Platform Concern
Schema management should be a core platform responsibility, yet it is often overlooked. Let's break it down:
- Reducing Cognitive Load - Developers shouldn't have to master database internals. Platforms should provide declarative workflows and automation, letting engineers focus on features, not migrations.
- Increasing Reliability - Schema changes cause outages when done manually. Automated validation in CI/CD catches issues early, ensuring safe, predictable deployments.
- Boosting Velocity - When schema changes are easier and safer, teams ship more of them. Instead of fearing database updates, engineers iterate confidently, reducing bottlenecks.
Let's stop treating schema management as if it's optional and treat it as a first-class citizen in our platform strategies.
Conclusion
Database schema management can no longer be just an "application concern" - it's becoming a platform responsibility. As engineering teams scale, schema changes must be automated, validated, and integrated into CI/CD workflows to reduce risk and improve developer experience. With GenAI accelerating change and reshaping who writes code and how, the need for structured, platform-driven schema management has never been greater.