
The Missing Chapter in the Platform Engineering Playbook

· 11 min read
Rotem Tamir
Building Atlas

Prepared for SREDay London 2025

Introduction​

Platform engineering is rapidly emerging as a discipline aimed at reducing cognitive load for developers, enabling self-service infrastructure, and establishing best practices for building and operating software at scale. While much of the conversation focuses on CI/CD, Kubernetes, and internal developer platforms, one crucial aspect often remains overlooked: database schema management.

Despite being at the heart of nearly every application, schema changes are still a major source of friction, outages, and bottlenecks. In this post, we'll explore why database schema management deserves a dedicated chapter in the platform engineering playbook and how organizations can integrate it into their platform strategies.

The prompt that nuked the database​

Let me tell you a not-so-fictional story about a developer named Alice. Alice is a backend engineer at a fast-growing startup. One day, her manager asked her to make a small change to the database. The data engineering team was complaining that they were seeing duplicate emails in the user table, and they suspected that the email column did not have a unique constraint.

Busy as she was working on the next release, Alice asked her trusty GitHub Copilot to assist her:

Alice: Hey Copilot, can you add a unique constraint to the email column in the users table?

Alice's team is using Sequelize, a popular ORM for Node.js, to manage their database schema. Copilot zipped through and added the constraint to the model:

email: {
  allowNull: false,
  type: DataTypes.STRING,
  validate: {
    isEmail: true
  },
+ unique: true,
},

After adding the constraint, Alice knew she had to keep her database schema in sync with the application code. To do this, she needed to add a schema migration that would add the constraint to the database. Confident in her co-pilot's coding skills, she asked it to generate the migration for her:

Alice: Hey Copilot, can you please add the migration for this change?

Copilot generated:

'use strict';

module.exports = {
  async up(queryInterface, Sequelize) {
    await queryInterface.changeColumn('users', 'email', {
      type: Sequelize.STRING,
      allowNull: false,
      unique: true // 👉 This adds the UNIQUE constraint
    });
  },

  async down(queryInterface, Sequelize) {
    // redacted
  }
};

Alice ran the migration on her local machine, and everything seemed fine. She tested the code on the staging environment, ran the migration, and everything seemed fine there as well. Alice's team reviewed the code, which seemed simple and straightforward, and approved the PR.

Confidently, she merged the code to the main branch and deployed it to production.

And then... the API starts timing out. User signups and logins grind to a halt. CPU usage on the database spikes. Even unrelated queries on the users table start hanging.

The Problem: Unique Constraint Validation & Table Locking​

Why did this happen?

Before enforcing the constraint, PostgreSQL must verify that there are no duplicate emails in the table. If the table is large, this validation can take minutes or even hours, and during that time no other queries can read from or write to the table.

To guarantee this, PostgreSQL acquires an ACCESS EXCLUSIVE lock on users while validating, so no reads or writes can happen on users until validation is complete. To make things worse, if existing transactions are holding locks, your migration waits indefinitely, blocking everything else.

In other words, to guarantee the integrity of the unique constraint, PostgreSQL must "stop the world" for our users table and perform a full table scan, which can take quite a while. If this is a busy table (like the users table usually is), this means that requests will start piling up, draining resources from your database and servers and making things even worse.
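The incident was avoidable at the SQL level. On PostgreSQL, the same constraint can be added without stopping the world: build the unique index with CREATE INDEX CONCURRENTLY, which permits concurrent reads and writes, and then attach it to the table as a constraint. Below is a sketch of what that migration might look like using Sequelize's raw-query escape hatch; the index and constraint names are made up for illustration.

```javascript
'use strict';

// Hypothetical object names, for illustration only.
const CREATE_INDEX_SQL =
  'CREATE UNIQUE INDEX CONCURRENTLY users_email_uq ON users (email);';
const ADD_CONSTRAINT_SQL =
  'ALTER TABLE users ADD CONSTRAINT users_email_uq UNIQUE USING INDEX users_email_uq;';

const migration = {
  async up(queryInterface) {
    // Each statement runs on its own: CREATE INDEX CONCURRENTLY is
    // forbidden inside a transaction block.
    await queryInterface.sequelize.query(CREATE_INDEX_SQL);
    await queryInterface.sequelize.query(ADD_CONSTRAINT_SQL);
  },

  async down(queryInterface) {
    // Dropping the constraint also drops the index that backs it.
    await queryInterface.sequelize.query(
      'ALTER TABLE users DROP CONSTRAINT users_email_uq;'
    );
  }
};

module.exports = migration;
```

Two caveats: CREATE INDEX CONCURRENTLY cannot run inside a transaction, so the migration must not be wrapped in one, and the index build fails (rather than blocks) if duplicate emails already exist, so it's worth checking for duplicates first (e.g. GROUP BY email HAVING COUNT(*) > 1).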

Post-mortem: Why did this really happen?​

When something goes wrong, it's easy to blame the developer. But the truth is, Alice did everything right. She followed the best practices, tested her code, and reviewed it with her team. If the problem wasn't with Alice, where was it?

One technique for getting to the root cause of such a severe incident is the "5 Whys" technique. Let's apply it here:

  1. Why did the API start timing out? Because the migration locked the users table.
  2. Why did the migration lock the users table? Because our developers shipped broken code.
  3. Why did our developers ship broken code? Because it was approved by our quality assurance process.
  4. Why was it approved by our quality assurance process? Because we rely on manual code reviews and the reviewer didn't know about the issue.
  5. Why do we rely on manual code reviews? ... ?

Why do Platforms ignore Schema Management?​

The incident with Alice is not unique. Schema changes are a common source of outages, and yet, they are often overlooked in platform engineering discussions. Why?

I've interviewed dozens of platform engineers over the past few years, and the answer is oftentimes: "Schema migrations are an application concern, not a platform concern."

But is this really true?

Migrations are risky business​

Even with the best intentions, errors can still arise during the development and review of migrations, leading to downtime and data loss. Some of the failure modes that should be considered include:

  • Destructive changes - The SQL standard includes DROP commands that can remove tables, columns, and other database objects. While powerful, these commands can lead to data loss if not used carefully. Since most databases do not feature a built-in "undo" button, it's crucial to ensure that these commands are executed with caution.
  • Breaking changes - Schema changes can introduce breaking changes to the application code. For example, if a column is renamed or its type is changed, the application code may need to be updated to reflect these changes. This can lead to runtime errors if not handled properly.
  • Constraint violations - Changing constraints on a table can lead to constraint violations if the existing data does not meet the new constraints. For example, adding a NOT NULL constraint to a column that already contains null values will cause the migration to fail. This is especially problematic since these errors are typically not detected in development and staging environments (where datasets are often limited and less diverse), but only in production.
  • Table locks - Some changes, as we saw in Alice's case, require the database to acquire exclusive locks on the table, preventing any reads or writes during the migration. This can lead to downtime and degraded performance, especially for large tables.
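The constraint-violation failure mode has a well-known mitigation: make the existing data satisfy the constraint before asking the database to enforce it. A hedged Sequelize sketch, using a made-up orders.status column and backfill value, backfills NULLs and only then sets NOT NULL:

```javascript
'use strict';

// Hypothetical table, column, and backfill value, for illustration only.
const BACKFILL_SQL =
  "UPDATE orders SET status = 'unknown' WHERE status IS NULL;";
const SET_NOT_NULL_SQL =
  'ALTER TABLE orders ALTER COLUMN status SET NOT NULL;';

const migration = {
  async up(queryInterface) {
    // Backfill first, so the NOT NULL validation cannot fail on
    // existing NULL rows in production.
    await queryInterface.sequelize.query(BACKFILL_SQL);
    await queryInterface.sequelize.query(SET_NOT_NULL_SQL);
  },

  async down(queryInterface) {
    await queryInterface.sequelize.query(
      'ALTER TABLE orders ALTER COLUMN status DROP NOT NULL;'
    );
  }
};

module.exports = migration;
```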

For these reasons, schema changes should be treated with the same level of care and attention as application code changes. They should be tested, reviewed, and validated automatically before being deployed to production.
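To make "validated automatically" concrete, here is a toy sketch of migration analysis in JavaScript. The rules and messages are illustrative inventions, not the rule set of any real tool; production-grade analyzers understand the planned SQL semantically rather than matching it with regular expressions.

```javascript
// Toy migration-lint sketch: flag SQL statements that match known
// risky patterns (destructive, lock-prone, or likely to fail on
// production data). Illustrative only.
const RULES = [
  { re: /\bDROP\s+(TABLE|COLUMN)\b/i,
    msg: 'destructive change: possible irreversible data loss' },
  { re: /\bSET\s+NOT\s+NULL\b/i,
    msg: 'may fail on existing NULL values in production data' },
  { re: /\bADD\s+CONSTRAINT\b[\s\S]*\bUNIQUE\b(?![\s\S]*USING\s+INDEX)/i,
    msg: 'unique validation takes an ACCESS EXCLUSIVE table lock' },
];

function lintMigration(sql) {
  // Return the message of every rule the statement trips.
  return RULES.filter(({ re }) => re.test(sql)).map(({ msg }) => msg);
}

console.log(lintMigration('ALTER TABLE users DROP COLUMN legacy_id;'));
console.log(lintMigration('SELECT 1;')); // → []
```

A real analyzer inspects the actual statements a migration will execute, which is how cases like Alice's can be caught in CI rather than in production.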

It's about to get worse: What GenAI means for databases​

The rise of Generative AI (GenAI) is changing the way we write code and interact with databases. With tools like GitHub Copilot and ChatGPT, developers are increasingly relying on AI to assist them in writing code, including database queries and schema changes. Let's reflect on some of the implications of this shift:

  • A lot more code, with less attention - As developers increasingly rely on AI to generate code, the amount of code being written is skyrocketing. LLMs are very adept at generating text that looks sensible but isn't always correct.

    On one extreme, this leads to hallucinations - where the AI generates code that is actually nonsensical. On the other extreme, it leads to unintended consequences - where the AI generates code that is syntactically correct, but semantically incorrect or has unintended side effects, as we saw with Alice's migration post-mortem.

    Since the code that LLMs produce will often look right, it can be particularly tricky to catch issues in normal, manual code reviews. This is especially true for database migrations, where the LLM may generate code that is functionally correct but ignores the safety implications of the change.

  • The skill level needed to build apps is going down - As GenAI tools become more prevalent, we can expect to see a wider range of people writing code, including non-engineers. With the advent of "vibe coding" - the practice of using natural language prompts to iteratively build applications - the bar for writing code is dropping.

    Whether it's internal tools or use-case-specific SQL queries, if you can write a prompt, you can write code. This is great for democratizing access to building software, but it also means we will soon have people who are not only non-experts in databases, but who lack the technical knowledge to understand the potential implications of their changes.

  • Agents are creating new pressure - As organizations start to provide users with means to create agentic workflows that interact with the company's data assets and APIs, we can expect to see quickly changing access patterns to databases. Agents are not bound by the same constraints as humans, and they can generate and execute queries at a much faster rate. This can lead to increased contention, performance degradation, and even outages if the database is not able to keep up with the increased load.

    An interesting implication of this is that even legacy applications that have not been touched in years can suddenly find themselves on the "hot path" of someone's agentic workflow.

    Contrary to "normal" application development, where queries are planned and discussed by architects and developers, agentic workflows by definition exhibit emergent behavior. Organizations will need to be able to quickly identify and adapt to these changes in order to maintain performance and reliability.

Case study: Unico's Database CI/CD Pipeline​

As we consider whether schema management is a platform concern, let's look at a real-world example. Unico, a leading digital identity provider in Brazil, was dealing with similar challenges. Having suffered multiple outages that stemmed from database schema changes, they decided to take action.

Unico's Platform Engineering team, led by Luiz Casali, recognized that while they had streamlined CI/CD workflows for application code and infrastructure, database schema changes were still a major gap. These changes were often performed manually, leading to increased risk, unreliable deployments, and compliance headaches. Some teams were using tools that provided some automation, but they were complex and required specialized knowledge.

To address these challenges, the team sought a solution that would automate database schema management while integrating seamlessly into their existing workflows. After evaluating multiple tools, they chose Atlas, a schema-as-code solution that brought the declarative mindset of Terraform to databases.

Atlas provided Unico with multiple benefits:

  • Automatic migration planning - Atlas automatically plans the migration for developers, calculating the diff between their code and the database schema.
  • Automated verification - Atlas comes with a migration analysis tool that verifies the migration plan before applying it to the database, catching issues before they reach production. This step was integrated with Unico's standard CI/CD pipeline, ensuring that every migration was verified before deployment.
  • Automated deployment - Atlas integrates with Unico's deployment tooling, including complex cases like automated rollbacks, completely removing the need for manual intervention.
  • Compliance and governance - Atlas automatically generates documentation for every migration, providing a clear audit trail for compliance and governance purposes. Additionally, Atlas provides schema monitoring capabilities, allowing Unico to track changes to their database schema over time and make sure that no unexpected changes are introduced.

Recognizing the importance of schema management, Unico's Platform Engineering team made it a core part of their platform strategy. As a result of this change, Unico saw a significant reduction in outages, improved developer productivity, and a more reliable platform.

This success reinforced the decision to adopt Atlas:

"Month over month, we see smaller and smaller incidents."

– Luiz Casali, Senior Engineering Manager

Unico's experience highlights the clear benefits of treating schema management as a platform concern:

✅ Automatic Migration Planning – No DB expertise needed
✅ Automated Code Review – No surprises in production
✅ Automated Deployment – No more human errors
✅ Everything is Audited – Easy compliance

Why Schema Management is a Platform Concern​

Schema management should be a core platform responsibility, yet it is often overlooked. Let's break it down:

  • Reducing Cognitive Load – Developers shouldn't have to master database internals. Platforms should provide declarative workflows and automation, letting engineers focus on features, not migrations.
  • Increasing Reliability – Schema changes cause outages when done manually. Automated validation in CI/CD catches issues early, ensuring safe, predictable deployments.
  • Boosting Velocity – When schema changes are easier and safer, teams ship more of them. Instead of fearing database updates, engineers iterate confidently, reducing bottlenecks.

Let's stop treating schema management as if it's optional and treat it as a first-class citizen in our platform strategies.

Conclusion​

Database schema management can no longer be just an "application concern" – it's becoming a platform responsibility. As engineering teams scale, schema changes must be automated, validated, and integrated into CI/CD workflows to reduce risk and improve developer experience. With GenAI accelerating change and reshaping who writes code and how, the need for structured, platform-driven schema management has never been greater.

Case Study: How Unico's Platform Engineering Team Closed the DevOps/Databases Gap Using Atlas

· 6 min read
Rotem Tamir
Building Atlas

"Month over month, we see smaller and smaller incidents." – Luiz Casali, Sr. Engineering Manager

Company Background​

Unico is a leading digital identity technology provider in Brazil, developing secure and efficient digital identity solutions for businesses and individuals. Their platform helps organizations streamline identity verification processes, delivering a seamless user experience while enhancing security and reducing fraud.

The Missing Layer: Tackling Complexity, Outages, and Risks in Database Schema Management​

At Unico, the Platform Engineering team, led by Luiz Casali, is focused on improving developer productivity. "Reducing complexity for developers is one of our top priorities," Luiz explained.

Unico's Platform team had previously built solutions to automate CI/CD workflows for code using Bazel and GitHub Actions and for infrastructure using Terraform and Atlantis. The team was missing a standardized solution for managing database schema changes.


This gap introduced several pressing issues:

  1. Risky Manual Processes: Database schema changes (migrations) were performed manually, increasing the chance of human error.
  2. Unreliable Deployments: Unplanned, database-related outages were common, emphasizing the need for a safer way to handle database changes.
  3. Compliance Requirements: The team needed to document and review schema changes to maintain governance standards, but the lack of automation made this challenging.

Determined to bridge this gap and establish a safer, more efficient solution for developers, Unico's Platform Engineering team began researching the best database migration tools. Thiago da Silva Conceição, a Site Reliability Engineer (SRE) in the team, took the lead on this technical evaluation.

The Challenge of Managing Database Schema Migrations​

Traditional schema migration tools posed significant challenges for Unico. As Thiago noted, "Automation with our previous tool was never easy to maintain. It followed a specific pattern, and only a few team members were familiar with it." The team faced limitations that affected usability, integration, and overall productivity:

  • Limited Usability and Adoption: The tool required extensive knowledge, and documentation was limited, making it difficult to adopt across the team.
  • Lack of Automation Support: Automated migrations and reliable error-handling were lacking, leading to inconsistent deployments and a need for additional manual oversight.
  • Compliance Difficulties: The absence of automated documentation and governance features made it challenging to maintain and provide records for audit and compliance requirements.

With these challenges in mind, Unico needed a solution that could offer usability, integration with existing tools, and comprehensive metrics to continuously monitor and improve database migrations.

Evaluating Alternatives and Choosing Atlas​

"In the end, choosing Atlas was easy. It is a simple, yet powerful tool, offering a significant impact with many ready-made features that would require customization in other tools."
– Thiago Silva Conceição, SRE, Unico

During the search for a new solution, Unico's engineering team prioritized several criteria:

  1. Ease of Use: The tool needed to be straightforward and accessible for all developers, not just a few specialized team members.
  2. Integration and Compatibility: It had to fit naturally with Unico's technology stack, particularly with Terraform, which was already in heavy use.
  3. Metrics and Alerts: Real-time insights and alerts were essential to monitor migrations effectively.

Thiago compared a few traditional solutions before selecting Atlas. Atlas's declarative schema-as-code approach, along with its HCL compatibility and robust cloud management, aligned well with Unico's needs. It allowed the team to automate migrations, eliminate manual errors, and centralize schema management, creating a unified experience across their projects. "Atlas allowed us to keep the DDL in HCL while still supporting SQL scripts for specific use cases through its versioning model," Thiago shared.

One Migration Tool to Rule Them All​


Another key priority for Unico's Platform Engineering team was standardization. With multiple teams working across diverse programming languages and databases, the Platform Engineering team needed a unified migration tool that would work for a wide array of use cases, without sacrificing ease of use or reliability. To simplify the developer experience and streamline internal operations, they aimed to find a single solution that could support all teams consistently and seamlessly.

Atlas emerged as the ideal fit by providing plugin support for various databases, ORMs and integrations, making it a flexible tool for Unico's entire tech stack. The ability to standardize migration management with Atlas allowed Unico's Platform Engineering team to enforce consistent practices across all projects. Atlas became the single source of truth for schema management, offering a centralized framework for building policies, integrating into CI/CD pipelines, and supporting developers.

By implementing Atlas as a standard, the Platform Engineering team eliminated the need to train on or maintain multiple tools, reducing complexity and operational overhead. Now, Unico's developers enjoy a unified experience, and the platform team has only one tool to integrate, support, and scale as the company grows.

Implementation: A Smooth Migration Process​

The migration to Atlas was seamless, with no need to recreate migration files or impose rigid formats. "We simply imported the schemas from the production database, keeping the process straightforward and efficient," Thiago said. The team was able to quickly onboard Atlas and start seeing results, with pre-built actions in Atlas Cloud providing essential metrics, notifications, and dashboards for tracking progress.

This success reinforced the decision to adopt Atlas:

"Month over month, we see smaller and smaller incidents."

– Luiz Casali, Senior Engineering Manager

Outcome: Faster Development Cycles, Increased Reliability, and Enhanced Compliance​

With Atlas in place, Unico's Platform Engineering team has achieved several key outcomes:

  • Accelerated Development Cycles: Automation of database migrations streamlined the development process, enabling faster iterations and more rapid deployments.
  • Increased Reliability: Atlas's linting and testing tools reduced migration errors and enhanced deployment stability, contributing to Unico's goal of reducing incidents.
  • Enhanced Compliance: Atlas's automated documentation ensures that each migration step is recorded, simplifying compliance by providing a clear, auditable record of all schema changes.

By automating these processes, the team has successfully reduced manual work and achieved a more predictable migration workflow. Now, as Unico grows, they can be confident that their migration practices will scale smoothly, keeping operational costs in check without sacrificing speed or reliability.


Getting Started​

Atlas brings the declarative mindset of infrastructure-as-code to database schema management, similar to Terraform, but focused on databases. Using its unique schema-as-code approach, teams can quickly inspect existing databases and get started with minimal setup.

Like Unico, we recommend anyone looking for a schema migration solution to get started with Atlas by trying it out on one or two small projects. Dive into the documentation, join our Discord community for support, and start managing your schemas as code with ease.