Skip to main content

25 posts tagged with "atlas"

View All Tags

Atlas is now SOC2 Certified for 2024

· 3 min read
Rotem Tamir
Building Atlas

Today we are happy to announce that Atlas has achieved SOC2 compliance for the third year in a row. This is an important milestone for us, demonstrating our commitment to providing a solid infrastructure for our users and customers.

soc2-atlas-ariga-compliance

As a company that is trusted by its customers to handle mission-critical databases, we are committed to ensuring the highest standards of security, availability, and confidentiality. Achieving SOC 2 compliance demonstrates our dedication to safeguarding customer data, maintaining trust, and adhering to industry best practices.

Our commitment to process only metadata

commitment to not process records from your database

Control 70. Our commitment to not process records from your database. Screenshot from Ariga's full SOC2 audit report.

As anyone in the compliance domain knows, audits are about setting a high bar and then building controls to ensure they are met throughout our day-to-day operations. While these audits often address external requirements, such as regulatory mandates, they also serve as an opportunity to build trust with customers by addressing critical areas of concern within the company.

This year, we chose to use our audit process to address a common question from customers regarding how we handle their data. As a schema management tool, Atlas interacts with critical and sensitive data assets of our customers. This naturally raises concerns for compliance and security teams, as they are entrusted with protecting data on behalf of their own customers.

Atlas, and Atlas Cloud, our SaaS offering, are designed with a foundational principle: we do not store, send, or process user data—only metadata. We have consistently communicated this commitment to compliance teams, and after thorough discussions and reviews, they have been satisfied with this approach. However, this year we decided to formalize this commitment within our compliance framework.

As part of our SOC 2 audit, we introduced Control #70 which states: "The company does not process or store records from the customer's managed databases, but only handles information schema and metadata related to them."

By incorporating this control, we have established a clear, auditable process that reinforces our promise to our customers and ensures that this principle remains at the core of how we operate moving forward.

To summarize, achieving SOC 2 compliance for the third year reflects our core belief as engineers: security, privacy, and automation should drive auditable processes. SOC 2 provides the framework to solidify these principles into a trusted, transparent process for our customers.

If your compliance team, needs access to our report or other documents, drop us a line!

Case Study: How Unico's Platform Engineering Team Closed the DevOps/Databases Gap Using Atlas

· 6 min read
Rotem Tamir
Building Atlas

"Month over month, we see smaller and smaller incidents.", Luiz Casali, Sr Engineering Manager

Company Background

Unico is a leading digital identity technology provider in Brazil, developing secure and efficient digital identity solutions for businesses and individuals. Their platform helps organizations streamline identity verification processes, delivering a seamless user experience while enhancing security and reducing fraud.

The Missing Layer: Tackling Complexity, Outages, and Risks in Database Schema Management

At Unico, the Platform Engineering team, led by Luiz Casali, is focused on improving developer productivity. "Reducing complexity for developers is one of our top priorities," Luiz explained.

Unico's Platform team had previously built solutions to automate CI/CD workflows for code using Bazel and GitHub Actions and for infrastructure using Terraform and Atlantis. The team was missing a standardized solution for managing database schema changes.

cicd-database-gap

This gap introduced several pressing issues:

  1. Risky Manual Processes: Database schema changes (migrations) were performed manually, increasing the chance of human error.
  2. Unreliable Deployments: Unplanned, database-related outages were common, emphasizing the need for a safer way to handle database changes.
  3. Compliance Requirements: The team needed to document and review schema changes to maintain governance standards, but the lack of automation made this challenging.

Determined to bridge this gap and establish a safer, more efficient solution for developers, Unico's Platform Engineering team began researching the best database migration tools. Thiago da Silva Conceição, a Site Reliability Engineer (SRE) in the team, took the lead on this technical evaluation.

The Challenge of Managing Database Schema Migrations

Traditional schema migration tools posed significant challenges for Unico. As Thiago noted, "Automation with our previous tool was never easy to maintain. It followed a specific pattern, and only a few team members were familiar with it." The team faced limitations that affected usability, integration, and overall productivity:

  • Limited Usability and Adoption: The tool required extensive knowledge, and documentation was limited, making it difficult to adopt across the team.
  • Lack of Automation Support: Automated migrations and reliable error-handling were lacking, leading to inconsistent deployments and a need for additional manual oversight.
  • Compliance Difficulties: The absence of automated documentation and governance features made it challenging to maintain and provide records for audit and compliance requirements.

With these challenges in mind, Unico needed a solution that could offer usability, integration with existing tools, and comprehensive metrics to continuously monitor and improve database migrations.

Evaluating Alternatives and Choosing Atlas

"In the end, choosing Atlas was easy. It is a simple, yet powerful tool, offering a significant impact with many ready-made features that would require customization in other tools."
Thiago Silva Conceição, SRE, Unico

During the search for a new solution, Unico's engineering team prioritized several criteria:

  1. Ease of Use: The tool needed to be straightforward and accessible for all developers, not just a few specialized team members.
  2. Integration and Compatibility: It had to fit naturally with Unico's technology stack, particularly with Terraform, which was already in heavy use.
  3. Metrics and Alerts: Real-time insights and alerts were essential to monitor migrations effectively.

Thiago compared a few traditional solutions before selecting Atlas. Atlas's declarative schema-as-code approach, along with its HCL compatibility and robust cloud management, aligned well with Unico's needs. It allowed the team to automate migrations, eliminate manual errors, and centralize schema management, creating a unified experience across their projects. "Atlas allowed us to keep the DDL in HCL while still supporting SQL scripts for specific use cases through its versioning model," Thiago shared.

One Migration Tool to Rule Them All

schema-migration-tool

Another key priority for Unico's Platform Engineering team was standardization. With multiple teams working across diverse programming languages and databases, The Platform Engineering team needed a unified migration tool that would work for a wide array of use cases, without sacrificing ease of use or reliability. To simplify the developer experience and streamline internal operations, they aimed to find a single solution that could support all teams consistently and seamlessly.

Atlas emerged as the ideal fit by providing plugin support for various databases, ORMs and integrations, making it a flexible tool for Unico's entire tech stack. The ability to standardize migration management with Atlas allowed Unico's Platform Engineering team to enforce consistent practices across all projects. Atlas became the single source of truth for schema management, offering a centralized framework for building policies, integrating into CI/CD pipelines, and supporting developers.

By implementing Atlas as a standard, the Platform Engineering team eliminated the need to train on or maintain multiple tools, reducing complexity and operational overhead. Now, Unico's developers enjoy a unified experience, and the platform team has only one tool to integrate, support, and scale as the company grows.

Implementation: A Smooth Migration Process

The migration to Atlas was seamless, with no need to recreate migration files or impose rigid formats. "We simply imported the schemas from the production database, keeping the process straightforward and efficient," Thiago said. The team was able to quickly onboard Atlas and start seeing results, with pre-built actions in Atlas Cloud providing essential metrics, notifications, and dashboards for tracking progress.

This success reinforced the decision to adopt Atlas:

"Month over month, we see smaller and smaller incidents.

— Luiz Casali, Senior Engineering Manager

Outcome: Faster Development Cycles, Increased Reliability, and Enhanced Compliance

With Atlas in place, Unico's Platform Engineering team has achieved several key outcomes:

  • Accelerated Development Cycles: Automation of database migrations streamlined the development process, enabling faster iterations and more rapid deployments.
  • Increased Reliability: Atlas's linting and testing tools reduced migration errors and enhanced deployment stability, contributing to Unico's goal of reducing incidents.
  • Enhanced Compliance: Atlas's automated documentation ensures that each migration step is recorded, simplifying compliance by providing a clear, auditable record of all schema changes.

By automating these processes, the team has successfully reduced manual work and achieved a more predictable migration workflow. Now, as Unico grows, they are assured that their migration practices will scale smoothly, maintaining operational costs without sacrificing speed or reliability.

ci-cd-database-gap-close

Getting Started

Atlas brings the declarative mindset of infrastructure-as-code to database schema management, similar to Terraform, but focused on databases. Using its unique schema-as-code approach, teams can quickly inspect existing databases and get started with minimal setup.

Like Unico, we recommend anyone looking for a schema migration solution to get started with Atlas by trying it out on one or two small projects. Dive into the documentation, join our Discord community for support, and start managing your schemas as code with ease.

New Release: Approval flows for Kubernetes, Prisma support, and more!

· 5 min read
Rotem Tamir
Building Atlas

Hey everyone!

We are excited to announce the release of Atlas v0.29, which continues our journey to make working with database easier, safer and more reliable. This release includes several significant updates that we are happy to share with you:

  • Approval flows for the Kubernetes Operator: Moving to a declarative way of managing database schemas has plenty of advantages, but many teams want to see and approve changes before they are applied. Doing this from the CLI is straightforward, but until recently it was not easy to provide this experience in Kubernetes-based workflows.

    With the new approval flows, you can now review and approve schema migrations seamlessly, ensuring database changes are well-governed while maintaining deployment velocity.

  • Prisma support: Following our integrations with some of the most popular ORMs in our industry, we are happy to announce our official guide on using Atlas to manage database schemas for Prisma projects.

  • GitLab CI/CD Components: Integrating GitLab CI with Atlas just got much easier with the new GitLab CI/CD components.

  • IntelliJ Plugin: Our IntelliJ plugin has been upgraded with code folding, inline SQL syntax highlighting and suggestions, and syntax highlighting within heredoc clauses.

  • Timeseries Engine support for ClickHouse: ClickHouse users can now explore beta support for timeseries data in Atlas.

  • Constraint Triggers support for PostgreSQL: PostgreSQL users can now manage constraint triggers with Atlas.

Approval Flows for the Kubernetes Operator

Moving to a declarative way of managing database schemas has plenty of advantages, but many teams want to see and approve changes before they are applied.

Providing flows for keeping a human-in-the-loop from the CLI is straightforward, but until recently it was not easy to provide this experience in Kubernetes-based workflows.

Following our recent KubeCon session, the Atlas Operator now includes approval flows for declarative schema migrations, making database changes in Kubernetes safer:

  1. Pre-approvals - with pre-approvals, teams enhance their CI pipelines to detect schema changes and integrate their planning and approval in the code review process. The approved policies are then applied to the database by the operator.
  2. Ad-hoc Approvals - with ad-hoc approvals, the operator pauses the migration process and waits for human approval before applying the schema change. This is useful for schema changes that were not approved in advance or for projects that do not have a strict pre-approval policy.

Prisma Support

Following popular demand from the Atlas community, we are excited to announce our official guide on using Atlas to manage database schemas for Prisma projects.

Prisma already has an excellent migration tool called prisma migrate, so why would you want to use Atlas with Prisma? In many cases, Prisma's migrate indeed checks all the boxes for managing your database schema. However, being tightly coupled with the Prisma ORM, some use cases might require a dedicated schema management tool that can be used across different ORMs and frameworks.

This guide shows how to load your existing prisma.schema file into Atlas, manage your schema changes, and apply them to your database using the Atlas CLI.

Interested in learning more? Read the guide!

Gitlab CI/CD Components

Integrating GitLab CI with Atlas just got much easier with our new GitLab CI/CD components.

GitLab CI/CD components are reusable YAML templates that you can use in your GitLab CI/CD pipelines to automate workflows within your GitLab project. Our newly published components are designed to simplify the process of integrating Atlas with GitLab CI/CD pipelines, enabling you to automate database schema management tasks with ease.

Want to learn more? Read the tutorial.

IntelliJ Plugin

Our IntelliJ plugin has been upgraded with code folding, inline SQL syntax highlighting and suggestions, and syntax highlighting within heredoc clauses. Our goal with these efforts is to make writing real world database applications with Atlas easier and more enjoyable.

If you use JetBrains editors, be sure to download the most recent version.

Timeseries Data Support for ClickHouse

ClickHouse recently added support for an experimental TimeSeries engine, which is designed to optimize storage and query performance for time-series data.

Atlas now supports this experimental feature, enabling ClickHouse users to manage schemas for their time-series data tables with ease:

You can simply define a TimeSeries table in your Atlas schema.

table "example" {
schema = schema.public
engine = TimeSeries
}

PostgreSQL Constraint Triggers

The CONSTRAINT TRIGGER is a PostgreSQL extension of the SQL standard, which works like a regular trigger but allows its execution time to be dynamically controlled using the SET CONSTRAINTS command.

Starting with this version, users can define constraint triggers, and Atlas will manage their lifecycles. Their definitions are also supported in the Atlas HCL syntax:

trigger "users_insert" {
on = table.users
constraint = true
before {
insert = true
}
// ...
}

trigger "groups_insert" {
on = table.users
constraint = true
deferrable = INITIALLY_DEFERRED
before {
insert = true
}
// ...
}

Read more about the Storage Engine

Wrapping Up

We hope you enjoy the new features and improvements. As always, we would love to hear your feedback and suggestions on our Discord server.

The Hard Truth about GitOps and Database Rollbacks

· 16 min read
Rotem Tamir
Building Atlas

Prepared for KubeCon North America 2024

Introduction

For two decades now, the common practice for handling rollbacks of database schema migrations has been pre-planned "down migration scripts". A closer examination of this widely accepted truth reveals critical gaps that result in teams relying on risky, manual operations to roll back schema migrations in times of crisis.

In this post, we show why our existing tools and practices cannot deliver on the GitOps promise of "declarative" and "continuously reconciled" workflows and how we can use the Operator Pattern to build a new solution for robust and safe schema rollbacks.

The Undo Button

One of the most liberating aspects of working on digital products is the ability to roll back changes. The Undo Button, I would argue, is one of the most transformative features of modern computing.

Correcting mistakes on typewriters was arduous. You would roll the carriage back and type over any errors, leaving messy, visible corrections. For bigger changes, entire pages had to be retyped. Correction fluid like whiteout offered a temporary fix, but it was slow and imprecise, requiring careful application and drying time.

Digital tools changed everything. The Undo Button turned corrections into a simple keystroke, freeing creators to experiment without fear of permanent mistakes. This shift replaced the stress of perfection with the freedom to try, fail, and refine ideas.

Rollbacks and Software Delivery

When it comes to software delivery, having an Undo Button is essential as well. The ability to roll back changes to a previous state is a critical safety net for teams deploying new features, updates, or bug fixes. Specifically, rollbacks impact one of the key metrics of software delivery: Mean Time to Recovery (MTTR).

MTTR is a measure of how quickly a system can recover from failures. When a deployment fails, or a bug is discovered in production, teams generally have two options: triage and fix the issue (roll forward), or roll back to a previous known stable state.

When the fix to an issue is not immediately clear, or the issue is severe, rolling back is often the fastest way to restore service. This is why having a reliable rollback mechanism is crucial for reducing MTTR and ensuring high availability of services.

How are rollbacks even possible?

Undoing a change in a local environment like a word processor is straightforward. There are multiple ways to implement an Undo Button, but they all rely on the same basic principle: the system keeps track of every change made and can revert to a previous state.

In a distributed system like modern, cloud-native applications, things are not so simple. Changes are made across multiple components with complex dependencies and configurations.

The key capability that enables rolling back changes is described in the seminal book, "Accelerate: The Science of Lean Software and DevOps". The authors identify "Comprehensive Configuration Management" as one of the key technical practices that enables high performance in software delivery:

"It should be possible to provision our environments and build, test, and deploy our software in a fully automated fashion purely from information stored in version control.” 1

In theory, this means that if we can store everything there is to know about our system in version control, and have an automated way to apply these changes, we should be able to roll back to a previous state by simply applying a previous commit.

GitOps and Rollbacks

The principle of "Comprehensive Configuration Management" evolved over the years into ideas like "Infrastructure as Code" and "GitOps". These practices advocate for storing all configuration and infrastructure definitions in version control in a declarative format, and using automation to apply these changes to the system.

Projects like ArgoCD and Flux have popularized the GitOps approach to managing Kubernetes clusters. By providing a structured way to define the desired state of your system in Git (e.g., Kubernetes manifests), and automatically reconciling the actual state with it, GitOps tools provide a structured and standardized way to manage satisfy this principle.

On paper, GitOps has finally brought us a working solution for rollbacks. Revert the commit that introduced the issue, and all of your problems are gone!

Problem solved. End of talk. Right?

The Hard Truth

Teams that have tried to fully commit to the GitOps philosophy usually find that the promise of "declarative" and "continuously reconciled" workflows is not as straightforward as it seems. Let's consider why.

Declarative resource management works exceptionally well for stateless resources like containers. The way Kubernetes handles deployments, services, and other resources is a perfect fit for GitOps. Consider how a typical deployment rollout works:

  1. A new replica set is created with the new version of the application.
  2. Health checks are performed to ensure the new version is healthy.
  3. Traffic is gradually shifted to healthy instances of the new version.
  4. As the new version proves stable, the old version is scaled down and eventually removed.

But will this work for stateful resources like databases? Suppose we want to change the schema of a database. Could we apply the same principles to roll back a schema migration? Here's what it would look like:

  1. A new database is spun up with the new schema.
  2. Health checks are performed to ensure the new schema is healthy.
  3. Traffic is gradually shifted to the new database instance.
  4. The old database is removed.

This would get the job done... but you would probably find yourself out of a job soon after.

Stateless resources are really great to manage because we can always throw out whatever isn't working for us and start fresh. But databases are different. They are stateful, and they are comprised not only of a software component (the database engine), the configuration (server parameters and schema), but also the data itself. The data itself cannot, by definition, be provisioned from version control.

Stateful resources like databases require a different approach to manage changes.

Up and Down Migrations

The common practice for managing schema changes in databases is to use "up" and "down" migration scripts in tandem with a migration tool (like Flyway or Liquibase). The idea is simple: when you want to make a change to the schema, you write a script that describes how to apply the change (the "up" migration). Additionally, you write a script that describes how to undo the change (the "down" migration).

For example, suppose you wanted to add a column named "short_bio" to a table named "users". Your up migration script might look like this:

ALTER TABLE users
ADD COLUMN short_bio TEXT;

And your down migration script might look like this:

ALTER TABLE users DROP COLUMN short_bio;

In theory, this concept is sound and satisfies the requirements of "Comprehensive Configuration Management". All information needed to apply and roll back the change is stored in version control.

Theory, once again, is quite different from practice.

The Myth of Down Migrations

After interviewing hundreds of engineers on this topic, we found that despite being a widely accepted concept, down are rarely used in practice. Why?

Naive assumptions

When you write a down migration, you are essentially writing a script that will be executed in the future to revert the changes you are about to make. By definition, this script is written before the "up" changes are applied. This means that the down migration is based on the assumption that the changes will be applied correctly.

But what if they are not?

Suppose the "up" migration was supposed to add two columns, the down file would be written to remove these two columns. But what if the migration was partially applied and only one column was added? Running the down file would fail, and we would be stuck in an unknown state.

Yes, some databases like PostgreSQL support transactional DDL, which means that if the migration fails, the changes are rolled back, and you end up with a state this consistent with a specific revision. But even for PostgreSQL, some operations cannot be run in a transaction, and the database can end up in an inconsistent state.

For MySQL, which does not support transactional DDL, the situation is even worse. If a migration fails halfway through, you are left with only a partially applied migration and no way to roll back.

Data loss

When you are working on a local database, without real traffic, having the up/down mechanism for migrations might feel like hitting Undo and Redo in your favorite text editor. But in a real environment with real traffic, it is not the case.

If you successfully rolled out a migration that added a column to a table, and then decided to revert it, its inverse operation (DROP COLUMN) does not merely remove the column. It deletes all the data in that column. Re-applying the migration would not bring back the data, as it was lost when the column was dropped.

For this reason, teams that want to temporarily deploy a previous version of the application, usually do not revert the database changes, because doing so will result in data loss for their users. Instead, they need to assess the situation on the ground and figure out some other way to handle the situation.

Incompatibility with modern deployment practices

Many modern deployment practices like Continuous Delivery (CD) and GitOps advocate for the software delivery process to be automated and repeatable. This means that the deployment process should be deterministic and should not require manual intervention. A common way of doing this is to have a pipeline that receives a commit, and then automatically deploys the build artifacts from that commit to the target environment.

As it is very rare to encounter a project with a 0% change failure rate, rolling back a deployment is something everyone needs to be prepared for.

In theory, rolling back a deployment should be as simple as deploying the previous version of the application. When it comes to versions of our application code, this works perfectly. We pull and deploy the container image corresponding to the previous version.

This strategy does not work for the database, for two reasons:

  1. For most migration tools, down or rollback is a separate command that needs to be executed specifically. This means that the deployment machinery needs to know what the current version of the target database is in order to decide whether to migrate up or down.
  2. When we pull artifacts from a previous version, they do not contain the down files that are needed to revert the database changes back to the necessary schema - they were only created in a future commit!

These gaps mean that teams are left with two options: either they need to manually intervene to roll back the database changes, or they need to develop a custom solution that can handle the rollback in an automated way.

Down Migrations and GitOps

Going back to our main theme of exploring whether database rollbacks and GitOps can be compatible, let's expand on this last point.

The ArgoCD documentation suggests that the way to integrate schema migrations is to use a Kubernetes Job that executes your migration tool of choice, and to annotate the Job as a PreSync hook:

This image will typically be built as part of your CI/CD pipeline, and will contain the migration tool and the migration scripts for the relevant commit or release:

apiVersion: batch/v1
kind: Job
metadata:
name: db-migration
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
template:
spec:
containers:
- name: migrate
image: your-migration-image:{{ .Values.image.tag }} # Example using Helm values
restartPolicy: Never

When ArgoCD detects a new commit in the Git repository, it will create a new Job that runs the migration tool. If the migration is successful, the Job will complete successfully, and the new version of the application will be deployed.

This will work for the up migration. But what happens when you need to roll back?

Teams commonly hit the two issues we mentioned above:

  1. The deployment machinery does not know what the current version of the target database is, and therefore cannot decide whether to migrate up or down.

    Unless a team has carefully thought about this and implemented a mechanism inside the image to decide what to do, the deployment machinery will always migrate up.

  2. The image that is pulled for the rollback does not contain the down files that are needed to revert the database changes back to the necessary schema. Most migration tools will silently keep the database in the current state.

What are the implications?

  1. The database is no longer in sync with the current Git commit, violating all GitOps principles.

  2. Teams that do need to roll back the database changes are left with a manual process that requires intervention and coordination.

Operators: The GitOps Way

The Operator Pattern is a Kubernetes-native way to extend the Kubernetes API to manage additional resources. Operators typically ship two main components: a Custom Resource Definition (CRD) that defines the new resource type, and a controller that watches for changes to these resources and takes action accordingly.

The Operator Pattern is a perfect fit for managing stateful resources like databases. By extending the Kubernetes API with a new resource type that represents a database schema, we can manage schema changes in a GitOps-friendly way. A specialized controller can watch for changes to these resources and apply the necessary changes to the database in a way that a naive Job cannot.

The Atlas Operator

The Atlas Operator is a Kubernetes Operator that enables you to manage your database schemas natively from your Kubernetes cluster. Built on Atlas, a database schema-as-code tool (sometimes called "like Terraform for databases"), the Atlas Operator extends the Kubernetes API to support database schema management.

Atlas has two core capabilities that are helpful to building a GitOps-friendly schema management solution:

  1. A sophisticated migration planner that can generates migrations by diffing the desired state of the schema with the current state of the database.
  2. A migration analyzer that can analyze a migration and determine whether it is safe to apply and surface risks before the migration is applied.

Declarative vs. Versioned Flows

Atlas supports two kinds of flows for managing database schema changes: declarative and versioned. They are reflected in the two main resources that the Atlas Operator manages:

Declarative: AtlasSchema

The first resource type is AtlasSchema which is used to employ the declarative flow. With AtlasSchema, you define the desired state of the database schema in a declarative way, and the connection string to the target database.

The Operator is then responsible for generating the necessary migrations to bring the database schema to the desired state, and applying them to the database. Here is an example of an AtlasSchema resource:

apiVersion: db.atlasgo.io/v1alpha1
kind: AtlasSchema
metadata:
name: myapp
spec:
url: mysql://root:pass@mysql:3306/example
schema:
sql: |
create table users (
id int not null auto_increment,
name varchar(255) not null,
email varchar(255) unique not null,
short_bio varchar(255) not null,
primary key (id)
);

When the AtlasSchema resource is applied to the cluster, the Atlas Operator will calculate the diff between the database at url and the desired schema, and generate the necessary migrations to bring the database to the desired state.

Whenever the AtlasSchema resource is updated, the Operator will recalculate the diff and apply the necessary changes to the database.

Versioned: AtlasMigration

The second resource type is AtlasMigration which is used to employ the versioned flow. With AtlasMigration, you define the exact migration that you want to apply to the database. The Operator is then responsible for applying any necessary migrations to bring the database schema to the desired state.

Here is an example of an AtlasMigration resource:

apiVersion: db.atlasgo.io/v1alpha1
kind: AtlasMigration
metadata:
name: atlasmigration-sample
spec:
url: mysql://root:pass@mysql:3306/example
dir:
configMapRef:
name: "migration-dir" # Ref to a ConfigMap containing the migration files

When the AtlasMigration resource is applied to the cluster, the Atlas Operator will apply the migrations in the directory specified in the dir field to the database at url. Similarly to classic migration tools, Atlas uses a metadata table on the target database to track which migrations have been applied.

Rollbacks with the Atlas Operator

The Atlas Operator is designed to handle rollbacks in a GitOps-friendly way. This is where the power of the Operator Pattern really shines as it can make nuanced and intelligent decisions about how to handle changes to the managed resources.

To roll back a schema change in an ArgoCD-managed environment, you would simply revert the AtlasSchema or AtlasMigration resource to a previous version. The Atlas Operator would then analyze the changes and generate the necessary migrations to bring the database schema back to the desired state.

Advantages of the Operator Pattern

In the discussion above we kept talking about edge cases that arise when rolling back database schema changes, and concluded that they require manual consideration and intervention. What if we could automate this process?

The Operator Pattern is all about codifying operational knowledge into software. Let's consider how the Operator Pattern can be used to address the challenges we discussed:

  1. Understanding intent. The Operator can discern between up and down migrations. By comparing between the current state of the database and the desired version, the operator decides whether to go up or down.

  2. Having access to the necessary information. Contrary to a Job that only has access to the image it was built with, the Operator stores metadata about the last execution as a ConfigMap via the Kubernetes API. This metadata enables the operator to migrate down even though the current image does not information about the current state.

  3. Intelligent Diffing. Because the Operator is built on top of Atlas's Schema-as-Code engine, it can calculate correct migrations even if the database is in an inconsistent state.

  4. Safety checks. The Operator can analyze the migration and determine whether it is safe to apply. This is a critical feature that can prevent risky migrations from being applied. Depending on your policy, it can even require manual approval for specific types of changes!

Conclusion

In this talk, we explored the challenges of rolling back database schema changes in a GitOps environment. We discussed the limitations of the traditional up/down migration approach, and how the Operator Pattern can be used to build a more robust and automated solution.

If you have any questions or would like to learn more, please don't hesitate to reach out to us on our Discord server.


1: Forsgren, Nicole, Jez Humble, and Gene Kim. Accelerate: The Science of Lean Software and DevOps. IT Revolution Press, 2018.

Deep Dive into Declarative Migrations

· 15 min read
Rotem Tamir
Building Atlas

Prepared for an Atlas Community Webinar, October 2024

Introduction

In recent years, the shift to declarative resource management has transformed modern infrastructure practices. Groundbreaking projects like Terraform, for infrastructure as code, and Kubernetes, for container orchestration, have exemplified the power of this approach. By focusing on what the end state should be rather than how to achieve it, declarative methods make systems more scalable, predictable, and easier to maintain—essential qualities for handling today's complex environments.

However, when it comes to managing database schemas, the industry has been slow to adopt declarative workflows. Atlas was created almost four years ago to address this gap.

Atlas supports two kinds of database schema migration workflows:

  • Versioned Migrations - each change to the database is described as a migration script, essentially a SQL file containing the SQL commands to apply the change. Migrations are versioned and applied in order.

    Contrary to most existing migration tools, Atlas relies on users defining the desired state of the database schema in code Atlas generates the necessary migration scripts to bring the database to the desired state.

  • Declarative Migrations - the database schema is described in a declarative way, and changes are applied by comparing the desired schema with the current schema and generating the necessary SQL commands to bring the database to the desired state.

To date, most teams that used Atlas in production have used it's versioned migration workflow which synthesizes the simplicity and explicitness of classic migration tools with the benefit of automated migration generation.

Recent improvements to Atlas have addressed many of the challenges and concerns teams have expressed around using declarative migrations in production in the past. In this post, we'll take a deep dive into the declarative migration workflow

Declarative Migrations in Production?

Declarative migrations are a powerful concept, and it may surprise you, but, they are not new. In previous incarnations, they were often referred to as "state based" migrations, but they were never regarded as a production-grade solution. A quick look around the documentation of popular tools will reveal that they are often discouraged:

Hibernate ORM:

Although the automatic schema generation is very useful for testing and prototyping purposes, in a production environment, it’s much more flexible to manage the schema using incremental migration scripts.

EF Core Docs:

EnsureCreated will create the database if it doesn't exist and initialize the database schema. If any tables exist (including tables for another DbContext class), the schema won't be initialized.

Liquibase:

State-based deployments offer a quick, simplistic way to determine change scripts and are useful for analysis and drift detection. Due to its simplicity, though, it’s unreliable and risky to use as a primary change management approach.

Challenges of Declarative Migrations

Having interviewed many engineers and architects about their experience with schema management, we've identified several concerns that teams have with declarative migrations:

Approval Process

Versioned migrations serve a dual purpose, they are both an explicit set of instructions for the migration tool and a source-controlled artifact that can be examined and approved by human reviewers.

For many teams, the idea of only versioning the desired state of the database schema and not the actual migration scripts means forfeiting their main mechanism for review and approval of schema changes. This is a valid concern, and any production-grade declarative migration tool should provide a way to review and approve changes before they are applied.

Safety

Schema changes are a risky business, if you are not careful you can easily find yourself in a situation where you've accidentally deleted data, locked a table for writes for a long period of time or introduced a production outage via a breaking schema change.

By definition, migrations generated by declarative workflows are non-deterministic, as it depend as they depend on the current state of the database. This also one of their main advantages, as they can handle unexpected circumstances and reconcile drift automatically. However, many teams are uncomfortable with the idea of not knowing exactly what will be applied to their database.

As one user once wrote us on HackerNews:

I'll never be comfortable with any tool for that automatically generates schema changes, as I'm just never sure at what point it decides to delete parts of my prod db.

Customization

Another common concern is the lack of control over the generated migrations. Migration planning can be thought of as a navigation problem, how to get from point A to point B. Declarative migrations are like using a GPS, you tell it where you want to go, and it figures out the best route. But what if you want to travel through some specific coordinates?

When it comes to database schema management, there may be multiple ways to get from the current to the desired state, some of them completely undesirable. For instance, a simple column rename can also be achieved by dropping the column and creating a new one (thereby losing all the data in the column) or with PostgreSQL index creation may lock a table if not used with the CONCURRENTLY option.

Tools like Kubernetes and Terraform don't let the user specify the exact steps to get to the desired state, or require the development of custom plugins (such as Terraform Providers or Kubernetes Controllers) to achieve this. For database schema changes, this level of customization is often necessary.

Production-grade Declarative Migrations

Atlas was designed with these concerns in mind and in the next section, we'll explore how Atlas addresses these concerns to provide teams with a production-grade declarative migration workflow.

Pre-planning Migrations

To address users' need for knowing the exact migrations that will be applied to their database, Atlas provides the schema plan command. Users may run a command similar to:

atlas schema plan --env local --save

Atlas will calculate the plan, analyze it and present the user with a detailed report of the migrations that will be applied and their potential impact on the database:

Planning migration from local database to file://./schema.pg.hcl (1 statement in total):

-- create "t1" table:
-> CREATE TABLE "t1" (
"c1" text NOT NULL
);

-------------------------------------------

Analyzing planned statements (1 in total):

-- no diagnostics found

-------------------------
-- 19.389709ms
-- 1 schema change
? Approve or abort the plan:
▸ Approve and save
Abort

After saving, a "plan file" file is created:

plan "20241029120559" {
from = "PKrr2qiovsNurI70kgT+AW3rInDu74E1PSOSHgh8CrA="
to = "Szdpl/ADvG8kKrXBERSxjvU/I1nprpBejpPX7fLoWmc="
migration = <<-SQL
-- Create "t1" table
CREATE TABLE "t1" (
"c1" text NOT NULL
);
SQL
}

The plan block contains three relevant fields:

  • from - A "fingerprint" hash which encodes the "current" state of the database.
  • to - A similar hash for the "desired" state of the database.
  • migration - The SQL commands that will be applied to the database to bring it from the from state to the to state.

The from and to fields are significant because they allow Atlas to guarantee that the plan will only be executed if the database is in the known from state. This is a crucial safety feature that enables the deterministic execution of the plan.

The plan file can then be applied to the database using the schema apply command:

atlas schema apply --env local --plan file://20241029120559.plan.hcl

Customization

As we previously mentioned, there are often multiple ways to get from the current to the desired state of the database. To provide users with the ability to customize the generated migrations, Atlas provides two ways to control the process.

Diff Policies

Diff policies can be defined on the project level to control how Atlas generates migrations. For instance, a project may define a policy that indexes must be created or dropped concurrently:

env "local" {
diff {
// By default, indexes are not added or dropped concurrently.
concurrent_index {
add = true
drop = true
}
}
}

In some cases, users may want to tell Atlas to skip DROP operations completely:

env "local" {
diff {
skip {
drop_schema = true
drop_table = true
}
}
}
Editing Plans

In some cases, users may want to edit the generated plan before applying it. By using the --edit flag on the schema plan command, Atlas will open the plan in the user's default editor for review and modification:

When the user runs:

atlas schema plan --env local --edit --save

Atlas will calculate the plan, and open it in the user's default editor:

-- Create "t1" table
CREATE TABLE "t1" (
"c1" text NOT NULL
);

The user may edit the plan as needed, adding or removing statements, and then save the plan.

-- Create "t1" table
CREATE TABLE "t1" (
"c1" text NOT NULL
);
+ INSERT INTO "t1" ("c1") VALUES ('hello');

Atlas verifies that the edited plan is still valid and that the resulting schema is equivalent to the desired schema.

Suppose a user makes a change that is not in line with the desired schema, for instance, adding a column that is not in the desired schema:

-- Create "t1" table
CREATE TABLE "t1" (
"c1" text NOT NULL,
+ "c2" text NOT NULL
);

Atlas will detect the drift and present the user with an error message:

Abort: the planned state does not match the desired state after applying the file:

--- planned state
+++ desired state
@@ -1,3 +1,4 @@
CREATE TABLE "t1" (
- "c1" text NOT NULL
+ "c1" text NOT NULL,
+ "c2" text NOT NULL
);

Safety Checks

In addition to its advanced diffing capabilities, Atlas contains a migration analysis engine that can simulate changes and detect potential issues before they occur. Until recently, this engine was only available to users of the versioned migration workflow via the migrate lint command. However, in the latest release, we've included the analysis step in the schema plan and schema apply commands for declarative migrations.

When a user runs the schema plan command, Atlas will analyze the plan and present the user with a detailed report of the potential impact of the migrations.

For instance, if a user tries to drop a column or table, Atlas will present a warning:


Planning migration from local database to file://./schema.pg.hcl (1 statement in total):

-- drop "t1" table:
-> DROP TABLE "t1";

-------------------------------------------
Analyzing planned statements (1 in total):

-- destructive changes detected:
-- L2: Dropping table "t1"
https://atlasgo.io/lint/analyzers#DS102
-- suggested fix:
-> Add a pre-migration check to ensure table "t1" is empty before dropping it

-------------------------
-- 16.281417ms
-- 1 schema change
-- 1 diagnostic

? Approve or abort the plan:
▸ Approve and save
Abort

For a full list of diagnostics and suggested fixes, users can visit the Atlas Lint Analyzers page.

Modern CI/CD Workflows with Declartive Migrations

Atlas was designed to be used in modern CI/CD workflows, and the declarative migration workflow is no exception. In the final section of this post, we'll explore how teams can integrate Atlas into their CI/CD pipelines to ensure that:

  1. All changes to the database schema result from changes to the desired state.
  2. Changes are planned automatically during the CI process.
  3. Changes are analyzed automatically to detect potential issues.
  4. Changes are approved by the team before being applied to the database.
  5. Changes are applied to the database in a deterministic and safe manner.

Demo: CI/CD Workflow with Atlas

Here's a quick demo video of how to integrate Atlas into a CI/CD pipeline using GitHub Actions:

.

To summarize, our demo CI/CD workflow works as follows:

  1. The user changes the desired state of the database, in our case a file named schema.sql.
  2. The user commits the changes to the schema.sql file and creates a pull request.
  3. GitHub Actions runs the atlas schema plan command to calculate the migration plan.
  4. The Action creates a comment on the pull request with the following information:
    • The migration plan.
    • An analysis of the changes, including any potential issues.
    • Instructions for manually editing the plan if needed.
  5. The user reviews the plan and approves the PR.
  6. Once the PR is merged, GitHub Actions run another action that pushes the new state and plan to the Schema Registry.
  7. Once the desired state and plan are in the Schema Registry, the user can deploy the changes to the database using the atlas schema apply command.

A word about the Schema Registry

The Schema Registry is a fairly new concept introduced by Atlas. The registry is a central location where schemas and migration plans can be stored and shared among team members. You can think about it as a "DockerHub for Schema Management" or an "Artifact Repository for Database Schemas".

As we saw, it is possible to simply use files to store the desired state and migration plans, but the Schema Registry provides several advantages:

  • Source of Truth - the registry can be used as the source of truth for the database schema, ensuring that all team members are working with the same schema and that all changes are tracked and go through a structured review process. Similar to how teams use DockerHub to store and share container images, making sure that only code that went through a team's CI/CD pipeline makes it's way to production.
  • Streamlined Deployment - as we will see below, the registry isn't a static blob storage, but a dynamic service that can be queried and interacted with. This allows the Atlas CLI to only fetch relevant and approved migration plans, during deployment, enforcing the deterministic and safe execution of the migrations.
  • Visualizations and Insights - the registry can be used to generate visualizations of the schema, track changes over time, and provide insights into the state of the database. This can be useful for auditing, debugging, and planning future changes.

How everything fits together

As we saw in the demo, the Schema Registry is a central component of the CI/CD workflow. It acts as the source of truth for the database schema and migration plans, ensuring that all changes are tracked and reviewed before being applied to the database.

Here's what happens when a user runs the atlas schema apply command:

  1. Atlas inspects the target database and fetches the current state of the schema. Atlas assigns a "fingerprint hash".
  2. Atlas analyzes the desired state of the database and calculates the hash for that state.
  3. Atlas uses this tuple of (from, to) hashes to search the Schema Registry for a matching migration plan.
  4. If a plan is found, Atlas verifies that it is approved and safe to apply.
  5. If the plan is approved, Atlas applies the plan to the database.

What happens if an approved plan is not found?

By default, Atlas will not apply any changes to the database if an approved plan is not found. This is a safety feature that ensures that only changes that have been reviewed and approved by the team are applied to the database.

However, in some cases, users may want to apply changes to the database that have not been approved. After all, most schema changes are simple and additive and don't pose any risk to the database. Additionally, some environments are mostly used for testing and development and don't require the same level of scrutiny as production.

In cases where no approved plan is found, Atlas will calculate the plan and run it through analysis ("linting").

Atlas provides a mechanism called "Review Policies" to define what should happen based on the linting results.

Review policies are defined in the project configuration file using the lint.review attribute:

lint {
review = ERROR // ERROR | ALWAYS
}

The review policy can be set to one of the following values:

  1. ERROR - Atlas will require manual review and approval only if the linting report contains errors, i.e., one of the analyzers is configured to return an error. See the destructive-changes analyzer as an example.
  2. WARNING - Atlas will require manual review and approval if the linting report contains warnings (diagnostics) or errors. See the list of checks that can be detected by the analyzers.
  3. ALWAYS (default) - Atlas will always require manual review and approval, regardless of the linting report.

Wrapping Up

In this post, we've taken a deep dive into the declarative migration workflow in Atlas. Here's what we covered:

  1. Until recently, declarative migrations were considered a risky and unreliable way to manage database schemas.
  2. The top challenges teams have with declarative migrations are around approval, safety, and customization.
  3. Atlas addresses these concerns by providing a way to review and approve changes, analyze and simulate changes, and customize the generated migrations.
  4. Atlas can be integrated into modern CI/CD workflows to ensure that all changes to the database schema are tracked, reviewed, and applied in a deterministic and safe manner.
  5. The Schema Registry is a central component of the CI/CD workflow, acting as the source of truth for the database schema and migration plans.
  6. Atlas provides a mechanism called "Review Policies" to define what should happen based on the linting results.

We hope this post has given you a better understanding of how Atlas can be used to manage database schemas in a declarative way. If you have any questions or would like to learn more, please don't hesitate to reach out to us on our Discord server.

How Conceal.IO Manages 1,500+ Redshift Schemas Using Atlas

· 4 min read
Rotem Tamir
Building Atlas

"Everything on Atlas is just making too much sense for us."
— Kaushik Shanadi, Chief Architect

Conceal, a cybersecurity company, creates a secure browsing experience using a browser extension. With a lean engineering team, When Conceal shifted from serving individual consumers to working with managed service providers (MSPs), their clients' security requirements drove the need for a robust, multi-tenant architecture to ensure data isolation and scalability.

Kaushik Shanadi, VP and Chief Architect, led the charge in finding that solution.

Database per tenant testimonial redshift

Evaluating Alternative Solutions

To meet the growing business demands and provide sufficient isolation for each customer’s data in a scalable and secure manner, the team considered three alternatives:

  1. Keep a Single Database, Isolating on the Application Layer: This option was quickly dismissed because of the team's negative experience with this solution.

    Like many others, they found that adding a tenant_id column to every table and hoping that developers remember to filter by it on every query was a burden and risk they were not willing to take. Additionally, this was not acceptable to some of their customers, who required strict data isolation.

  2. Database-per-Tenant Approach: While this approach ensured both data isolation and scalability, the cost of maintaining a Redshift cluster for each customer made this alternative prohibitive.

  3. Hybrid Solution (Schema-per-Tenant): Ultimately, they chose a schema-per-tenant model, which kept the data isolated and secure without the high cloud costs. This approach also offered the flexibility to switch specific customers to their own isolated database if needed.

"It was way too easy to do that with Atlas compared to any other method," Kaushik remarked.

info

For a deeper analysis of the different multi-tenant architectures, read our blogpost and guide on the topic.

The Challenge of Schema-per-Tenant Architecture

Schema-per-tenant architectures present unique challenges, primarily around managing database schema migrations:

  • Migration duration scales linearly with tenant count.
  • Detecting inconsistencies becomes a needle in a haystack problem.
  • Rollbacks are difficult to orchestrate.

Atlas overcomes these challenges with its declarative schema-as-code approach. By automating migration planning, Atlas ensures that every schema remains consistent, aligned, and easy to manage.

Read more about the challenges and how Atlas solves them here

Safety First: Managing 1,500+ Schemas with 7 Engineers Using One Tool

Implementation

According to Kaushik, the implementation process was easy and smooth. Amazon Redshift was a requirement for both long-term storage and machine learning (ML) training data. "Migrating with old migration tools is a nightmare," said Kaushik. After discovering that Atlas supports Redshift, he ran a few POCs locally to test Atlas.

"I was able to get everything working and saw how fast the migration process was, so we pushed it to development," he explained.

Outcome

By leveraging Atlas' declarative Schema-as-Code capabilities and its ability to manage schema per tenant architectures, Conceal.IO successfully manages a fleet of over 1,500 identical Redshift schemas, isolating customers' data from one another as required. This unified approach ensures that all schemas are designed consistently, aligned, and kept in the same state. This alignment has resulted in:

  • Faster feature deployment across all customers
  • Improved analytics and ML preparation
  • Faster onboarding for new engineers
  • The ability to add more customers as needed without worrying about deployments becoming slower as they grow
  • Fixed operational costs, ensuring scalability without an exponential rise in expenses

All of this was achieved in a safe and secure environment.

"Having a lot of database schemas but only one tool to manage them all makes our lives so much easier", Kaushik added.

Next Steps

The Conceal.IO team plans to use the upcoming holiday season to migrate their production workloads running on PostgreSQL into Atlas as well. Tired of dealing with constant migration issues, Kaushik is confident that “using Atlas will make it so much easier.”

"Support has been awesome, and the speed has been incredible too," summarized Kaushik. "Everything I need from an enterprise solution is available out of the box. It was almost too easy to say — let’s just use Atlas."

Getting Started

Atlas applies the declarative mindset to database schema management, like Terraform, but for databases. Using its unique, schema-as-code approach, teams can automatically inspect existing databases and get started in no time.

Like Conceal.IO, we recommend anyone looking for a schema migration solution to get started with Atlas by trying it out on one or two small projects. Read the documentation, join our Discord community for support, and start managing your schemas as code.

Strategies for Reliable Schema Migrations

· 12 min read
Rotem Tamir
Building Atlas

Adapted from a talk given at Kube Native 2024:

Introduction

Database schema migrations are a critical part of the software development lifecycle. They allow us to evolve our data model as our application grows and changes. However, migrations can also be a significant source of risk and downtime if not handled carefully.

Despite teams applying all the common best practices, such as using a migration tool to automate changes and carefully examining each migration before approving it during code review, issues still slip through the cracks, making migrations a source of significant headaches for many engineering teams.

This post explores five strategies for making database schema migrations more reliable by introducing Atlas, a database schema-as-code tool that helps teams make their database schema changes both safer and easier to manage.

Strategies for Reliable Schema Migrations with atlas

Prerequisite: Automate

Having interviewed over a hundred engineering teams about their database migration practices, we found that a surprisingly large number of teams perform database migrations manually. This involves running SQL statements via scripts or a point-and-click interface directly against the database, often without proper testing or version control.

Manual migrations are error-prone and difficult to reproduce, leading to inconsistencies between environments and increased risk of downtime. Additionally, manual changes to the database are amongst the most stressful and dreaded tasks for developers, as they can lead to nasty outages if not done correctly.

Much has been written about the importance of automating database migrations, but it's worth reiterating here. Even if you do nothing else, please automate your schema changes!

Strategy 1: Schema-as-Code

Classic database migration tools like Flyway or Liquibase are great for automating the process of executing SQL scripts against a database. However, despite being categorized as "automation tools," they still require plenty of manual work to write, test and review the SQL scripts that define the schema changes.

Atlas takes a different approach by treating the database schema as code. Instead of manually defining the schema changes in SQL scripts, developers define the desired state of the database schema in a declarative format using code and let the tool handle the rest.

For instance, a developer may define the following schema in Atlas:

schema.sql
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(255) NOT NULL
);

Next, to apply this schema to a database, the developer runs the following command:

atlas schema apply --env local --to file://schema.sql

Atlas will connect to the database, inspect its current schema, calculate the diff between the current and desired state, and propose a migration plan to bring the database to the desired state. The developer can then review and approve the migration before applying it.

By automating the process of defining and applying schema changes, Atlas makes it easier to manage database migrations and reduces the risk of human error.

Strategy 2: Test DB logic like any other code

Modern databases are a lot more than just containers for data. They also embody complex business logic in the form of constraints, triggers, stored procedures, and functions.

To make sure that these database objects work as expected, and keep doing so after changes are made, it's important to test them like any other code.

Atlas provides a testing framework that allows developers to write tests with a simple syntax. For example, to test a function that returns true if the input is positive and false otherwise, a developer might write the following test:

schema.test.hcl
test "schema" "positive_func" {
parallel = true
assert {
sql = "SELECT positive(1)"
}
log {
message = "First assertion passed"
}
assert {
sql = <<SQL
SELECT NOT positive(0);
SELECT NOT positive(-1);
SQL
}
log {
message = "Second assertion passed"
}
}

By treating database objects as code and writing tests for them, developers can catch make sure that their database code works reliably and consistently and prevent regressions when making changes.

Strategy 3: Test data migrations

Most commonly, migrations deal with schema changes, such as adding or removing columns, creating tables, or altering constraints. However, as your application evolves, you may need to add or refactor data within the database, which is where data migrations come in. For instance, you may need to seed data in a table, backfill data for existing records in new columns, or somehow transform existing data to accommodate changes in your application.

Data migrations can be especially tricky to get right, and mistakes can be problematic and irreversible. For this reason testing data migrations is crucial. Testing data migrations typically involves the following steps:

  1. Setting up an empty database.
  2. Applying migrations up to the one before the test.
  3. Seeding test data.
  4. Running the migration under test.
  5. Making assertions to verify the results.

This process can be cumbersome to set up and buggy as it often involves writing an ad-hoc program to automate the steps mentioned above or manually testing the migration.

Atlas's migrate test command simplifies this by allowing you to define test cases in a concise syntax and acts as a harness to run these tests during local development and in CI.

Using Atlas, developers can write tests for data migrations in a simple format, making it easier to catch issues early and ensure that data migrations work as expected. For example, to test a data migration that backfills a new column with values from an existing column, a developer might write the following test:

migrate.test.hcl
test "migrate" "check_latest_post" {
migrate {
to = "20240807192632"
}
exec {
sql = <<-SQL
INSERT INTO users (id, email) VALUES (1, 'user1@example.com'), (2, 'user2@example.com');
INSERT INTO posts (id, title, created_at, user_id) VALUES (1, 'My First Post', '2024-01-23 00:51:54', 1), (2, 'Another Interesting Post', '2024-02-24 02:14:09', 2);
SQL
}
migrate {
to = "20240807192934"
}
exec {
sql = "select * from users"
format = table
output = <<TAB
id | email | latest_post_ts
----+-------------------+---------------------
1 | user1@example.com | 2024-01-23 00:51:54
2 | user2@example.com | 2024-02-24 02:14:09
TAB
}
log {
message = "Data migrated successfully"
}
}

In this test, the developer migrates the database to a specific version, seeds test data, runs the migration under test, and verifies the results. By automating this process, Atlas makes it easier to test data migrations and catch issues early.

Strategy 4: Automate Safety Checks

Even with the best intentions, mistakes can still happen during the development and review of migrations leading to downtime and data loss. To mitigate this risk, it's important to automate safety checks that catch common mistakes before applying the migration to the database.

Before we dive into strategies for reliable migrations, let's take a look at some common ways migrations can go wrong.

Destructive Changes

mysql> select * from dropped;

ERROR 1146 (42S02): Table 'default.dropped' doesn't exist

Migrations often involve DROP DDL statements that can lead to data loss if executed against a table or column that still contains valuable data. Unfortunately, modern databases do not have a built-in undo button, so once a destructive change is applied, it can be challenging (if not impossible) to recover the lost data.

This might sound like an obvious mistake to avoid, but it's surprisingly common in practice. For example, consider this published incident report from Resend, which states:

On February 21st, 2024, Resend experienced an outage that affected all users due to a database migration that went wrong. This prevented users from using the API (including sending emails) and accessing the dashboard from 05:01 to 17:05 UTC (about 12 hours).

The database migration accidentally deleted data from production servers. We immediately began the restoration process from a backup, which completed 6 hours later.

Constraint Violations

mysql> alter table candy ADD UNIQUE (name);

ERROR 1062 (23000): Duplicate entry 'm&m' for key 'candy.name'

Migrations that involve adding or modifying constraints can fail if the existing data does not meet the new constraints. For example, adding a NOT NULL constraint to a column that already contains NULL values will cause the migration to fail.

What makes this even more confusing is that such migrations will often succeed in a development or testing environment where the data is different from production. This can lead to a false sense of confidence that the migration is safe to apply in production.

Breaking Changes

mysql> select id, renamed_column from candy limit 1;

ERROR 1054 (42S22): Unknown column 'renamed_column' in 'field list'

A popular topic amongst data teams today is "data contracts" - the (sometimes implicit) agreement between the applications and their downstream consumers about the structure and semantics of the data. This is often mentioned in the context of data engineering teams building data pipelines, but the same concerns apply to the interface between your application backend and the database.

When a migration changes the structure of the database, it can break the contract between the application and the database, leading to runtime errors and potentially to data corruption. For example, renaming a column that is used by the application will cause queries to fail, leading to downtime and unhappy users.

Table locks

mysql> INSERT INTO candy (name, kind) VALUES ('kif-kef', 'chocolate');

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Migrations that involve large tables or complex operations can cause table locks that prevent other queries from executing. This can lead to timeouts, deadlocks, and other performance issues that affect the availability of the application.

For instance, suppose your MySQL table has an enum column with a million rows, and you want to add a new value to the enum. If you add the new value, not at the end of the enum, but in the middle (for example from ('a', 'b', 'c') to ('a', 'b', 'd', 'c')), MySQL will lock the table for the duration of the migration, while it is rewriting the column on disk.

Preventing Risky Migrations

To prevent these common mistakes, Atlas provides a set of safety checks that run automatically before applying a migration. These checks analyze the migration and the database schema to identify potential issues and warn the developer before applying the migration.

Such checks can be run either locally during development or in CI before approving the migration and merging it into the main branch. By catching issues early, these safety checks help prevent downtime and data loss caused by risky migrations.

For instance, Atlas might warn the developer about a potentially destructive change like dropping a table:

atlas migrate lint --env local --latest 1

Might produce the following output:

Analyzing changes from version 20240929125035 to 20240929125127 (1 migration in total):

-- analyzing version 20240929125127
-- destructive changes detected:
-- L2: Dropping non-virtual column "email"
https://atlasgo.io/lint/analyzers#DS103
-- suggested fix:
-> Add a pre-migration check to ensure column "email" is NULL before dropping it
-- ok (225.916µs)

-------------------------
-- 98.996916ms
-- 1 version with warnings
-- 1 schema change
-- 1 diagnostic

Being able to identify and fix these issues while working locally can save a lot of time and headaches down the road, but this is further amplified when running these checks in CI, where teams can ensure that no risky migrations are merged into the main branch.

Strategy 5: Pre-migration checks

As we mentioned above, safety checks can catch common mistakes before applying a migration, but they are not foolproof. Some changes depend on the state of the data in the database, which cannot be determined statically by analyzing the migration script.

For example, consider a migration that adds a NOT NULL constraint to a column that already contains NULL values. A safety check can warn the developer about this potential issue, but it cannot guarantee that the migration will succeed in all cases. Similarly, a migration that drops a column might be safe and reversible if the column is empty, but risky if it contains valuable data.

To handle these cases, Atlas provides a mechanism for defining pre-migration checks that run before applying the migration. These checks can analyze the state of the database and the data to determine if the migration is safe to apply. In case of an issue, the check can prevent the migration from running and provide guidance on how to resolve the issue.

For instance, the code below defines a pre-migration check that ensures the table users is empty before dropping it:

20240201131900_drop_users.sql
-- atlas:txtar

-- checks.sql --
-- The assertion below must be evaluated to true. Hence, table users must be empty.
SELECT NOT EXISTS(SELECT * FROM users);

-- migration.sql --
-- The statement below will be executed only if the assertion above is evaluated to true.
DROP TABLE users;

Summary

Database schema migrations are a critical part of the software development lifecycle, but they can also be a significant source of risk and downtime if not handled carefully. By following the strategies outlined in this post and using tools like Atlas, teams can make their database schema migrations more reliable and reduce the risk of downtime and data loss.

To summarize, here are the five strategies for reliable schema migrations:

StrategyImpact
Schema-as-codeAutomated migration planning
Test schema logicPrevent regressions in DB logic
Test data migrationsPrevent data corruption
Automated quality checksCatch risky changes in Dev/CI
Pre-migration checksVerify data dependent changes

We hope you enjoyed this post and found it useful. As always, we welcome your feedback and suggestions on our Discord server.

Atlas: Like Terraform, but for Databases

· 9 min read
Ariel Mashraki
Building Atlas

Hello everyone,

Today, we're excited to release the new schema plan command, which many of you have been eagerly awaiting.

Taking the declarative workflow to the next level, the schema plan command lets you review, analyze and even edit declarative migration plans at pull-request stage, making schema apply much safer and predictable. Additionally, several new features have been added to Atlas in this release, and we'll cover them in this blog post as well.

What is Atlas?

For those visiting us for the first time, Atlas is a language-agnostic tool for managing and migrating database schemas using modern DevOps principles. Users define their desired database schema state declaratively, and Atlas handles the rest. The "state" can be defined using SQL, HCL (Atlas flavor), your preferred ORM, another database, or a combination of all. To get started, visit the getting-started doc.

Why schema plan?

Since the first release, Atlas supports declarative migrations. Using the schema apply command, users provide the desired schema, and a URL (connection string) to the target database, and Atlas computes the migration plan, and applies it to the database after the user approves it. This workflow is very similar to Terraform, but for databases schemas.

Although the declarative workflow feels magical, and works well for most cases, it had some inherent limitations:

  1. Since changes are computed at runtime, reviews also happen at runtime, either by policy (explained below) or manually. This creates a less predictable and streamlined deployment process compared to applications development, where code reviews occur during the pull request (PR) stage. Since Atlas promotes the "Schema as Code" approach, we aim to bring the same experience to database schema changes.
  2. Another limitation of this workflow is that users can define the desired state but have no control on the exact steps to reach it. Although Atlas provides a set of diff policies to fine-tune migration planning, users sometimes need more control over how the migrations are applied.
  3. Data changes, like back-filling columns with custom UPDATE statements, are difficult to express declaratively.

Fortunately, since Atlas provides also a versioned workflow, companies faced these limitations have been able to fall back to it. While versioned migration has its own limitations (like history linearity), it still works well for most cases. Combined with Atlas's automatic migration planning, the overall experience is closely to the declarative migration, but not the same.

We believe that declarative migration is the future for most cases. It lets engineers focus on feature development, not migrations. Additionally, this workflow allows schema transitions between any states, generating the most efficient plan, unlike versioned migration, which relies on a linear history of changes.

We address these limitations by introducing the schema plan command. Let's dive in.

What is schema plan?

The atlas schema plan command allows users to pre-plan, review, and approve declarative migrations before executing them on the database. It lets users modify the SQL migration plan (if necessary), involve team members in the review, and ensure the approval is done at development stage, and no human intervention is needed during deployment (atlas schema apply) stage.

How does it work? Users modify their schema code (e.g., ORM models, SQL or HCL) and open a PR with the changes. Then, Atlas computes the migration plan, runs analysis, and simulates it on a dev-database. Lastly, it comments on the PR with the results:

GitHub Action for schema plan command

Plan Generated by atlas schema plan

Once the PR is approved and merged, the plan is saved in the Atlas Registry in a "ready to be applied" state. During deployment (schema apply), Atlas checks for any pre-planned migration for the given schema transition (State1 -> State2) and uses it if available, otherwise falling back to other approval policies.

This process can also be done locally, allowing users to plan and approve locally, then apply remotely.

If you follow our blog, you know we love practical examples. To maintain this tradition and demonstrate the new command, let’s dive into an example.

Example

Before running atlas schema plan, let's ensure that a schema repository named app exists in Atlas Registry and there is a database containing the previous schema state (before our changes):

schema.sql
CREATE TABLE users (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT
);

We run atlas schema push to create the schema in Atlas Registry:

atlas schema push --env local

Schema: app
-- Atlas URL: atlas://app
-- Cloud URL: https://a8m.atlasgo.cloud/schemas/141733920781

Then, we run atlas schema apply to align the database with the schema state:

atlas schema apply --env local --auto-approve

At this stage, our database main.db contains the users table with the id and name columns.

Changing the Schema

Suppose we want to add a non-nullable email column to the users table. Let's update the schema.sql file and then run atlas schema plan to generate a migration plan.

schema.sql
CREATE TABLE users (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT,
email TEXT NOT NULL
);

We run atlas schema plan to generate a migration plan for adding the email column to the users table:

atlas schema plan --env local

The output looks like this:

Planning migration from local database to file://schema.sql (1 statement in total):

-- add column "email" to table: "users":
-> ALTER TABLE `users` ADD COLUMN `email` text NOT NULL;

-------------------------------------------

Analyzing planned statements (1 in total):

-- data dependent changes detected:
-- L2: Adding a non-nullable "text" column "email" will fail in case table "users"
is not empty https://atlasgo.io/lint/analyzers#MF103
-- ok (346.192µs)

-------------------------
-- 5.038728ms
-- 1 schema change
-- 1 diagnostic
? Approve or abort the plan:
▸ Approve and push
Abort

Data-Dependent Changes

Atlas detects data-dependent changes in the migration plan and provides a diagnostic message. In this case, it warns that adding the non-nullable email column, will fail if the users table is not empty. The recommended solution is to provide a default value for the new column. Let's fix this by adding a default value to the email column and re-run the atlas schema plan command.

schema.sql
CREATE TABLE users (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT,
email TEXT NOT NULL DEFAULT 'unknown'
);

Then, we run atlas schema plan again to generate a new migration plan, but this time, we approve it:

atlas schema plan --env local
Planning migration from local database to file://schema.sql (1 statement in total):

-- add column "email" to table: "users":
-> ALTER TABLE `users` ADD COLUMN `email` text NOT NULL DEFAULT 'unknown';

-------------------------------------------

Analyzing planned statements (1 in total):

-- no diagnostics found

-------------------------
-- 6.393773ms
-- 1 schema change
? Approve or abort the plan:
▸ Approve and push
Abort

Once approved, the migration plan will be pushed to the Atlas Registry, and can be applied using atlas schema apply.

Plan Status: APPROVED
-- Atlas URL: atlas://app/plans/20240923085308
-- Cloud URL: https://a8m.atlasgo.cloud/schemas/141733920769/plans/210453397504

At this stage, we can run atlas schema apply to apply the changes to the database, on any environment, without re-calculating the SQL changes at runtime or requiring human intervention.

Applying approved migration using pre-planned file 20240923085308 (1 statement in total):

-- add column "email" to table: "users"
-> ALTER TABLE `users` ADD COLUMN `email` text NOT NULL DEFAULT 'unknown';
-- ok (749.815µs)

-------------------------
-- 802.902µs
-- 1 migration
-- 1 sql statement

Atlas Registry

Starting with this release, Atlas Registry supports the declarative workflow. It allows you to store, version, and maintain a single source of truth for your database schemas and their migration plans.

It is similar to DockerHub, but for your schemas and migrations. In addition to functioning as storage and Atlas state management, it is schema-aware and provides extra capabilities such as ER diagrams, SQL diffing, schema docs, and more.

atlas migrate push

Schema pushed with atlas schema push

What else is new?

In addition to the schema plan command, we have added several new features and improvements to Atlas. Here are some highlights:

  1. Users running atlas schema apply with a Pro license will now receive a detailed migration linting report and can control the approval based on it. Read more about the Review and Approval Policies.
  2. The schema apply command now supports the --edit flag, allowing users to safely edit the migration plan before applying it. Note that if your manual changes are not in sync with the desired state, Atlas will detect schema drift and reject the changes.
  3. The GitHub Action and gh extension for Atlas have been updated to support the new declarative workflow.
  4. The ClickHouse driver now supports Dictionaries.
  5. The docker block in Atlas config now supports build blocks, allowing users to use custom Docker images for their dev-databases.
  6. The PostgreSQL driver now supports configuring DEFERRABLE constraints on primary keys, foreign keys, unique, and exclusion constraints.
  7. The external command was added to the Atlas testing framework, allowing users to run custom commands during the testing phase.

Wrapping Up

That's all for this release! But, we are already working on several features and improvements in the pipeline. To be transparent with our community, here is a look at what's coming next:

  1. Partition support for the PostgreSQL driver.
  2. CircleCI, GitLab CI, Kubernetes Operator, and Terraform Provider will support the new declarative workflow.
  3. A new schema lint command, allowing users to lint their schemas with built-in and custom analyzers.
  4. A Prisma provider for Atlas, enabling Prisma users to import their Prisma schema into Atlas schema state.

We hope you enjoy the new features and improvements. As always, we would love to hear your feedback and suggestions on our Discord server.

Pricing Changes for Atlas Pro

· 2 min read
Rotem Tamir
Building Atlas

Hey Atlas Users,

We wanted to share some important updates regarding changes to our Atlas Pro plan that will come into effect soon.

What's Changing?

Starting September 27th, 2024, Atlas Pro seats will no longer be available for free. New users will still have access to a 30-day free trial, but after that, a license will be required to continue using Atlas Pro.

For our existing users, don't worry! We've got you covered with a 30-day grace period to upgrade your account. Plus, we're offering each company credits that should cover their bill for an additional month.

For our existing paying users, as a token of our appreciation, you'll continue to receive 3 free seats on top of what you've already paid for, which means you are not impacted by this change. We value your continued support and want to make sure you're getting even more out of Atlas Pro.

We also recognize that many of you are using Atlas Pro for non-commercial projects, so we're excited to introduce the Atlas Pro Hacker License! This license offers full access to Atlas Pro features for students, maintainers of open-source projects, and hobbyists working on non-commercial software.

Why Are We Doing This?

At its core, Atlas has always been an open-source project available for anyone to use. However, to ensure the long-term sustainability and success of Atlas, we believe it's important to have a strong, profitable business model. This allows us to continue building and maintaining Atlas for the future, while fostering a vibrant community around it.

What You Need to Do

To continue using Atlas Pro, you'll need to upgrade your organization's account by October 27th, 2024. You can do this by heading over to the Settings page for your organization, where you'll be able to purchase Atlas Pro through Stripe or directly via AWS Marketplace.

  • Apply for company credits here.
  • Apply for the free Atlas Pro Hacker License here.

We appreciate your support and understanding as we make these changes. If you have any questions or need assistance, don't hesitate to reach out to us via our Discord server or at hello@ariga.io!

All the best, Ariel and Rotem

How Go Tests "go test"

· 21 min read
Rotem Tamir
Building Atlas

Adapted from a talk given at GopherCon Israel 2024

Introduction

How does Go, the project, and team behind it, test go test, the Go tool's command for running tests? Does Go test go test using the go test command? In this article, we explore the evolution of how the Go team tests the Go tool (go) and discuss strategies for testing command-line tools written in Go in general.

CLIs and Go

If you are a software engineer in 2024, you are most likely using a CLI tool written a tool to perform some critical part of your work. Perhaps you're using docker, to build and run container images or kubectl to interact with a kubernetes cluster. Maybe you're using terraform to manage your infrastructure as code. Maybe you're using atlas (the project I work on) to manage your database schema as code. You could be using trivy to scan your code for vulnerabilities or gh to interact with your code on GitHub.

Go is a fantastic language for writing CLI tools, and today we're going to try and study some of the strategies that you can employ to test CLI tools by looking at how the Go team tests the go tool.

Motivation

My personal motivation for digging into this topic arose from my work on Atlas, a database schema as code tool. Atlas is a CLI written in Go (see our GitHub repo), that has sometimes been described as a "Terraform for databases." It is used by many companies big and small to streamline their database schema management process.

One of the first decisions Ariel (my co-founder) and I made when we started to work on Atlas was that we were going to be employing a continuous delivery strategy, shipping new features and bug fixes to our users as soon as they were ready, often times multiple times a day. This meant that we needed to have a robust testing strategy in place to ensure that we were shipping high-quality software to our users. After all, we were building a tool that was going to be used by other developers to manage their most critical data assets.

Testing CLI tools

Before we dive into how the Go team tests the go tool, let's take a step back and think about what CLI testing is all about. Testing CLIs has it's unique challenges, but at the end of the day, it's very similar to testing any other piece of software.

As with all automated tests, we can identify four discrete phases with CLI tests which I characterize as the "Quadruple A" of testing:

  • Arrange: We setup the environment for the test. For CLI tests this typically involves creating temporary files, and setting up environment variables.
  • Act: When testing server software we would issue a request, but when testing CLIs, this means executing the binary under test, often supplying it with command-line arguments, flags, and potentially piping data into STDIN.
  • Assert: We consume the output streams (STDOUT, STDERR) and compare them to expected values. We also check the exit code of the process, and any side effects that the command may have had on the environment.
  • And... cleanup: We clean up the environment, removing any temporary files, and resetting any environment variables that we may have changed. Failing to do this can lead to flaky tests - which debugging is arguably one of the worst things in software development.

How the Go team tests go test

With that in mind let's now explore how testing the go tool has evolved over time.

This section is mostly built upon a terrific and detailed commit message on CL #123577 by Russ Cox. I highly recommend reading the original commit message for a more detailed explanation of the evolution of the Go test suite.

2012-2015: test.bash

In the early days of Go, the Go test suite was tested using a shell script called test.bash. This script started out as a simple 30-40 line script that ran the go tool with various flags and options and checked the output. Over time, as the Go tool grew in complexity, so did the test.bash script. It eventually grew to be a 1500+ line shell script that tested the go tool in a variety of ways. The tests looked something like this:

TEST 'file:line in error messages'
# Test that error messages have file:line information at beginning of
# the line. Also test issue 4917: that the error is on stderr.
d=$(TMPDIR=/var/tmp mktemp -d -t testgoXXX)
fn=$d/err.go
echo "package main" > $fn
echo 'import "bar"' >> $fn
./testgo run $fn 2>$d/err.out || true
if ! grep -q "^$fn:" $d/err.out; then
echo "missing file:line in error message"
cat $d/err.out
ok=false
fi
rm -r $d

If you examine the test above, you will see that it is comprised of the same four phases that we discussed earlier: Arrange, Act, Assert, and Cleanup:

  • Arrange: The test creates a temporary directory and a temporary file.
  • Act: The test runs the go tool with the run subcommand and pipes the output to a file.
  • Assert: The test checks that the output contains the filename and line number of the error.
  • Cleanup: The test removes the temporary directory.

Russ writes about the test.bash script:

The original cmd/go tests were tiny shell scripts written against a library of shell functions.

They were okay to write but difficult to run: you couldn't select individual tests (with -run) they didn't run on Windows, they were slow, and so on.

The tests had always been awkward to write.

2015-2018: testgo

In June 2015, CL #10464 introduced go_test.go. This file contained a basic framework for writing Go tests for the go tool named testgo. The same test from above, written in Go, looked something like this:

func TestFileLineInErrorMessages(t *testing.T) {
tg := testgo(t)
defer tg.cleanup()
tg.parallel()
tg.tempFile("err.go", `package main; import "bar"`)
path := tg.path("err.go")
tg.runFail("run", path)
shortPath := path
if rel, err := filepath.Rel(tg.pwd(), path);
err == nil && len(rel) < len(path) {
shortPath = rel
}
tg.grepStderr(
"^"+regexp.QuoteMeta(shortPath)+":",
"missing file:line in error message",
)
}

As you can see, the test is still comprised of the same four phases: Arrange, Act, Assert, and Cleanup:

  • Arrange: The test creates a temporary file.
  • Act: The test runs the go tool with the run subcommand.
  • Assert: The test checks that the output contains the filename and line number of the error.
  • Cleanup: The test removes the temporary file. (this happens in the defer tg.cleanup() call)

Russ writes about the testgo framework:

“CL 10464 introduced go_test.go's testgo framework and later CLs translated the test shell script over to individual go tests. This let us run tests selectively, run tests on Windows, run tests in parallel, isolate different tests, and so on.

It was a big advance. It's better but still quite difficult to skim.”

2018-?: script_test.go

Most teams and projects that I know would stop here. Go's testing infrastructure, the testing package, as well as the accompanying go test tool is terrific. When coupled with some thoughtful library code, testing CLIs in Go can be a breeze. But the Go team didn't stop there. In 2018, CL #123577 introduced a new testing framework for the go tool called script_test.go.

Russ writes about it:

script_test.go brings back the style of writing tests as little scripts, but they are now scripts in a built-for-purpose shell-like language, not bash itself.

Under script_test.go, test cases are described as txt files which are txtar archives containing the test script and any accompanying files. Here's the "Hello, world" example for script_test:

# src/cmd/go/testdata/script/run_hello.txt

# this is a txtar archive

# run hello.go (defined below)
go run hello.go

# assert ‘hello world’ was printed to stderr
stderr 'hello world'

-- hello.go --
package main
func main() { println("hello world") }

As before, the test comprises the same four phases: Arrange, Act, Assert, and Cleanup:

  • Arrange: The test creates a temporary file, defined by the -- hello.go -- section.
  • Act: The test runs the go tool with the run subcommand on the temporary file.
  • Assert: The test checks that the output contains the string hello world.
  • Cleanup: Where is the cleanup code? We'll explore that in a moment.

How does script_test.go work?

script_test does a lot of cool things under the hood that makes it ideal for testing a CLI tool:

  1. Each script becomes a Go sub-test, which means from the perspective of go test, it's a normal test, that can be run in parallel, skipped, or run with the -run flag.
  2. script_test creates an isolated sandbox for each test, so that tests can't interfere with each other. Doing so enables it to run tests in parallel, which can significantly speed up the test suite.
  3. The files defined in the txtar archive are created in the sandbox, and the test is run in that directory.

After setting up, script_test runs the test script commands, line by line:

  1. Commands are run in sequence, and the output is captured into stdout and stderr buffers.
  2. Commands are expected to succeed, unless explicitly negated with ! at the beginning of the line.
  3. Many helpful assertion commands such as stdout (to check the stdout buffer), stderr (to check the stderr buffer), and cmp (to compare the contents of two files) are available.

As for cleanup, script_test automatically cleans up the sandbox after each test, removing all files and directories created during the test.

Can I use script_test.go for my CLI?

If you are writing a CLI tool in Go, I hope by now you are pretty excited about script_test.go. Wouldn't you love to have a testing framework that allows you to write tests in a shell-like language, that can be run in parallel, and that automatically cleans up after itself?

You are probably asking yourself, "Can I use script_test.go for my CLI?"

Well, surprisingly enough, the answer is:

No, you can't.

script_test.go is under an internal package in the Go repository, and it is pretty tightly coupled to the go tool.

The End.

Or is it?

Introducing testscript

In late 2018, Roger Peppe, a long time Go user, contributor and member of the Go community created a repo named rogpeppe/go-internal to factor out some useful internal packages from within the Go codebase. One of these packages is testscript, which is based on the work the Go team created for script_test.

Roger was kind enough to speak with me in preparation for this talk, so I hope that even if you've read about it before, I can share some new things you haven't heard.

script_test.go, as we mentioned, never exposed a public API, and so over the past 6 years, the package gained steam and popularity, especially among Go "insiders" - people who knew about script_test, but couldn't use it. Today, according to public GitHub data, go-internal is depended upon by over 100K repositories on GitHub.

(As a side-note, Roger pointed out to me that it's difficult to get the exact number of projects that use testscript itself, as the go.dev site omits any dependencies that run through test code. If you look at go.dev it shows that only 14 (!) packages import it)

Because script_test never had a public API, and was very tightly coupled to testing the Go tool codebase, testscript should be thought of as more of a conceptual "factoring out" than a 1:1 exporting.

Over time, many features that weren't available in the original implementation, such as generating coverage data, a standalone CLI, and auto-updated of Golden files was added.

As I will show later, testscript is a fantastic tool and we have been utilizing it in the Atlas codebase for a long time with great success. However, it is worth mentioning that in November 2023, Russ Cox published a similar package named rsc.io/script which is also based on the script_test codebase. I haven't used it myself, but it's worth checking out.

Our example CLI: wordwrap

To demonstrate how testscript works, I've created a simple CLI tool named wordwrap. wordwrap is a simple tool that takes a path and applies simple word wrapping to all .txt files in that path. You can find the code on GitHub. Wordwrap has a few features that we would like to test:

On the simple case, suppose our current working directory contains a file named example.txt with the following content:

This is a text file with some text in it. To demonstrate wordwrap, it has more than 40 chars.

Running wordwrap:

go run ./cmd/wordwrap -path ./dir-with-txt-files

Our example.txt file would be transformed into:

This is a text file with some text in
it. To demonstrate wordwrap, it has more
than 40 chars.

By default, wordwrap wraps lines at 40 characters, but you can specify a different line length with the -width flag:

go run ./cmd/wordwrap -path ./dir-with-txt-files -width 20

Would wrap the lines at 20 characters:

This is a text file
with some text in
it. To demonstrate
wordwrap, it has
more
than 40 chars.

To make things more interesting, we have also added a -strict flag that will cause wordwrap to fail if any line in the file is longer than the specified width. For example, suppose our example.txt file contains a word that is 34 characters long:

It's supercalifragilisticexpialidocious
Even though the sound of it is something quite atrocious
If you say it loud enough you'll always sound precocious

Running wordwrap with the -strict flag and a width of 20:

go run ./cmd/wordwrap -path ./hack -width 20 -strict

Would fail with an error message:

file hack/example.txt: line 2 exceeds specified width 20
exit status 1

Writing tests with testscript

Let's see how to write tests for wordwrap using testscript.

To set up, first create a file named wordwrap_test.go in your projects and create the following boilerplate code:

package wordwrap_test

import (
"bufio"
"os"
"testing"
"github.com/rogpeppe/go-internal/testscript"

"rotemtam.com/wordwrap"
)

func TestMain(m *testing.M) {
os.Exit(testscript.RunMain(m, map[string]func() int{
"wordwrap": wordwrap.Run,
}))
}

func TestScript(t *testing.T) {
testscript.Run(t, testscript.Params{
Dir: "testdata",
})
}

Here's what's happening in the code above:

  1. Our TestMain function is a setup function that prepares the test environment. It uses testscript.RunMain to tell testscript that it should create a custom command wordwrap that runs the wordwrap.Run function. This simulates having a binary named wordwrap that runs our program's main function.
  2. TestScript is where the actual magic happens. It uses testscript.Run to run the tests in the testdata directory. The testdata directory contains the test scripts that we will write in the next step.

Our first test script

Let's create a file named testdata/basic.txt with the following content:

wordwrap

cmp basic.txt basic.golden

-- basic.txt --
This is a text file with some text in it. To demonstrate wordwrap, it has more than 40 chars.

-- basic.golden --
This is a text file with some text in
it. To demonstrate wordwrap, it has more
than 40 chars.

As before, you will find our test script is comprised of the same four phases: Arrange, Act, Assert, and Cleanup:

  • Arrange: The test creates a temporary file, defined by the -- basic.txt -- section.
  • Act: The test runs the wordwrap command.
  • Assert: The test compares the output to the contents of the basic.golden file. This is done using the included cmp command.
  • Cleanup: There is no explicit cleanup in this test, as testscript will automatically clean up the sandbox after the test.

The awesome thing about testscript, is that from go test's perspective, basic is just a regular Go test. This means that we can execute it as we would any other test:

go test -v ./... -run TestScript/basic

This is the output you should see:

=== RUN   TestScript
=== RUN TestScript/basic
=== PAUSE TestScript/basic
=== CONT TestScript/basic
testscript.go:558: WORK=$WORK
# --- redacted for brevity ---
> wordwrap
> cmp basic.txt basic.golden
PASS

--- PASS: TestScript (0.00s)
--- PASS: TestScript/basic (0.15s)
PASS
ok rotemtam.com/wordwrap (cached)

A more involved test script

Next, let's create a more involved test script that verifies additional behavior in wordwrap. Create a file named testdata/dont-touch.txt with the following content:

wordwrap -path p1.txt

! stderr .

cmp p1.txt p1.golden

exec cat dont-touch.txt
stdout 'This file shouldn''t be modified, because we invoke wordwrap with a path argument.'

-- p1.txt --
Don't communicate by sharing memory, share memory by communicating.

-- p1.golden --
Don't communicate by sharing memory,
share memory by communicating.

-- dont-touch.txt --
This file shouldn't be modified, because we invoke wordwrap with a path argument.

This test verifies that wordwrap doesn't modify files that are not passed as arguments. The test script is comprised of the same phases.

  • Arrange: The test creates p1.txt, which is the file we are going to modify, and dont-touch.txt, which is the file we don't want to modify.
  • Act: The test runs the wordwrap command with the -path flag.
  • Assert: The test compares the output to the contents of the p1.golden file. This is done using the included cmp command. The test also verifies that the dont-touch.txt file hasn't been modified.
  • Cleanup: There is no explicit cleanup in this test, as testscript will automatically clean up the sandbox after

Testing the -width flag

In addition, we should probably verify that the -width flag works as expected. Create a file named testdata/width.txt:

skip

wordwrap -width 60

cmp effective.txt effective.golden

-- effective.txt --
This document gives tips for writing clear, idiomatic Go code. It augments the language specification, the Tour of Go, and How to Write Go Code, all of which you should read first.

Note added January, 2022: This document was written for Go's release in 2009, and has not been updated significantly since.

-- effective.golden --
This document gives tips for writing clear, idiomatic Go
code. It augments the language specification, the Tour of
Go, and How to Write Go Code, all of which you should read
first.

Note added January, 2022: This document was written for Go's
release in 2009, and has not been updated significantly
since.

This test script verifies that the -width flag works as expected. The test script is comprised of the same phases.

This works, but I didn't love writing it. Creating the .golden file by hand is a bit tedious, and it's easy to make mistakes. In this case, wouldn't it be great if we could create a custom command that verifies that the output is wrapped at 60 characters?

Thankfully, testscript allows us to create custom commands. Let's create a custom command named maxlen that verifies that the output is wrapped at a maximum of n characters. Add the following code to wordwrap_test.go:

// maxline verifies that the longest line in args[0] is shorter than args[1] chars.
// Usage: maxline <path> <maxline>
func maxline(ts *testscript.TestScript, neg bool, args []string) {
if len(args) != 2 {
ts.Fatalf("usage: maxline <path> <maxline>")
}
l, ok := strconv.Atoi(args[1])
if ok != nil {
ts.Fatalf("usage: maxline <path> <maxline>")
}
scanner := bufio.NewScanner(
strings.NewReader(
ts.ReadFile(args[0]),
),
)
tooLong := false
for scanner.Scan() {
if len(scanner.Text()) > l {
tooLong = true
break
}
}
if tooLong && !neg {
ts.Fatalf("line too long in %s", args[0])
}
if !tooLong && neg {
ts.Fatalf("no line too long in %s", args[0])
}
}

In order to use the maxline command in our test scripts, we need to register it with testscript. Update the TestScript function in wordwrap_test.go to include the following code:

func TestScript(t *testing.T) {
testscript.Run(t, testscript.Params{
Dir: "testdata",

Cmds: map[string]func(ts *testscript.TestScript, neg bool, args []string){
"maxline": maxline,
},
})
}

Now we can use the maxline command in our test scripts. Create a new test named testdata/width-custom.txt with the following content:

wordwrap -width 60

! maxline effective.txt 20
maxline effective.txt 60

wordwrap -width 40
! maxline effective.txt 20
maxline effective.txt 40

wordwrap -width 20
maxline effective.txt 20

-- effective.txt --
This document gives tips for writing clear, idiomatic Go code. It augments the language specification, the Tour of Go, and How to Write Go Code, all of which you should read first.

Note added January, 2022: This document was written for Go's release in 2009, and has not been updated significantly since.

Running this test script will verify that the output is wrapped at 60 characters, 40 characters, and 20 characters:

go test -v ./... -run TestScript/width-custom

Output:

?       rotemtam.com/wordwrap/cmd/wordwrap      [no test files]
=== RUN TestScript
=== RUN TestScript/width-custom
=== PAUSE TestScript/width-custom
=== CONT TestScript/width-custom
testscript.go:558: WORK=$WORK
# --- redacted for brevity ---
> wordwrap -width 60
> ! maxline effective.txt 20
> maxline effective.txt 60
> wordwrap -width 40
> ! maxline effective.txt 20
> maxline effective.txt 40
> wordwrap -width 20
> maxline effective.txt 20
PASS

--- PASS: TestScript (0.00s)
--- PASS: TestScript/width-custom (0.19s)
PASS
ok rotemtam.com/wordwrap 0.626s

Testing strict mode

Finally, let's create a test script that verifies that the -strict flag works as expected. Create a file named testdata/strict.txt with the following content:

! wordwrap -path poppins.txt -width 20 -strict
stderr 'line 2 exceeds specified width 20'

wordwrap -path poppins.txt -width 20
cmp poppins.txt poppins.golden
-- poppins.txt --
It's supercalifragilisticexpialidocious
Even though the sound of it is something quite atrocious
-- poppins.golden --
It's
supercalifragilisticexpialidocious
Even though the
sound of it is
something quite
atrocious

This test script verifies that wordwrap fails when a line exceeds the specified width in strict mode.

Awesome!

Personal impact

Aside from being a super cool tool for writing tests for CLI tools, testscript has had a significant impact on my team. We have been using it in the Atlas codebase for a long time, and it has been a game-changer for us.

Atlas, as a schema-as-code tool, is a bridge between code (files) and databases. Thus, being able to easily write tests to verify our tool's behavior in a way that is close to how our users interact with it has been invaluable.

Over the years, we have accumulated a set of custom testscript commands that allow us to write test scripts in a fluent and intuitive way. You can see this in action in the Atlas codebase, but just to give you a taste, here is how our testscript entrypoint looks like for MySQL integration tests:

func TestMySQL_Script(t *testing.T) {
myRun(t, func(t *myTest) {
testscript.Run(t.T, testscript.Params{
Dir: "testdata/mysql",
Setup: t.setupScript,
Cmds: map[string]func(ts *testscript.TestScript, neg bool, args []string){
"only": cmdOnly,
"apply": t.cmdApply,
"exist": t.cmdExist,
"synced": t.cmdSynced,
"cmphcl": t.cmdCmpHCL,
"cmpshow": t.cmdCmpShow,
"cmpmig": t.cmdCmpMig,
"execsql": t.cmdExec,
"atlas": t.cmdCLI,
"clearSchema": t.clearSchema,
"validJSON": validJSON,
},
})
})
}

Having these commands, allow us to write test scripts that are easy to read and understand, and that verify the behavior of our tool is correct. For example:

apply 1.hcl
cmpshow users 1.sql

-- 1.hcl --
schema "main" {}

table "users" {
schema = schema.main
column "id" {
null = false
type = integer
auto_increment = true
}
primary_key {
columns = [column.id]
}
}

-- 1.sql --
CREATE TABLE `users` (`id` integer NOT NULL PRIMARY KEY AUTOINCREMENT)

This test applies an Atlas DDL schema on a SQLite database and verifies that the schema is created correctly.

Conclusion

In this article, we have explored how the Go team tests the go tool, and how you can apply similar strategies to test your CLI tools using testscript.

As a team that develops tools for other developers, we take the reliability of our tools very seriously. The key to this, we have found over the years, is to have a robust testing strategy in place. This allows us to move fast (without breaking things) and to ship high-quality software to our users.

Resources