Automatic Databricks Schema Migrations with Atlas
Databricks is a unified analytics platform that brings data engineering, data science, and machine learning together in one place. Built on Apache Spark, it provides a collaborative workspace for processing large-scale data workloads with high performance and reliability.
However, managing database schemas in Databricks can be challenging, especially when working with Unity Catalog's three-level namespace (catalog.schema.table) and coordinating schema changes across multiple teams and workspaces.
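To make that naming concrete, every table in Unity Catalog is addressed by catalog, schema, and table. A quick illustrative query (the main catalog and default schema here are just example names):
-- Fully qualified three-level reference: catalog.schema.table
SELECT * FROM main.default.users;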
Enter: Atlas
Atlas helps developers manage their database schema as code, abstracting away the intricacies of database schema management. With Atlas, users provide the desired state of the database schema, and Atlas automatically plans the required migrations.
In this guide, we will dive into setting up Atlas for Databricks schema migration using the declarative workflow.
The Databricks Driver is currently in beta and should be used with caution in production environments.
Prerequisites
- Databricks workspace and SQL warehouse:
- Access to a Databricks workspace
- A running SQL warehouse or compute cluster
- Personal Access Token (PAT) for authentication
- Atlas installed on your machine:
macOS + Linux
To download and install the beta release of the Atlas CLI, simply run the following in your terminal:
curl -sSf https://atlasgo.sh | ATLAS_VERSION="beta" sh
Docker
To pull the Atlas image and run it as a Docker container:
docker pull arigaio/atlas-beta
docker run --rm arigaio/atlas-beta --help
If the container needs access to the host network or a local directory, use the --net=host flag and mount the desired directory:
docker run --rm --net=host \
  -v $(pwd)/migrations:/migrations \
  arigaio/atlas-beta migrate apply \
  --url "mysql://root:pass@:3306/test"
Windows
Download the beta release and move the atlas binary to a file location on your system PATH.
- An Atlas Pro account. To use the Databricks driver, run:
$ atlas login
Setting up Databricks Authentication
To connect Atlas to your Databricks workspace, you'll need to set up authentication using a Personal Access Token (PAT).
Creating a Personal Access Token
- Log in to your Databricks workspace
- Click on your username in the top-right corner and select "User Settings"
- Go to the "Developer" tab
- Click "Manage" next to "Access tokens"
- Click "Generate new token"
- Give your token a description and set an expiration (optional)
- Copy the generated token and store it securely
See the Databricks documentation for more details.
Environment Setup
To get the values for DATABRICKS_HOST and DATABRICKS_WAREHOUSE, refer to the Databricks documentation.
Set your Databricks credentials as environment variables:
export DATABRICKS_TOKEN="your-personal-access-token"
export DATABRICKS_HOST="dbc-xxxxxxxx-xxxx.cloud.databricks.com"
export DATABRICKS_WAREHOUSE="/sql/1.0/warehouses/your-warehouse-id"
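Atlas combines these three values into a single Databricks connection URL of the form databricks://<token>@<host>:443<warehouse-http-path>, which is used throughout this guide. As a quick sanity check you can assemble and print it yourself; the ATLAS_URL variable below is just a local convenience, not something Atlas requires:
# Compose the connection URL used by the commands in this guide
export ATLAS_URL="databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE"
echo "$ATLAS_URL"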
Creating a Databricks Schema
Let's start by creating a schema in your Databricks workspace. You can do this through the Databricks UI or by running SQL commands in a notebook or SQL editor:
-- Create an initial table
CREATE TABLE IF NOT EXISTS users (
id BIGINT NOT NULL,
email STRING,
display_name STRING,
created_at TIMESTAMP
);
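If you want the table to live in a specific catalog and schema rather than your workspace defaults, you can select them before running the CREATE TABLE statement. A minimal sketch, assuming Unity Catalog is enabled; the main catalog name is illustrative:
-- Select the catalog and schema the table should be created in
USE CATALOG main;        -- illustrative catalog name
USE SCHEMA `default`;    -- the schema inspected later in this guide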
Inspecting the Schema
The atlas schema inspect command supports reading the database description provided by a URL and outputting it in different formats, including Atlas DDL (default), SQL, and JSON. We will demonstrate the flow using both the Atlas DDL and SQL formats.
Atlas DDL (HCL)
To inspect our Databricks schema, use the -u flag and write the output to a file named schema.hcl:
atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" > schema.hcl
Open the schema.hcl file to view the Atlas schema that describes our database.
table "users" {
schema = schema.default
column "id" {
null = false
type = BIGINT
}
column "email" {
null = true
type = STRING
}
column "display_name" {
null = true
type = STRING
}
column "created_at" {
null = true
type = TIMESTAMP
default = sql("current_timestamp()")
}
property "delta.checkpoint.writeStatsAsJson" {
value = "false"
}
property "delta.checkpoint.writeStatsAsStruct" {
value = "true"
}
property "delta.enableDeletionVectors" {
value = "true"
}
property "delta.feature.appendOnly" {
value = "supported"
}
property "delta.feature.deletionVectors" {
value = "supported"
}
property "delta.feature.invariants" {
value = "supported"
}
property "delta.minReaderVersion" {
value = "3"
}
property "delta.minWriterVersion" {
value = "7"
}
property "delta.parquet.compression.codec" {
value = "zstd"
}
}
schema "default" {
}
SQL
To inspect our Databricks schema, use the -u flag and write the output to a file named schema.sql:
atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" --format '{{ sql . }}' > schema.sql
Open the schema.sql file to view the inspected SQL schema that describes our database.
-- Add new schema named "default"
CREATE SCHEMA `default`;
-- Create "users" table
CREATE TABLE `default`.`users` (
`id` BIGINT NOT NULL,
`email` STRING,
`display_name` STRING,
`created_at` TIMESTAMP DEFAULT current_timestamp()
) TBLPROPERTIES ('delta.checkpoint.writeStatsAsJson' = 'false', 'delta.checkpoint.writeStatsAsStruct' = 'true', 'delta.enableDeletionVectors' = 'true', 'delta.feature.allowColumnDefaults' = 'supported', 'delta.feature.appendOnly' = 'supported', 'delta.feature.deletionVectors' = 'supported', 'delta.feature.invariants' = 'supported', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7', 'delta.parquet.compression.codec' = 'zstd');
For in-depth details on the atlas schema inspect command, covering aspects like inspecting specific schemas, handling multiple schemas concurrently, excluding tables, and more, refer to our documentation here.
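For instance, inspection can be narrowed to a single schema or made to skip certain tables with the schema and exclude filters covered there; a brief sketch (the exclude pattern is illustrative, see the docs for the exact matching rules):
# Inspect only the "default" schema
atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" --schema default

# Skip tables you don't want Atlas to manage
atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" --exclude "default.tmp_*"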
Declarative Migrations
Atlas supports a workflow called declarative schema migrations. In this workflow, you first define the desired state of your database schema (in one of many supported formats and languages). Then, Atlas calculates the diff between the desired state and the actual state of your database, and generates the SQL commands that will bring your database to the desired state.
Let's see this in action.
First, create a new file named schema.sql. This file will contain the desired state of our database in plain SQL.
-- Create "users" table
CREATE TABLE IF NOT EXISTS users (
id BIGINT NOT NULL,
email STRING,
display_name STRING,
created_at TIMESTAMP,
CONSTRAINT `pk_users` PRIMARY KEY (`id`)
);
-- Create "posts" table with foreign key relationship
CREATE TABLE `posts` (
`id` BIGINT NOT NULL,
`title` STRING,
`content` STRING,
`author_id` BIGINT,
`created_at` TIMESTAMP,
CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `users` (`id`)
);
Applying our Schema
To apply our desired state to our Databricks database, we need to provide Atlas with two database connections:
- The target database connection (the database we want to apply these changes to)
- A dev database connection (a separate database used by Atlas to normalize our schema and generate the migration plan)
Let's create a new catalog in our Databricks workspace for Atlas to use as our dev database:
-- Create a new catalog named "dev"
CREATE CATALOG IF NOT EXISTS dev;
Next, we apply this schema to our database using the atlas schema apply command.
atlas schema apply \
-u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" \
--to file://schema.sql --tx-mode="none" \
--dev-url "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE?catalog=dev"
Atlas will connect to our target database to inspect its current state. Next, it will use the dev database to normalize our schema and generate the SQL commands that will bring our database to the desired state:
Planning migration statements (2 in total):
-- add primary key "pk_users" to "users" table:
-> ALTER TABLE `default`.`users` ADD CONSTRAINT `pk_users` PRIMARY KEY (`id`);
-- create "posts" table:
-> CREATE TABLE IF NOT EXISTS `default`.`posts` (
`id` BIGINT NOT NULL,
`title` STRING,
`content` STRING,
`author_id` BIGINT,
`created_at` TIMESTAMP,
CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `default`.`users` (`id`)
) TBLPROPERTIES ('delta.checkpoint.writeStatsAsJson' = 'false', 'delta.checkpoint.writeStatsAsStruct' = 'true', 'delta.enableDeletionVectors' = 'true', 'delta.feature.appendOnly' = 'supported', 'delta.feature.deletionVectors' = 'supported', 'delta.feature.invariants' = 'supported', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7', 'delta.parquet.compression.codec' = 'zstd');
-------------------------------------------
Analyzing planned statements (2 in total):
-- no diagnostics found
-------------------------
-- 12.222772167s
-- 2 schema changes
-------------------------------------------
? Approve or abort the plan:
▸ Approve and apply
Abort
After applying the schema, Atlas confirms that the changes were applied:
Applying approved migration (2 statements in total):
-- add primary key "pk_users" to "users" table
-> ALTER TABLE `default`.`users` ADD CONSTRAINT `pk_users` PRIMARY KEY (`id`);
-- ok (730.959ms)
-- create "posts" table
-> CREATE TABLE IF NOT EXISTS `default`.`posts` (
`id` BIGINT NOT NULL,
`title` STRING,
`content` STRING,
`author_id` BIGINT,
`created_at` TIMESTAMP,
CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `default`.`users` (`id`)
) TBLPROPERTIES ('delta.checkpoint.writeStatsAsJson' = 'false', 'delta.checkpoint.writeStatsAsStruct' = 'true', 'delta.enableDeletionVectors' = 'true', 'delta.feature.appendOnly' = 'supported', 'delta.feature.deletionVectors' = 'supported', 'delta.feature.invariants' = 'supported', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7', 'delta.parquet.compression.codec' = 'zstd');
-- ok (1.638282292s)
-------------------------
-- 2.369365125s
-- 1 migration
-- 2 sql statements
Next, let's re-run the atlas schema apply command. This time, Atlas will detect that the database is already in the desired state and will not generate any changes:
Schema is synced, no changes to be made
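As an aside, when running atlas schema apply in automation (for example, in CI) you may not want the interactive approval prompt. The command accepts an --auto-approve flag that applies the generated plan without asking; use it with care:
atlas schema apply \
  -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" \
  --to file://schema.sql --tx-mode="none" \
  --dev-url "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE?catalog=dev" \
  --auto-approve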
Altering our Schema
Now, let's make some changes to our schema. Open the schema.sql file and add a new column to the users table:
CREATE TABLE IF NOT EXISTS users (
  id BIGINT NOT NULL,
  email STRING,
+ bio STRING, -- new column
  display_name STRING,
  created_at TIMESTAMP,
  CONSTRAINT `pk_users` PRIMARY KEY (`id`)
);
Next, let's re-run the atlas schema apply command. This time, Atlas will detect that the schema has changed and will generate the needed SQL commands to bring the database to the desired state:
Planning migration statements (1 in total):
-- add column "bio" to "users" table:
-> ALTER TABLE `default`.`users` ADD COLUMN `bio` STRING;
-------------------------------------------
Analyzing planned statements (1 in total):
-- no diagnostics found
-------------------------
-- 6.236967833s
-- 1 schema change
-------------------------------------------
? Approve or abort the plan:
▸ Approve and apply
Abort
After applying the changes, Atlas confirms once again that the changes were applied:
Applying approved migration (1 statement in total):
-- add column "bio" to "users" table
-> ALTER TABLE `default`.`users` ADD COLUMN `bio` STRING;
-- ok (1.03650425s)
-------------------------
-- 1.036546542s
-- 1 migration
-- 1 sql statement
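Tip: rather than repeating the long connection URLs on every command, you can keep them in an Atlas project file and select them with --env. A minimal sketch of an atlas.hcl, assuming the standard env block and the getenv function:
env "databricks" {
  src = "file://schema.sql"
  url = "databricks://${getenv("DATABRICKS_TOKEN")}@${getenv("DATABRICKS_HOST")}:443${getenv("DATABRICKS_WAREHOUSE")}"
  dev = "databricks://${getenv("DATABRICKS_TOKEN")}@${getenv("DATABRICKS_HOST")}:443${getenv("DATABRICKS_WAREHOUSE")}?catalog=dev"
}
With this in place, the apply step from above becomes atlas schema apply --env databricks --tx-mode="none".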
Visualizing our Schema
One of the most useful features of Atlas is the ability to visualize your database schema. To do so, run the atlas schema inspect command with the -w (web) flag:
atlas schema inspect -w -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE"
Atlas will ask whether you would like to create your visualization publicly (in a publicly accessible URL) or privately (in your Atlas Cloud account):
? Where would you like to share your schema visualization?:
▸ Publicly (gh.atlasgo.cloud)
Privately (your-username.atlasgo.cloud)
For this demo, let's choose the public option. Atlas will create the visualization and open it in your default browser, showing your Databricks schema with tables and their relationships.
See it for yourself at: https://gh.atlasgo.cloud/explore/5116ede7
Wrapping Up
In this guide, we demonstrated how to set up Atlas to manage your Databricks database schema using Unity Catalog. We also demonstrated how to use some of Atlas's basic capabilities, such as declarative schema migrations and schema visualization, with a Databricks database. These two features are just the tip of the iceberg. Atlas has many more features that can help you better manage your database! To learn more, check out the Atlas documentation.
As always, we would love to hear your feedback and suggestions on our Discord server.