Automatic Databricks Schema Migrations with Atlas

Databricks is a unified analytics platform that brings data engineering, data science, and machine learning together in one place. Built on Apache Spark, it provides a collaborative workspace for processing large-scale data workloads with high performance and reliability.

However, managing database schemas in Databricks can be challenging, especially when working with Unity Catalog's three-level namespace (catalog.schema.table) and coordinating schema changes across multiple teams and workspaces.

Enter: Atlas

Atlas helps developers manage their database schema as code, abstracting away the intricacies of database schema management. With Atlas, users provide the desired state of the database schema, and Atlas automatically plans the required migrations.

In this guide, we will dive into setting up Atlas for Databricks schema migration using the declarative workflow.

note

The Databricks Driver is currently in beta and should be used with caution in production environments.

Prerequisites

  1. Databricks workspace and SQL warehouse:
    • Access to a Databricks workspace
    • A running SQL warehouse or compute cluster
    • Personal Access Token (PAT) for authentication
  2. Atlas installed on your machine:

To download and install the beta release of the Atlas CLI, simply run the following in your terminal:

curl -sSf https://atlasgo.sh | ATLAS_VERSION="beta" sh
  3. An Atlas Pro account. To use the Databricks driver, run:
$ atlas login
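
To confirm the CLI was installed correctly and check which release you are running, you can use:

atlas version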

Setting up Databricks Authentication

To connect Atlas to your Databricks workspace, you'll need to set up authentication using a Personal Access Token (PAT).

Creating a Personal Access Token

  1. Log in to your Databricks workspace
  2. Click on your username in the top-right corner and select "User Settings"
  3. Go to the "Developer" tab
  4. Click "Manage" next to "Access tokens"
  5. Click "Generate new token"
  6. Give your token a description and set an expiration (optional)
  7. Copy the generated token and store it securely

See the Databricks documentation for more details.
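
If you already use the Databricks CLI, you can also create a token from the command line. A minimal sketch, assuming the CLI is installed and authenticated against your workspace (check databricks tokens create --help for the exact flags in your CLI version):

# Create a PAT valid for 7 days; the comment is just a label for the token
databricks tokens create --comment "atlas-migrations" --lifetime-seconds 604800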

Environment Setup

To find the values for DATABRICKS_HOST and DATABRICKS_WAREHOUSE, refer to the Databricks documentation.

Set your Databricks credentials as environment variables:

export DATABRICKS_TOKEN="your-personal-access-token"
export DATABRICKS_HOST="dbc-xxxxxxxx-xxxx.cloud.databricks.com"
export DATABRICKS_WAREHOUSE="/sql/1.0/warehouses/your-warehouse-id"
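
Before handing these credentials to Atlas, you can sanity-check them by listing your SQL warehouses through the Databricks REST API. A quick check, assuming the standard /api/2.0/sql/warehouses endpoint is available in your workspace:

# A JSON list of warehouses in the response means the token and host are valid
curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://$DATABRICKS_HOST/api/2.0/sql/warehouses"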

Creating a Databricks Schema

Let's start by creating an initial table in your Databricks workspace. You can do this through the Databricks UI or by running SQL commands in a notebook or the SQL editor:

-- Create an initial table
CREATE TABLE IF NOT EXISTS users (
  id BIGINT NOT NULL,
  email STRING,
  display_name STRING,
  created_at TIMESTAMP
);
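
To confirm the table exists before moving on, you can run a quick check in the same editor (assuming the table was created in the default schema):

-- Verify that the new table is in place
SHOW TABLES IN default;
DESCRIBE TABLE default.users;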

Inspecting the Schema

The atlas schema inspect command supports reading the database description provided by a URL and outputting it in different formats, including Atlas DDL (default), SQL, and JSON. We will demonstrate the flow using both the Atlas DDL and SQL formats.

To inspect our Databricks schema, use the -u flag and write the output to a file named schema.hcl:

atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" > schema.hcl

Open the schema.hcl file to view the Atlas schema that describes our database.

schema.hcl
table "users" {
schema = schema.default
column "id" {
null = false
type = BIGINT
}
column "email" {
null = true
type = STRING
}
column "display_name" {
null = true
type = STRING
}
column "created_at" {
null = true
type = TIMESTAMP
default = sql("current_timestamp()")
}
property "delta.checkpoint.writeStatsAsJson" {
value = "false"
}
property "delta.checkpoint.writeStatsAsStruct" {
value = "true"
}
property "delta.enableDeletionVectors" {
value = "true"
}
property "delta.feature.appendOnly" {
value = "supported"
}
property "delta.feature.deletionVectors" {
value = "supported"
}
property "delta.feature.invariants" {
value = "supported"
}
property "delta.minReaderVersion" {
value = "3"
}
property "delta.minWriterVersion" {
value = "7"
}
property "delta.parquet.compression.codec" {
value = "zstd"
}
}
schema "default" {
}
info

For in-depth details on the atlas schema inspect command, covering aspects like inspecting specific schemas, handling multiple schemas concurrently, excluding tables, and more, refer to our documentation here.
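
Since the rest of this guide defines the desired state in plain SQL, it is worth noting that the same inspection can output SQL instead of Atlas DDL by passing a format template:

atlas schema inspect \
  -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" \
  --format '{{ sql . }}' > schema.sql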

Declarative Migrations

Atlas supports a workflow called declarative schema migrations. In this workflow, you first define the desired state of your database schema (in one of many supported formats and languages). Then, Atlas calculates the diff between the desired state and the actual state of your database, and generates the SQL commands that will bring your database to the desired state.

Let's see this in action.

First, create a new file named schema.sql. This file will contain the desired state of our database in plain SQL.

schema.sql
-- Create "users" table
CREATE TABLE IF NOT EXISTS users (
id BIGINT NOT NULL,
email STRING,
display_name STRING,
created_at TIMESTAMP,
CONSTRAINT `pk_users` PRIMARY KEY (`id`)
);

-- Create "posts" table with foreign key relationship
CREATE TABLE `posts` (
`id` BIGINT NOT NULL,
`title` STRING,
`content` STRING,
`author_id` BIGINT,
`created_at` TIMESTAMP,
CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `users` (`id`)
);

Applying our Schema

To apply our desired state to our Databricks database, we need to provide Atlas with two database connections:

  1. The target database connection (the database we want to apply these changes to)
  2. A dev database connection (a separate database used by Atlas to normalize our schema and generate the migration plan)

Let's create a new catalog in our Databricks workspace to use as our dev database:

-- Create a new catalog named "dev"
CREATE CATALOG IF NOT EXISTS dev;
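
With the dev catalog in place, you can also preview the changes Atlas plans to make, without modifying the target database, by diffing the live schema against the desired state. A sketch reusing the same connection URLs:

atlas schema diff \
  --from "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" \
  --to file://schema.sql \
  --dev-url "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE?catalog=dev"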

Next, we apply this schema to our database using the atlas schema apply command.

atlas schema apply \
-u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" \
--to file://schema.sql --tx-mode="none" \
--dev-url "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE?catalog=dev"

Atlas will connect to our target database to inspect its current state. Next, it will use the dev database to normalize our schema and generate the SQL commands that will bring our database to the desired state:

Planning migration statements (2 in total):

-- add primary key "pk_users" to "users" table:
-> ALTER TABLE `default`.`users` ADD CONSTRAINT `pk_users` PRIMARY KEY (`id`);
-- create "posts" table:
-> CREATE TABLE IF NOT EXISTS `default`.`posts` (
  `id` BIGINT NOT NULL,
  `title` STRING,
  `content` STRING,
  `author_id` BIGINT,
  `created_at` TIMESTAMP,
  CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `default`.`users` (`id`)
) TBLPROPERTIES ('delta.checkpoint.writeStatsAsJson' = 'false', 'delta.checkpoint.writeStatsAsStruct' = 'true', 'delta.enableDeletionVectors' = 'true', 'delta.feature.appendOnly' = 'supported', 'delta.feature.deletionVectors' = 'supported', 'delta.feature.invariants' = 'supported', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7', 'delta.parquet.compression.codec' = 'zstd');

-------------------------------------------

Analyzing planned statements (2 in total):

-- no diagnostics found

-------------------------
-- 12.222772167s
-- 2 schema changes

-------------------------------------------

? Approve or abort the plan:
▸ Approve and apply
Abort

After applying the schema, Atlas confirms that the changes were applied:

Applying approved migration (2 statements in total):

-- add primary key "pk_users" to "users" table
-> ALTER TABLE `default`.`users` ADD CONSTRAINT `pk_users` PRIMARY KEY (`id`);
-- ok (730.959ms)

-- create "posts" table
-> CREATE TABLE IF NOT EXISTS `default`.`posts` (
  `id` BIGINT NOT NULL,
  `title` STRING,
  `content` STRING,
  `author_id` BIGINT,
  `created_at` TIMESTAMP,
  CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `default`.`users` (`id`)
) TBLPROPERTIES ('delta.checkpoint.writeStatsAsJson' = 'false', 'delta.checkpoint.writeStatsAsStruct' = 'true', 'delta.enableDeletionVectors' = 'true', 'delta.feature.appendOnly' = 'supported', 'delta.feature.deletionVectors' = 'supported', 'delta.feature.invariants' = 'supported', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7', 'delta.parquet.compression.codec' = 'zstd');
-- ok (1.638282292s)

-------------------------
-- 2.369365125s
-- 1 migration
-- 2 sql statements

Next, let's re-run the atlas schema apply command. This time, Atlas will detect that the database is already in the desired state and will not generate any changes:

Schema is synced, no changes to be made
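
In non-interactive environments such as CI, the approval prompt can be skipped. For example, the --dry-run flag prints the planned statements without executing them, and --auto-approve applies them without asking for confirmation (a sketch; see atlas schema apply --help for the full list of flags):

# Print the migration plan without applying it
atlas schema apply \
  -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" \
  --to file://schema.sql --tx-mode="none" \
  --dev-url "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE?catalog=dev" \
  --dry-run

Replacing --dry-run with --auto-approve applies the plan without the prompt.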

Altering our Schema

Now, let's make some changes to our schema. Open the schema.sql file and add a new column to the users table:

schema.sql
CREATE TABLE IF NOT EXISTS users (
  id BIGINT NOT NULL,
  email STRING,
+ bio STRING, -- new column
  display_name STRING,
  created_at TIMESTAMP,
  CONSTRAINT `pk_users` PRIMARY KEY (`id`)
);

Next, let's re-run the atlas schema apply command. This time, Atlas will detect that the schema has changed and will generate the needed SQL commands to bring the database to the desired state:

Planning migration statements (1 in total):

-- add column "bio" to "users" table:
-> ALTER TABLE `default`.`users` ADD COLUMN `bio` STRING;

-------------------------------------------

Analyzing planned statements (1 in total):

-- no diagnostics found

-------------------------
-- 6.236967833s
-- 1 schema change

-------------------------------------------

? Approve or abort the plan:
▸ Approve and apply
Abort

After applying the changes, Atlas confirms once again that the changes were applied:

Applying approved migration (1 statement in total):

-- add column "bio" to "users" table
-> ALTER TABLE `default`.`users` ADD COLUMN `bio` STRING;
-- ok (1.03650425s)

-------------------------
-- 1.036546542s
-- 1 migration
-- 1 sql statement
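
You can also verify the change directly in Databricks, for example from a notebook or the SQL editor:

-- The output should now include the new "bio" column
DESCRIBE TABLE default.users;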

Visualizing our Schema

One of the most useful features of Atlas is the ability to visualize your database schema. To do so, run the atlas schema inspect command with the -w (web) flag:

atlas schema inspect -w -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE"

Atlas will ask whether you would like to create your visualization publicly (in a publicly accessible URL) or privately (in your Atlas Cloud account):

? Where would you like to share your schema visualization?:
▸ Publicly (gh.atlasgo.cloud)
Privately (your-username.atlasgo.cloud)

For this demo, let's choose the public option. Atlas will create the visualization and open it in your default browser, showing your Databricks schema with tables and their relationships.

See it for yourself at: https://gh.atlasgo.cloud/explore/5116ede7

Wrapping Up

In this guide, we demonstrated how to set up Atlas to manage your Databricks database schema using Unity Catalog. We also demonstrated how to use some of Atlas's basic capabilities, such as declarative schema migrations and schema visualization, with a Databricks database. These two features are just the tip of the iceberg. Atlas has many more features that can help you better manage your database! To learn more, check out the Atlas documentation.

As always, we would love to hear your feedback and suggestions on our Discord server.