Automatic Databricks Schema Migrations with Atlas
Databricks is a unified analytics platform that brings data engineering, data science, and machine learning together in one place. Built on Apache Spark, it provides a collaborative workspace for processing large-scale data workloads with high performance and reliability.
However, managing database schemas in Databricks can be challenging, especially when working with Unity Catalog's three-level namespace (catalog.schema.table) and coordinating schema changes across multiple teams and workspaces.
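To make that naming concrete, every table in Unity Catalog is addressed by catalog, schema, and table. A quick illustrative query (the main catalog and default schema here are just example names):
-- Fully qualified three-level reference: catalog.schema.table
SELECT * FROM main.default.users;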
Enter: Atlas
Atlas helps developers manage their database schema as code, abstracting away the intricacies of database schema management. With Atlas, users provide the desired state of the database schema, and Atlas automatically plans the required migrations.
In this guide, we will dive into setting up Atlas for Databricks schema migration using the declarative workflow.
The Databricks Driver is currently in beta and should be used with caution in production environments.
Prerequisites
- Databricks workspace and SQL warehouse:
- Access to a Databricks workspace
- A running SQL warehouse or compute cluster
- Personal Access Token (PAT) for authentication
- Atlas installed on your machine:
macOS + Linux
To download and install the beta release of the Atlas CLI, simply run the following in your terminal:
curl -sSf https://atlasgo.sh | ATLAS_VERSION="beta" sh
Docker
To pull the Atlas image and run it as a Docker container:
docker pull arigaio/atlas-beta
docker run --rm arigaio/atlas-beta --help
If the container needs access to the host network or a local directory, use the --net=host flag and mount the desired directory:
docker run --rm --net=host \
  -v $(pwd)/migrations:/migrations \
  arigaio/atlas-beta migrate apply \
  --url "mysql://root:pass@:3306/test"
Windows
Download the beta release and move the atlas binary to a file location on your system PATH.
- An Atlas Pro account. To use the Databricks driver, run:
$ atlas login
Setting up Databricks Authentication
To connect Atlas to your Databricks workspace, you'll need to set up authentication using a Personal Access Token (PAT).
Creating a Personal Access Token
- Log in to your Databricks workspace
- Click on your username in the top-right corner and select "User Settings"
- Go to the "Developer" tab
- Click "Manage" next to "Access tokens"
- Click "Generate new token"
- Give your token a description and set an expiration (optional)
- Copy the generated token and store it securely
See the Databricks documentation for more details.
Environment Setup
To get the values for DATABRICKS_HOST and DATABRICKS_WAREHOUSE, refer to the Databricks documentation.
Set your Databricks credentials as environment variables:
export DATABRICKS_TOKEN="your-personal-access-token"
export DATABRICKS_HOST="dbc-xxxxxxxx-xxxx.cloud.databricks.com"
export DATABRICKS_WAREHOUSE="/sql/1.0/warehouses/your-warehouse-id"
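Atlas combines these three values into a single Databricks connection URL of the form databricks://<token>@<host>:443<warehouse-http-path>, which is used throughout this guide. As a quick sanity check you can assemble and print it yourself; the ATLAS_URL variable below is just a local convenience, not something Atlas requires:
# Compose the connection URL used by the commands in this guide
export ATLAS_URL="databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE"
echo "$ATLAS_URL"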
Creating a Databricks Schema
Let's start by creating a schema in your Databricks workspace. You can do this through the Databricks UI or by running SQL commands in a notebook or SQL editor:
-- Create an initial table
CREATE TABLE IF NOT EXISTS users (
id BIGINT NOT NULL,
email STRING,
display_name STRING,
created_at TIMESTAMP
);
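If you want the table to live in a specific catalog and schema rather than your workspace defaults, you can select them before running the CREATE TABLE statement. A minimal sketch, assuming Unity Catalog is enabled; the main catalog name is illustrative:
-- Select the catalog and schema the table should be created in
USE CATALOG main;        -- illustrative catalog name
USE SCHEMA `default`;    -- the schema inspected later in this guide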
Inspecting the Schema
The atlas schema inspect command supports reading the database description provided by a URL and outputting it in different formats, including Atlas DDL (default), SQL, and JSON. We will demonstrate the flow using both the Atlas DDL and SQL formats.
Atlas DDL (HCL)
To inspect our Databricks schema, use the -u flag and write the output to a file named schema.hcl:
atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" > schema.hcl
Open the schema.hcl file to view the Atlas schema that describes our database.
table "users" {
schema = schema.default
column "id" {
null = false
type = BIGINT
}
column "email" {
null = true
type = STRING
}
column "display_name" {
null = true
type = STRING
}
column "created_at" {
null = true
type = TIMESTAMP
default = sql("current_timestamp()")
}
property "delta.checkpoint.writeStatsAsJson" {
value = "false"
}
property "delta.checkpoint.writeStatsAsStruct" {
value = "true"
}
property "delta.enableDeletionVectors" {
value = "true"
}
property "delta.feature.appendOnly" {
value = "supported"
}
property "delta.feature.deletionVectors" {
value = "supported"
}
property "delta.feature.invariants" {
value = "supported"
}
property "delta.minReaderVersion" {
value = "3"
}
property "delta.minWriterVersion" {
value = "7"
}
property "delta.parquet.compression.codec" {
value = "zstd"
}
}
schema "default" {
}
SQL
To inspect our Databricks schema, use the -u flag and write the output to a file named schema.sql:
atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" --format '{{ sql . }}' > schema.sql
Open the schema.sql file to view the inspected SQL schema that describes our database.
-- Add new schema named "default"
CREATE SCHEMA `default`;
-- Create "users" table
CREATE TABLE `default`.`users` (
`id` BIGINT NOT NULL,
`email` STRING,
`display_name` STRING,
`created_at` TIMESTAMP DEFAULT current_timestamp()
) TBLPROPERTIES ('delta.checkpoint.writeStatsAsJson' = 'false', 'delta.checkpoint.writeStatsAsStruct' = 'true', 'delta.enableDeletionVectors' = 'true', 'delta.feature.allowColumnDefaults' = 'supported', 'delta.feature.appendOnly' = 'supported', 'delta.feature.deletionVectors' = 'supported', 'delta.feature.invariants' = 'supported', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7', 'delta.parquet.compression.codec' = 'zstd');
For in-depth details on the atlas schema inspect command, covering aspects like inspecting specific schemas, handling multiple schemas concurrently, excluding tables, and more, refer to our documentation here.
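For instance, inspection can be narrowed to a single schema or made to skip certain tables with the schema and exclude filters covered there; a brief sketch (the exclude pattern is illustrative, see the docs for the exact matching rules):
# Inspect only the "default" schema
atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" --schema default

# Skip tables you don't want Atlas to manage
atlas schema inspect -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" --exclude "default.tmp_*"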
Declarative Migrations
Atlas supports a workflow called declarative schema migrations. In this workflow, you first define the desired state of your database schema (in one of many supported formats and languages). Then, Atlas calculates the diff between the desired state and the actual state of your database, and generates the SQL commands that will bring your database to the desired state.
Let's see this in action.
First, create a new file named schema.sql. This file will contain the desired state of our database in plain SQL.
-- Create "users" table
CREATE TABLE IF NOT EXISTS users (
id BIGINT NOT NULL,
email STRING,
display_name STRING,
created_at TIMESTAMP,
CONSTRAINT `pk_users` PRIMARY KEY (`id`)
);
-- Create "posts" table with foreign key relationship
CREATE TABLE `posts` (
`id` BIGINT NOT NULL,
`title` STRING,
`content` STRING,
`author_id` BIGINT,
`created_at` TIMESTAMP,
CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `users` (`id`)
);
Applying our Schema
To apply our desired state to our Databricks database, we need to provide Atlas with two database connections:
- The target database connection (the database we want to apply these changes to)
- A dev database connection (a separate database used by Atlas to normalize our schema and generate the migration plan)
Let's create a new catalog in our Databricks workspace for Atlas to use as our dev database:
-- Create a new catalog named "dev"
CREATE CATALOG IF NOT EXISTS dev;
Next, we apply this schema to our database using the atlas schema apply command.
atlas schema apply \
-u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" \
--to file://schema.sql --tx-mode="none" \
--dev-url "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE?catalog=dev"
Atlas will connect to our target database to inspect its current state. Next, it will use the dev database to normalize our schema and generate the SQL commands that will bring our database to the desired state:
Planning migration statements (2 in total):
-- add primary key "pk_users" to "users" table:
-> ALTER TABLE `default`.`users` ADD CONSTRAINT `pk_users` PRIMARY KEY (`id`);
-- create "posts" table:
-> CREATE TABLE IF NOT EXISTS `default`.`posts` (
`id` BIGINT NOT NULL,
`title` STRING,
`content` STRING,
`author_id` BIGINT,
`created_at` TIMESTAMP,
CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `default`.`users` (`id`)
) TBLPROPERTIES ('delta.checkpoint.writeStatsAsJson' = 'false', 'delta.checkpoint.writeStatsAsStruct' = 'true', 'delta.enableDeletionVectors' = 'true', 'delta.feature.appendOnly' = 'supported', 'delta.feature.deletionVectors' = 'supported', 'delta.feature.invariants' = 'supported', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7', 'delta.parquet.compression.codec' = 'zstd');
-------------------------------------------
Analyzing planned statements (2 in total):
-- no diagnostics found
-------------------------
-- 12.222772167s
-- 2 schema changes
-------------------------------------------
? Approve or abort the plan:
▸ Approve and apply
Abort
After applying the schema, Atlas confirms that the changes were applied:
Applying approved migration (2 statements in total):
-- add primary key "pk_users" to "users" table
-> ALTER TABLE `default`.`users` ADD CONSTRAINT `pk_users` PRIMARY KEY (`id`);
-- ok (730.959ms)
-- create "posts" table
-> CREATE TABLE IF NOT EXISTS `default`.`posts` (
`id` BIGINT NOT NULL,
`title` STRING,
`content` STRING,
`author_id` BIGINT,
`created_at` TIMESTAMP,
CONSTRAINT `fk_posts_author` FOREIGN KEY (`author_id`) REFERENCES `default`.`users` (`id`)
) TBLPROPERTIES ('delta.checkpoint.writeStatsAsJson' = 'false', 'delta.checkpoint.writeStatsAsStruct' = 'true', 'delta.enableDeletionVectors' = 'true', 'delta.feature.appendOnly' = 'supported', 'delta.feature.deletionVectors' = 'supported', 'delta.feature.invariants' = 'supported', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7', 'delta.parquet.compression.codec' = 'zstd');
-- ok (1.638282292s)
-------------------------
-- 2.369365125s
-- 1 migration
-- 2 sql statements
Next, let's re-run the atlas schema apply command. This time, Atlas will detect that the database is already in the desired state and will not generate any changes:
Schema is synced, no changes to be made
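As an aside, when running atlas schema apply in automation (for example, in CI) you may not want the interactive approval prompt. The command accepts an --auto-approve flag that applies the generated plan without asking; use it with care:
atlas schema apply \
  -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE" \
  --to file://schema.sql --tx-mode="none" \
  --dev-url "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE?catalog=dev" \
  --auto-approve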
Altering our Schema
Now, let's make some changes to our schema. Open the schema.sql file and add a new column to the users table:
CREATE TABLE IF NOT EXISTS users (
  id BIGINT NOT NULL,
  email STRING,
+ bio STRING, -- new column
  display_name STRING,
  created_at TIMESTAMP,
  CONSTRAINT `pk_users` PRIMARY KEY (`id`)
);
Next, let's re-run the atlas schema apply command. This time, Atlas will detect that the schema has changed and will generate the needed SQL commands to bring the database to the desired state:
Planning migration statements (1 in total):
-- add column "bio" to "users" table:
-> ALTER TABLE `default`.`users` ADD COLUMN `bio` STRING;
-------------------------------------------
Analyzing planned statements (1 in total):
-- no diagnostics found
-------------------------
-- 6.236967833s
-- 1 schema change
-------------------------------------------
? Approve or abort the plan:
▸ Approve and apply
Abort
After applying the changes, Atlas confirms once again that the changes were applied:
Applying approved migration (1 statement in total):
-- add column "bio" to "users" table
-> ALTER TABLE `default`.`users` ADD COLUMN `bio` STRING;
-- ok (1.03650425s)
-------------------------
-- 1.036546542s
-- 1 migration
-- 1 sql statement
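Tip: rather than repeating the long connection URLs on every command, you can keep them in an Atlas project file and select them with --env. A minimal sketch of an atlas.hcl, assuming the standard env block and the getenv function:
env "databricks" {
  src = "file://schema.sql"
  url = "databricks://${getenv("DATABRICKS_TOKEN")}@${getenv("DATABRICKS_HOST")}:443${getenv("DATABRICKS_WAREHOUSE")}"
  dev = "databricks://${getenv("DATABRICKS_TOKEN")}@${getenv("DATABRICKS_HOST")}:443${getenv("DATABRICKS_WAREHOUSE")}?catalog=dev"
}
With this in place, the apply step from above becomes atlas schema apply --env databricks --tx-mode="none".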
Visualizing our Schema
One of the most useful features of Atlas is the ability to visualize your database schema. To do so, run the atlas schema inspect command with the -w (web) flag:
atlas schema inspect -w -u "databricks://$DATABRICKS_TOKEN@$DATABRICKS_HOST:443$DATABRICKS_WAREHOUSE"
Atlas will ask whether you would like to create your visualization publicly (in a publicly accessible URL) or privately (in your Atlas Cloud account):
? Where would you like to share your schema visualization?:
▸ Publicly (gh.atlasgo.cloud)
Privately (your-username.atlasgo.cloud)
For this demo, let's choose the public option. Atlas will create the visualization and open it in your default browser, showing your Databricks schema with tables and their relationships.
See it for yourself at: https://gh.atlasgo.cloud/explore/5116ede7
Wrapping Up
In this guide, we demonstrated how to set up Atlas to manage your Databricks database schema using Unity Catalog. We also demonstrated how to use some of Atlas's basic capabilities, such as declarative schema migrations and schema visualization, with a Databricks database. These two features are just the tip of the iceberg. Atlas has many more features that can help you better manage your database! To learn more, check out the Atlas documentation.
As always, we would love to hear your feedback and suggestions on our Discord server.