Last time we talked about some of the basic dbt functionalities. In this blog post, I want to talk about how we can use GitHub workflows to automate dbt.
For dbt-core users who are not on dbt Cloud, integrating dbt with GitHub workflows is an alternative way to automate model and test runs, and also to host the documentation page.
In this post, we will look at automating the model build and testing process with GitHub Actions, and at using the GitHub Pages action to host the dbt documentation page on GitHub.
What is a GitHub workflow?
According to the official GitHub website, GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that allows you to automate your build, test, and deployment pipeline. You can create workflows that build and test every pull request to your repository, or deploy merged pull requests to production.
Set up dbt with a GitHub workflow
Assume our repo looks something like this: a Snowflake dev warehouse with bronze, silver and gold schemas, one dbt project per schema.
├── README.md
├── bronze_schema
│ ├── analyses
│ ├── dbt_packages
│ ├── dbt_project.yml
│ ├── logs
│ ├── macros
│ ├── models
│ ├── packages.yml
│ ├── seeds
│ ├── snapshots
│ ├── target
│ └── tests
├── silver_schema
│ ├── analyses
│ ├── dbt_packages
│ ├── dbt_project.yml
│ ├── logs
│ ├── macros
│ ├── models
│ ├── packages.yml
│ ├── seeds
│ ├── snapshots
│ ├── target
│ └── tests
└── gold_schema
├── analyses
├── dbt_packages
├── dbt_project.yml
├── logs
├── macros
├── models
├── packages.yml
├── seeds
├── snapshots
├── target
└── tests
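
Each folder is a standalone dbt project with its own dbt_project.yml, pointing at a profile of the same name (we define those profiles in step 3 below). A minimal sketch of what bronze_schema/dbt_project.yml might contain — the version numbers and paths here are assumptions, not taken from the actual repo:

    # bronze_schema/dbt_project.yml -- minimal sketch, details are assumptions
    name: bronze_schema
    version: "1.0.0"
    config-version: 2
    profile: bronze_schema  # must match the profile name in profiles.yml (step 3)
    model-paths: ["models"]
    macro-paths: ["macros"]
    seed-paths: ["seeds"]
    snapshot-paths: ["snapshots"]
    test-paths: ["tests"]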
Job Automation
1 - Set up the workflow yml file
In your dbt GitHub repo, create a new file .github/workflows/job-automation.yml.
Alternatively, you can click Actions > New workflow to create the workflow yml directly on GitHub.
In job-automation.yml, we add the following. The workflow is set to run either on an on-demand manual trigger or daily at 7 am:
name: job-automation

on:
  workflow_dispatch:
  schedule:
    - cron: "0 7 * * *"

env:
  DBT_PROFILES_DIR: ./
  DBT_SNOWFLAKE_ACCOUNT: ${{ secrets.DBT_SNOWFLAKE_ACCOUNT }}
  DBT_SNOWFLAKE_USERNAME: ${{ secrets.DBT_SNOWFLAKE_USERNAME }}
  DBT_SNOWFLAKE_PW: ${{ secrets.DBT_SNOWFLAKE_PW }}
  DBT_SNOWFLAKE_ROLE: ${{ secrets.DBT_SNOWFLAKE_ROLE }}
  DBT_SNOWFLAKE_DATABASE: ${{ secrets.DBT_SNOWFLAKE_DATABASE }}
  DBT_SNOWFLAKE_WAREHOUSE: ${{ secrets.DBT_SNOWFLAKE_WAREHOUSE }}
  DBT_SNOWFLAKE_BRONZE_SCHEMA: bronze_schema
  DBT_SNOWFLAKE_SILVER_SCHEMA: silver_schema
  DBT_SNOWFLAKE_GOLD_SCHEMA: gold_schema

jobs:
  job-automation:
    runs-on: self-hosted
    steps:
      - name: Check out
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9.6"

      # pip install dbt
      - name: Install dependencies
        run: |
          pip install dbt-core==1.3.1
          pip install dbt-snowflake==1.3.0
          dbt deps --project-dir bronze_schema
          dbt deps --project-dir silver_schema
          dbt deps --project-dir gold_schema

      # dbt related commands
      - name: Run dbt models
        run: dbt run --project-dir bronze_schema --models folder.subfolder.model_name

      - name: Test dbt models
        run: dbt test --project-dir gold_schema --select folder.subfolder.model_name
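
Note that the --models and --select flags accept dbt's node selection syntax, so the steps above are not limited to a single model each. A few illustrative variants (the model and tag names below are placeholders):

    # run every model in the bronze project
    dbt run --project-dir bronze_schema
    # run a model plus everything downstream of it
    dbt run --project-dir silver_schema --select my_model+
    # run only models carrying a given tag
    dbt run --project-dir gold_schema --select tag:nightly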
2 - Set up secrets/env vars for the job
Navigate to Settings > Security > Secrets and variables > Actions and add all the secrets and/or env variables for the job. In our case, we will be setting repository secrets.
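
If you prefer the terminal over the UI, the GitHub CLI can set the same repository secrets; the values below are placeholders, of course:

    gh secret set DBT_SNOWFLAKE_ACCOUNT --body "xy12345.eu-west-1"
    gh secret set DBT_SNOWFLAKE_USERNAME --body "dbt_service_user"
    gh secret set DBT_SNOWFLAKE_PW --body "********"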
3 - Set up profiles.yml
At the root of the dbt GitHub repo, create a profiles.yml. It will look something like this:
bronze_schema:
  outputs:
    dev:
      account: "{{ env_var('DBT_SNOWFLAKE_ACCOUNT') }}"
      database: "{{ env_var('DBT_SNOWFLAKE_DATABASE') }}"
      password: "{{ env_var('DBT_SNOWFLAKE_PW') }}"
      role: "{{ env_var('DBT_SNOWFLAKE_ROLE') }}"
      schema: BRONZE_SCHEMA
      threads: 1
      type: snowflake
      user: "{{ env_var('DBT_SNOWFLAKE_USERNAME') }}"
      warehouse: "{{ env_var('DBT_SNOWFLAKE_WAREHOUSE') }}"
  target: dev

silver_schema:
  outputs:
    dev:
      account: "{{ env_var('DBT_SNOWFLAKE_ACCOUNT') }}"
      database: "{{ env_var('DBT_SNOWFLAKE_DATABASE') }}"
      password: "{{ env_var('DBT_SNOWFLAKE_PW') }}"
      role: "{{ env_var('DBT_SNOWFLAKE_ROLE') }}"
      schema: SILVER_SCHEMA
      threads: 1
      type: snowflake
      user: "{{ env_var('DBT_SNOWFLAKE_USERNAME') }}"
      warehouse: "{{ env_var('DBT_SNOWFLAKE_WAREHOUSE') }}"
  target: dev

gold_schema:
  outputs:
    dev:
      account: "{{ env_var('DBT_SNOWFLAKE_ACCOUNT') }}"
      database: "{{ env_var('DBT_SNOWFLAKE_DATABASE') }}"
      password: "{{ env_var('DBT_SNOWFLAKE_PW') }}"
      role: "{{ env_var('DBT_SNOWFLAKE_ROLE') }}"
      schema: GOLD_SCHEMA
      threads: 1
      type: snowflake
      user: "{{ env_var('DBT_SNOWFLAKE_USERNAME') }}"
      warehouse: "{{ env_var('DBT_SNOWFLAKE_WAREHOUSE') }}"
  target: dev
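
Before relying on the workflow, it is worth sanity-checking that each project resolves its profile and can reach Snowflake. Assuming the same DBT_SNOWFLAKE_* env vars are exported locally, something like this should pass:

    # profiles.yml sits at the repo root, same as in the workflow
    export DBT_PROFILES_DIR=./
    dbt debug --project-dir bronze_schema
    dbt debug --project-dir silver_schema
    dbt debug --project-dir gold_schema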
And with the above set up, we are now able to automate our dbt model runs.
As an extra, we can also configure the workflow so that if any models fail, it sends a notification message to a channel such as email, Slack, Telegram, Webex, etc.
The example below shows a task that sends a notification message to a Webex space when any dbt models fail during the build process.
name: job-automation

on:
  workflow_dispatch:
  schedule:
    - cron: "0 7 * * *"

env:
  DBT_PROFILES_DIR: ./
  DBT_SNOWFLAKE_ACCOUNT: ${{ secrets.DBT_SNOWFLAKE_ACCOUNT }}
  DBT_SNOWFLAKE_USERNAME: ${{ secrets.DBT_SNOWFLAKE_USERNAME }}
  DBT_SNOWFLAKE_PW: ${{ secrets.DBT_SNOWFLAKE_PW }}
  DBT_SNOWFLAKE_ROLE: ${{ secrets.DBT_SNOWFLAKE_ROLE }}
  DBT_SNOWFLAKE_DATABASE: ${{ secrets.DBT_SNOWFLAKE_DATABASE }}
  DBT_SNOWFLAKE_WAREHOUSE: ${{ secrets.DBT_SNOWFLAKE_WAREHOUSE }}
  DBT_SNOWFLAKE_BRONZE_SCHEMA: bronze_schema
  DBT_SNOWFLAKE_SILVER_SCHEMA: silver_schema
  DBT_SNOWFLAKE_GOLD_SCHEMA: gold_schema

jobs:
  job-automation:
    runs-on: self-hosted
    steps:
      - name: Check out
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      # pip install dbt
      - name: Install dependencies
        run: |
          pip install dbt-core==1.3.1
          pip install dbt-snowflake==1.3.0
          dbt deps --project-dir bronze_schema
          dbt deps --project-dir silver_schema
          dbt deps --project-dir gold_schema
      - name: Build dbt models and check for failures
        id: dbt_run
        run: |
          # map each dbt project dir to the models it should build
          declare -A SCHEMA_MODELS
          SCHEMA_MODELS[bronze_schema]="project.model project.model2 project.model3"
          SCHEMA_MODELS[silver_schema]="project.model project.model2"
          SCHEMA_MODELS[gold_schema]="project.model"
          FAILED_MODELS=""
          for PROJECT_DIR in "${!SCHEMA_MODELS[@]}"; do
            MODELS=(${SCHEMA_MODELS[$PROJECT_DIR]})
            for MODEL in "${MODELS[@]}"; do
              # collect failing models instead of aborting the whole step
              if ! dbt run --project-dir $PROJECT_DIR --models $MODEL; then
                FAILED_MODELS+="${PROJECT_DIR^^}.${MODEL^^} \\n"
              fi
            done
          done
          # expose the failures to later steps via the job environment
          if [ -n "$FAILED_MODELS" ]; then
            echo "FAILED_MODELS=$FAILED_MODELS" >> $GITHUB_ENV
          fi
        continue-on-error: true

      - name: Notify Webex with dbt build result
        if: env.FAILED_MODELS != ''
        run: |
          curl --request POST \
            --url 'https://webexapis.com/v1/messages' \
            --header 'Authorization: Bearer ${{ secrets.WEBEX_BOT_TOKEN }}' \
            --header 'Content-Type: application/json' \
            --data "{
              \"roomId\": \"${{ secrets.WEBEX_NOTI_ROOM_ID }}\",
              \"markdown\": \"The GitHub Actions DBT build workflow failed with the following DBT model(s):\\n\\n$FAILED_MODELS\"
            }"
      - name: Run dbt tests and check for failures
        id: dbt_test
        run: |
          # map each dbt project dir to the tests it should run
          declare -A SCHEMA_TESTS
          SCHEMA_TESTS[bronze_schema]="project.model project.model2 project.model3"
          SCHEMA_TESTS[silver_schema]="project.model project.model2"
          SCHEMA_TESTS[gold_schema]="project.model"
          FAILED_TESTS=""
          for PROJECT_DIR in "${!SCHEMA_TESTS[@]}"; do
            TESTS=(${SCHEMA_TESTS[$PROJECT_DIR]})
            for TEST in "${TESTS[@]}"; do
              if ! dbt test --project-dir $PROJECT_DIR --select $TEST; then
                FAILED_TESTS+="${PROJECT_DIR^^}.${TEST^^} \\n"
              fi
            done
          done
          if [ -n "$FAILED_TESTS" ]; then
            echo "FAILED_TESTS=$FAILED_TESTS" >> $GITHUB_ENV
          fi
        continue-on-error: true

      - name: Notify Webex with dbt test result
        if: env.FAILED_TESTS != ''
        run: |
          curl --request POST \
            --url 'https://webexapis.com/v1/messages' \
            --header 'Authorization: Bearer ${{ secrets.WEBEX_BOT_TOKEN }}' \
            --header 'Content-Type: application/json' \
            --data "{
              \"roomId\": \"${{ secrets.WEBEX_NOTI_ROOM_ID }}\",
              \"markdown\": \"The dbt test step has failed in the GitHub Actions workflow with the following DBT test(s):\\n\\n$FAILED_TESTS\"
            }"
Hosting the Documentation Page on GitHub
Next, let’s build a workflow that publishes a GitHub page hosting the dbt documentation for the bronze, silver and gold schemas.
First, we have to configure the repo’s Pages settings under Settings > Pages. The deploy step below publishes to the gh-pages branch, so select that branch as the Pages source.
After that, under the .github/workflows dir, we can create a new yml workflow file that looks like below:
name: docs

on:
  workflow_dispatch:
  push:
    branches:
      - main

env:
  DBT_PROFILES_DIR: ./
  DBT_SNOWFLAKE_ACCOUNT: ${{ secrets.DBT_SNOWFLAKE_ACCOUNT }}
  DBT_SNOWFLAKE_USERNAME: ${{ secrets.DBT_SNOWFLAKE_USERNAME }}
  DBT_SNOWFLAKE_PW: ${{ secrets.DBT_SNOWFLAKE_PW }}
  DBT_SNOWFLAKE_ROLE: ${{ secrets.DBT_SNOWFLAKE_ROLE }}
  DBT_SNOWFLAKE_DATABASE: ${{ secrets.DBT_SNOWFLAKE_DATABASE }}
  DBT_SNOWFLAKE_WAREHOUSE: ${{ secrets.DBT_SNOWFLAKE_WAREHOUSE }}
  DBT_SNOWFLAKE_BRONZE_SCHEMA: bronze_schema
  DBT_SNOWFLAKE_SILVER_SCHEMA: silver_schema
  DBT_SNOWFLAKE_GOLD_SCHEMA: gold_schema

jobs:
  docs:
    name: docs
    runs-on: self-hosted
    steps:
      - name: Check out
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      # pip install dbt
      - name: Install dependencies
        run: |
          pip install dbt-core==1.3.1
          pip install dbt-snowflake==1.3.0
          dbt deps --project-dir bronze_schema
          dbt deps --project-dir silver_schema
          dbt deps --project-dir gold_schema

      - name: build single html
        run: |
          dbt docs generate --project-dir bronze_schema
          dbt docs generate --project-dir silver_schema
          dbt docs generate --project-dir gold_schema
          python dbt-docs.py

      - name: Deploy to GitHub Pages
        if: success()
        uses: crazy-max/ghaction-github-pages@v3
        with:
          target_branch: gh-pages
          build_dir: target
        env:
          GITHUB_TOKEN: ${{ secrets.GH_TOKEN }}
Then, under the root dir ./, we need a Python file called dbt-docs.py. What this script essentially does is merge the front-end manifests of each schema into consolidated files so that the page renders properly.
import json
import os
import shutil
from functools import reduce
from pathlib import Path


class SchemaMerger:
    def __init__(self, schemas):
        self.schemas = schemas
        self.base_dir = Path(os.getcwd())
        self.new_target_dir = self.base_dir / "target"
        # the snippet in dbt's index.html that fetches manifest.json/catalog.json,
        # which we will replace with the merged data inlined
        self.search_str = (
            'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'
        )
        self.combined_manifest = {
            "nodes": {},
            "sources": {},
            "exposures": {},
            "metrics": {},
            "selectors": {},
            "tests": {},
            "child_map": {},
            "docs": {},
            "disabled": {},
        }
        self.combined_catalog = {
            "nodes": {},
            "sources": {},
            "exposures": {},
            "metrics": {},
        }

    @staticmethod
    def merge_dicts(a, b):
        # recursively merge dict b into dict a
        for key, value in b.items():
            if isinstance(value, dict):
                if key in a and isinstance(a[key], dict):
                    SchemaMerger.merge_dicts(a[key], value)
                else:
                    a[key] = value
            else:
                a[key] = value

    def load_and_merge_json(self, file_path, combined):
        with open(file_path, "r") as f:
            data = json.load(f)
        self.merge_dicts(combined, data)
        return combined

    @staticmethod
    def replace_content_in_file(file_path, search_str, new_str):
        with open(file_path, "r") as f:
            content = f.read()
        new_content = content.replace(search_str, new_str)
        with open(file_path, "w") as f:
            f.write(new_content)

    def run(self):
        # copy one of the generated index.html files into a fresh target dir
        self.new_target_dir.mkdir(exist_ok=True)
        shutil.copy(
            self.base_dir / self.schemas[0] / "target" / "index.html",
            self.new_target_dir / "index.html",
        )
        # fold every schema's manifest.json and catalog.json into one dict each
        self.combined_manifest = reduce(
            lambda combined, schema: self.load_and_merge_json(
                self.base_dir / schema / "target" / "manifest.json", combined
            ),
            self.schemas,
            self.combined_manifest,
        )
        self.combined_catalog = reduce(
            lambda combined, schema: self.load_and_merge_json(
                self.base_dir / schema / "target" / "catalog.json", combined
            ),
            self.schemas,
            self.combined_catalog,
        )
        with open(self.new_target_dir / "manifest.json", "w") as f:
            json.dump(self.combined_manifest, f)
        with open(self.new_target_dir / "catalog.json", "w") as f:
            json.dump(self.combined_catalog, f)
        # inline the merged manifest/catalog directly into index.html
        new_str = (
            "o=[{label: 'manifest', data: "
            + json.dumps(self.combined_manifest)
            + "},{label: 'catalog', data: "
            + json.dumps(self.combined_catalog)
            + "}]"
        )
        self.replace_content_in_file(
            self.new_target_dir / "index.html", self.search_str, new_str
        )


sf_schemas = ["bronze_schema", "silver_schema", "gold_schema"]
merger = SchemaMerger(sf_schemas)
merger.run()
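
You can also dry-run the merge locally to preview the combined site before pushing, assuming the same Snowflake env vars are exported:

    dbt docs generate --project-dir bronze_schema
    dbt docs generate --project-dir silver_schema
    dbt docs generate --project-dir gold_schema
    python dbt-docs.py
    # then open target/index.html in a browser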
And that’s it. When we now merge code to the main branch, the dbt documentation page will rebuild and redeploy automatically.
Some final thoughts…
This is just one use case; there’s so much more GitHub Actions can do, such as ensuring code quality and security on branches based on git push, automating deployment on merging branches, etc.
That’s it for now.
Thank you for reading, and have a nice day!