Configuration Files in dbt
As mentioned in the previous article, YAML is one of the languages that make dbt powerful. It is used for a variety of purposes, such as configuring, documenting and testing our data transformations. So what are the things that YAML can configure in dbt?
To answer this question, let me first show you the recommended structure of a dbt project. From the structure below, you can see several YAML files, located either in the root directory or inside the models directory.
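A simplified sketch of such a layout, following common dbt conventions (the project name, subfolders and model file names are illustrative, not taken from the original article):

```
my_dbt_project/
├── dbt_project.yml
├── packages.yml
├── models/
│   ├── staging/
│   │   ├── __sources.yml
│   │   └── stg_orders.sql
│   ├── intermediate/
│   │   ├── __models.yml
│   │   └── int_orders_joined.sql
│   └── marts/
│       ├── __models.yml
│       └── fct_orders.sql
├── macros/
│   └── _macros.yml
├── seeds/
├── snapshots/
└── tests/
```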
Root Directory
As you can see from the project structure above, there are two YAML files in the root directory: dbt_project.yml and packages.yml.
dbt_project.yml
A dbt_project.yml file is how dbt knows a directory is a dbt project. It also contains important information that tells dbt how to operate on your project. — dbt
When you initialize a dbt project, a dbt_project.yml file is automatically created for you in the root directory of your project. The dbt_project.yml is a configuration file where you specify project-level details, such as the paths where the various components of your project will be found and how your models are materialized (view, table, etc.).
The template below shows the basic components of a dbt_project.yml file with default configurations. You can make changes in this file. For example, if you want your models folder to be named transform instead, you can simply rename the models folder to transform and point model-paths (a list) in your dbt_project.yml to transform instead of models. As a note, you can also configure how your models will be materialized here (view, table, etc.), but these settings can be overridden in the individual model files. You can check here for the other components that dbt_project.yml supports.
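A minimal sketch of such a template, assuming a project called my_dbt_project (the project and profile names are placeholders, and only a subset of the available keys is shown):

```yaml
# dbt_project.yml (sketch with common defaults; names are placeholders)
name: my_dbt_project
version: '1.0.0'
config-version: 2

# Which profile in profiles.yml dbt should use to connect to the warehouse
profile: my_dbt_project

# Paths where dbt looks for each type of resource
model-paths: ["models"]        # point this to ["transform"] if you rename the folder
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

# Default materializations; individual model files can override these
models:
  my_dbt_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
```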
You might also want to look at the name of your project and the profile configuration that you are going to use. We will discuss profiles in a later section.
packages.yml
The packages.yml file should be at the same level/location as your dbt_project.yml file. This file contains instructions for dbt to install packages or libraries. Packages are dbt projects that can be installed and added to your own dbt project. You specify the packages that you want to install under the packages key. You can check which packages are available to install on dbt Hub.
As a dbt user, by adding a package to your project, the package’s models and macros will become part of your own project. — dbt
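A short sketch of a packages.yml (the packages and version numbers are only examples; check dbt Hub for the current versions):

```yaml
# packages.yml (sketch; versions are illustrative)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.10.1
```

After adding entries here, running dbt deps downloads the packages into your project so that their models and macros can be used.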
Models Directory
In dbt, all of our SQL transformation files are located inside the models folder. The models folder may contain several subfolders; we do this so that we can structure our project, logic and data transformation process better. As an example, inside the models directory we can have a staging folder for cleaning data from its source and a marts folder for the reporting layer.
dbt recommends that each directory inside models has its own config (YAML) file. Each subfolder should have a dedicated YAML file that supports its data transformations (SQL files), for example by adding tests and documentation to the process. In general, there are two types of YAML file in the models directory: those ending with sources (__sources.yml) and those ending with models (__models.yml); these naming conventions are recommended by dbt.
__sources.yml
In dbt, sources are defined as the copies of the raw data inside your data warehouse that are yet to be cleansed/edited/transformed. We use __sources.yml not just to bring the raw data (from an underlying data warehouse such as BigQuery or Postgres) into dbt, but also to name, describe and test it. Since __sources.yml configures and works with the raw data itself, this YAML file should be placed inside the staging folder.
You specify your sources under the sources block in your __sources.yml file, along with your database, schema and table names. You can also give an alias to your data source using the name key. This allows you to easily reference the data source inside your SQL transformation files using the {{ source() }} function.
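A minimal sketch of such a file, assuming a hypothetical source called raw_shop living in a BigQuery project (all names are illustrative):

```yaml
# models/staging/__sources.yml (sketch; source, database, schema and table names are hypothetical)
version: 2

sources:
  - name: raw_shop            # the alias used in {{ source('raw_shop', ...) }}
    database: my-gcp-project  # the database/project that holds the raw data
    schema: shop_raw          # the schema/dataset where the raw tables live
    tables:
      - name: orders
      - name: customers
```

A staging model can then select from {{ source('raw_shop', 'orders') }} instead of hard-coding the database and schema names.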
In brief, there are four things that __sources.yml can do (a fuller sketch follows this list):
1. Import your raw data into dbt under the sources block.
2. Describe/document your sources under the description key. You can use YAML multiline strings here, depending on whether you want the description rendered on a single line or across multiple lines. You can later view this description/documentation in an IDE by running dbt docs generate.
3. Perform data quality tests on your sources under the tests block.
4. Calculate the freshness of your source data under the freshness block. A loaded_at_field should be provided to enable dbt to calculate the freshness of your tables.
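Putting these four together, an extended version of the earlier sketch might look like this (descriptions, tests and freshness thresholds are illustrative):

```yaml
# models/staging/__sources.yml (extended sketch; names and thresholds are illustrative)
version: 2

sources:
  - name: raw_shop
    database: my-gcp-project
    schema: shop_raw
    description: >
      Raw shop data loaded by the ingestion tool.
      A folded block scalar like this is rendered as a single line.
    tables:
      - name: orders
        description: One row per order placed in the shop.
        loaded_at_field: _loaded_at        # required for freshness calculations
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        columns:
          - name: order_id
            description: Primary key of the raw orders table.
            tests:
              - unique
              - not_null
```

With loaded_at_field in place, dbt source freshness reports whether the raw data is getting stale, and dbt docs generate picks up the descriptions for the documentation site.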
__models.yml
__models.yml serves much the same function as __sources.yml, but for the models you build. It should be located inside the intermediate and marts folders. You can use __models.yml to do some of these things (see the sketch after this list):
- Give your data models proper documentation.
- Add data quality tests to your data transformation process under the tests block.
- Set metadata for a resource under the meta key; this is compiled into the manifest.json file and is viewable in the auto-generated documentation.
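A sketch of what such a file might contain (the model, column and owner names are hypothetical):

```yaml
# models/marts/__models.yml (sketch; model and column names are hypothetical)
version: 2

models:
  - name: fct_orders
    description: One row per completed order, built on top of the staging layer.
    meta:
      owner: analytics_team       # arbitrary metadata, compiled into manifest.json
    columns:
      - name: order_id
        description: Primary key of the fact table.
        tests:
          - unique
          - not_null
      - name: customer_id
        description: Foreign key to dim_customers.
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```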
Other YAML Files
profiles.yml
In dbt, profiles.yml is a must-have (along with dbt_project.yml) for every dbt project that you run. profiles.yml holds and defines all the connection details to your data warehouse. It can define:
- Different connection details for different data warehouses.
- Multiple target schemas, such as separate schemas for your production and development environments.
Since profiles.yml holds sensitive information (credentials for database connections), it does not live inside the dbt project directory. Instead, it is stored in the ~/.dbt/ directory.
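A sketch of a profiles.yml with separate development and production targets, assuming a Postgres warehouse (all connection values are placeholders):

```yaml
# ~/.dbt/profiles.yml (sketch; profile name and connection values are placeholders)
my_dbt_project:                  # must match the `profile` key in dbt_project.yml
  target: dev                    # target used when none is given on the command line
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"   # keep credentials out of the file
      dbname: analytics
      schema: dbt_dev
      threads: 4
    prod:
      type: postgres
      host: prod-db.internal
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: analytics
      threads: 8
```

Running dbt run --target prod switches to the production output; otherwise the default target is used.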
_macros.yml
A macro is similar to a function in other programming languages such as Python and JavaScript. It lets you write reusable code (DRY: Don't Repeat Yourself). You create and define macros in SQL files.
And just like in other programming languages, in dbt we can also document what a macro does in a YAML file. The convention for naming the file is <whatever_you_want>_macros.yml. Here is an example:
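A sketch of such a file, documenting a hypothetical cents_to_dollars macro (the macro itself would live as a .sql file in the macros folder):

```yaml
# macros/_macros.yml (sketch; the macro name and argument are hypothetical)
version: 2

macros:
  - name: cents_to_dollars
    description: Converts an amount stored in integer cents into dollars.
    arguments:
      - name: column_name
        type: string
        description: The column holding the amount in cents.
```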
And that concludes what I wanted to share with you. I hope this article is useful in your journey to become a better dbt user and analytics engineer. If you have any feedback, please write it in the comments section. Let's get connected, and thank you for your time.