Configuration Files in dbt
As mentioned in the previous article, YAML is one of the languages that make dbt powerful. It is used for a variety of purposes, such as configuring, documenting and testing our data transformations. So what are the things that YAML can configure in dbt?
To answer this question, let me first show you the recommended structure of a dbt project. From the structure below, you can see several YAML files, located either in the root directory or inside the models directory.
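A simplified sketch of such a layout, following common dbt conventions (the project name, subfolders and model file names are illustrative, not taken from the original article):

```
my_dbt_project/
├── dbt_project.yml
├── packages.yml
├── models/
│   ├── staging/
│   │   ├── __sources.yml
│   │   └── stg_orders.sql
│   ├── intermediate/
│   │   ├── __models.yml
│   │   └── int_orders_joined.sql
│   └── marts/
│       ├── __models.yml
│       └── fct_orders.sql
├── macros/
│   └── _macros.yml
├── seeds/
├── snapshots/
└── tests/
```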
Root Directory
As you can see from the project structure above, there are two YAML files in the root directory: dbt_project.yml and packages.yml.
dbt_project.yml
A dbt_project.yml file is how dbt knows a directory is a dbt project. It also contains important information that tells dbt how to operate on your project. — dbt
When you initialize a dbt project, a dbt_project.yml file is automatically created for you in the root directory of your project. The dbt_project.yml is a configuration file where you specify project-level details, such as the paths where the various components of your project will be found and how your models are materialized (view, table, etc.).
The template below shows the basic components of a dbt_project.yml file with default configurations. You can make changes in this file. For example, if you want your models folder to be named transform instead, you can simply rename the models folder to transform and point model-paths (a list) in your dbt_project.yml to transform instead of models. As a note, you can also configure how your models will be materialized here (view, table, etc.), but these settings can be overridden in the individual model files. You can check here for the other components that dbt_project.yml supports.
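A minimal sketch of such a template, assuming a project called my_dbt_project (the project and profile names are placeholders, and only a subset of the available keys is shown):

```yaml
# dbt_project.yml (sketch with common defaults; names are placeholders)
name: my_dbt_project
version: '1.0.0'
config-version: 2

# Which profile in profiles.yml dbt should use to connect to the warehouse
profile: my_dbt_project

# Paths where dbt looks for each type of resource
model-paths: ["models"]        # point this to ["transform"] if you rename the folder
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

# Default materializations; individual model files can override these
models:
  my_dbt_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
```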
You might also want to look at the name of your project and the profile configuration that you are going to use. We will discuss profiles in a later section.
packages.yml
The packages.yml file should be at the same level/location as your dbt_project.yml file. This file contains instructions for dbt to install packages or libraries. Packages are dbt projects that can be installed and added to your own dbt project. You specify the packages that you want to install under the packages key. You can check which packages are available to install on dbt Hub.
As a dbt user, by adding a package to your project, the package’s models and macros will become part of your own project. — dbt
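A short sketch of a packages.yml (the packages and version numbers are only examples; check dbt Hub for the current versions):

```yaml
# packages.yml (sketch; versions are illustrative)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.10.1
```

After adding entries here, running dbt deps downloads the packages into your project so that their models and macros can be used.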
Models Directory
In dbt, all of our SQL transformation files are located inside the models folder. The models folder may contain several subfolders; we do this so that we can structure our project, logic and data transformation process better. As an example, inside the models directory we can have a staging folder for cleaning data from its source and a marts folder for the reporting layer.
dbt recommends that each directory inside models has its own config (YAML) file. Each subfolder should have a dedicated YAML file that supports its data transformations (SQL files), for example by adding tests and documentation to the process. In general, there are two types of YAML file in the models directory: those ending with sources (__sources.yml) and those ending with models (__models.yml); these naming conventions are recommended by dbt.
__sources.yml
In dbt, sources are defined as the copies of the raw data inside your data warehouse that are yet to be cleansed/edited/transformed. We use __sources.yml not just to bring the raw data (from an underlying data warehouse such as BigQuery or Postgres) into dbt, but also to name, describe and test it. Since __sources.yml configures and works with the raw data itself, this YAML file should be placed inside the staging folder.
You specify your sources under the sources block in your __sources.yml file, along with your database, schema and table names. You can also give an alias to your data source using the name key. This allows you to easily reference the data source inside your SQL transformation files using the {{ source() }} function.
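A minimal sketch of such a file, assuming a hypothetical source called raw_shop living in a BigQuery project (all names are illustrative):

```yaml
# models/staging/__sources.yml (sketch; source, database, schema and table names are hypothetical)
version: 2

sources:
  - name: raw_shop            # the alias used in {{ source('raw_shop', ...) }}
    database: my-gcp-project  # the database/project that holds the raw data
    schema: shop_raw          # the schema/dataset where the raw tables live
    tables:
      - name: orders
      - name: customers
```

A staging model can then select from {{ source('raw_shop', 'orders') }} instead of hard-coding the database and schema names.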
In brief, there are four things that __sources.yml can do (a fuller sketch follows this list):
1. Import your raw data into dbt under the sources block.
2. Describe/document your sources under the description key. You can use YAML multiline strings here, depending on whether you want the description rendered on a single line or across multiple lines. You can later view this description/documentation in an IDE by running dbt docs generate.
3. Perform data quality tests on your sources under the tests block.
4. Calculate the freshness of your source data under the freshness block. A loaded_at_field should be provided to enable dbt to calculate the freshness of your tables.
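Putting these four together, an extended version of the earlier sketch might look like this (descriptions, tests and freshness thresholds are illustrative):

```yaml
# models/staging/__sources.yml (extended sketch; names and thresholds are illustrative)
version: 2

sources:
  - name: raw_shop
    database: my-gcp-project
    schema: shop_raw
    description: >
      Raw shop data loaded by the ingestion tool.
      A folded block scalar like this is rendered as a single line.
    tables:
      - name: orders
        description: One row per order placed in the shop.
        loaded_at_field: _loaded_at        # required for freshness calculations
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        columns:
          - name: order_id
            description: Primary key of the raw orders table.
            tests:
              - unique
              - not_null
```

With loaded_at_field in place, dbt source freshness reports whether the raw data is getting stale, and dbt docs generate picks up the descriptions for the documentation site.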
__models.yml
__models.yml serves much the same function as __sources.yml, but for the models you build. It should be located inside the intermediate and marts folders. You can use __models.yml to do some of these things (see the sketch after this list):
- Give your data models proper documentation.
- Add data quality tests to your data transformation process under the tests block.
- Set metadata for a resource under the meta key; this is compiled into the manifest.json file and is viewable in the auto-generated documentation.
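A sketch of what such a file might contain (the model, column and owner names are hypothetical):

```yaml
# models/marts/__models.yml (sketch; model and column names are hypothetical)
version: 2

models:
  - name: fct_orders
    description: One row per completed order, built on top of the staging layer.
    meta:
      owner: analytics_team       # arbitrary metadata, compiled into manifest.json
    columns:
      - name: order_id
        description: Primary key of the fact table.
        tests:
          - unique
          - not_null
      - name: customer_id
        description: Foreign key to dim_customers.
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```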
Other YAML Files
profiles.yml
In dbt, profiles.yml is a must-have (along with dbt_project.yml) for every dbt project that you run. profiles.yml holds and defines all the connection details to your data warehouse. It can define:
- Different connection details for different data warehouses.
- Multiple target schemas, such as separate schemas for your production and development environments.
Since profiles.yml holds sensitive information (credentials for database connections), it does not live inside the dbt project directory. Instead, it is stored in the ~/.dbt/ directory.
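A sketch of a profiles.yml with separate development and production targets, assuming a Postgres warehouse (all connection values are placeholders):

```yaml
# ~/.dbt/profiles.yml (sketch; profile name and connection values are placeholders)
my_dbt_project:                  # must match the `profile` key in dbt_project.yml
  target: dev                    # target used when none is given on the command line
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"   # keep credentials out of the file
      dbname: analytics
      schema: dbt_dev
      threads: 4
    prod:
      type: postgres
      host: prod-db.internal
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: analytics
      threads: 8
```

Running dbt run --target prod switches to the production output; otherwise the default target is used.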
_macros.yml
A macro is similar to a function in other programming languages such as Python and JavaScript. It lets you write reusable code (DRY: Don't Repeat Yourself). You create and define macros in SQL files.
And just like in other programming languages, in dbt we can also document what a macro does in a YAML file. The convention for naming the file is <whatever_you_want>_macros.yml. Here is an example:
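A sketch of such a file, documenting a hypothetical cents_to_dollars macro (the macro itself would live as a .sql file in the macros folder):

```yaml
# macros/_macros.yml (sketch; the macro name and argument are hypothetical)
version: 2

macros:
  - name: cents_to_dollars
    description: Converts an amount stored in integer cents into dollars.
    arguments:
      - name: column_name
        type: string
        description: The column holding the amount in cents.
```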
And that concludes what I wanted to share with you. I hope this article is useful in your journey to become a better dbt user and analytics engineer. If you have any feedback, please write it in the comments section. Let's get connected, and thank you for your time.