
An End-to-End DBT Tutorial for Testing

Have both Data Analysts and Data Engineers work in the same environment and transform the data like never before by reading our DBT Tutorial.


Traditionally, data is extracted from different source systems, transformed, and only then loaded into the data warehouse. With DBT, the raw data from the different source systems is extracted and loaded directly into the warehouse first. DBT then lets us create new materialized tables that hold the transformed data, surfacing meaningful data for business analysis through simple select statements. What makes DBT stand tall is that it allows both the development and production environment teams to work in the same place by creating models and testing the transformation scripts. Being one of the leading QA companies, we will be focusing mainly on the testing part of the above process in this DBT Tutorial by exploring how we test our models.

Data flow of DBT Tutorial

What is DBT?

If you’re new to the DBT tool, worry not, we have got some basics in this DBT Tutorial for you as well. The Data Build Tool (DBT) is an open-source command-line tool for transforming and testing data in the warehouse. It focuses on the transformation part of the “Extract, Load, and Transform” (ELT) pipeline and allows both data analysts and data engineers to build models and transform the data in their warehouses to make it meaningful.

New to DBT tool?

Let’s walk you through some of the basics to help you get a quick overview of DBT and the steps needed to create an account. Basically, there are two ways to use DBT:

1. DBT CLI (Command Line Interface):

The DBT CLI variant allows users to write their scripts and test them locally by using the DBT command-line tool.

2. DBT Cloud:

DBT Cloud is a hosted variant that streamlines development with a browser-based Integrated Development Environment (IDE), an interface where we can write our DBT test scripts and run them on a schedule.

In this DBT Tutorial, we will be covering the DBT Cloud variant. It is worth noting that the ideas and practices carry over to the CLI variant as well. So let’s get started.

Data warehouse:

As stated earlier, DBT handles the transformation part of ELT for data warehousing. “But how does it work?” you might ask. Well, DBT helps us create a connection to the warehouse and then lets us write simple SQL select statements against it to transform the data.
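
For instance, a transformation in DBT is nothing more than a select statement. Here is a minimal sketch of the idea (the database, table, and column names below are made up purely for illustration):

-- Hypothetical transformation: a plain select that cleans up a raw table.
-- DBT takes care of wrapping this in the DDL needed to build a table or view.
select
    order_id,
    customer_id,
    upper(status) as order_status
from RAW_DB.SHOP.ORDERS
where order_id is not null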

Supported Warehouses:

We have listed the warehouses supported by DBT below:

• Postgres

• Redshift

• BigQuery

• Snowflake

• Apache Spark

• Databricks

• Presto (partially supported)

We will be using the Snowflake warehouse in this blog for demonstration purposes.

DBT Tutorial for the Setup:

Now that we have seen an overview of DBT, let’s find out how to set up the DBT cloud and use its interface. If you’ve not yet created an account in DBT, you can sign up for DBT cloud by visiting their Sign up page.

Once you are done with your DBT cloud sign-up process, let’s find out how to set up your very first DBT project.

DBT recommends using GitHub to version-control our projects. So the first step would be to create an empty repository in GitHub as shown below.

Setting up DBT tutorial

Enter your repository name and click on ‘Create repository’.

Next, select the Snowflake warehouse and start configuring it using your Snowflake credentials. We have shown a sample connection in the below images for your reference.

Set up Database connection - DBT tutorial

Development Credentials

After entering your warehouse and database details, make sure you have entered your Snowflake username and password under Development Credentials. Next, click on ‘Continue’ and you should see the confirmation message “Connection Test Success”. Then click on ‘Continue’ once more to configure GitHub. Once the GitHub configuration is done, we are good to start with DBT.

DBT Tutorial for its Interface:

DBT has a lot of built-in options that make it easy to access and understand. After the setup, the DBT interface that we get to see initially will look like the image shown below.

DBT Interface

As you can see, DBT might initially look a bit empty and even show a compilation error at the bottom right. So we have to click on ‘Initialize your Project’ to start our first project. Doing this makes DBT automatically provide us with a skeleton for our project, as shown in the below image.

Starting of DBT tutorial

We can clearly see that the compilation message displayed at the bottom right has now changed to ‘ready’, implying that it is time to start our project. So let’s get started.

DBT Tutorial Database

First off, let’s look at the scope of some of the functions:

1. The commit button handles all our Git operations such as creating a new branch, raising a pull request, committing, and merging.

2. After writing any SQL statement, we can check the query then and there by using the ‘Preview Data’ and ‘Compile SQL’ buttons.

3. We can run and test our models by writing simple DBT commands (e.g. dbt debug, dbt run, dbt test).

4. DBT also generates documentation that comes in very handy for analytical purposes as it contains all the details about the project. You can access it by clicking on the “view docs” option as shown in the above image.

Note: Before running any model or test, make sure that you have saved it.

Loading training data into the Warehouse:

So as stated earlier, DBT Cloud currently supports only the BigQuery, Postgres, Redshift & Snowflake warehouses, and we have decided to use Snowflake for the demo. To illustrate how to load your training data into the warehouse, we will be creating and populating two tables.

Snowflake:

It is important to note that we have written the below instructions assuming that you have a warehouse named COMPUTE_WH in your Snowflake account. Also, make sure that you have the required privileges to create objects in the database we are about to create.

Run the following commands in a SQL worksheet in your Snowflake warehouse:

To create the Database:

Create database STUDENT;
Create schema STUDENT_INFO;
Use schema STUDENT_INFO;

To create the Tables:

Table 1: Student_Profile

Table 1 will contain the personal information of the students.

Create table student_profile (
  S_no int not null,
  Std_id int not null,
  Std_name varchar(45) null,
  Std_dep varchar(45) null,
  DOB datetime null,
  Primary key (Std_id));

insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('1', '294', 'Afshad', 'CSE', '03/10/1998');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('2', '232', 'Sreekanth', 'Mech', '02/09/1997');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('3', '276', 'John', 'EEE', '10/06/1998');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('4', '303', 'Rahul', 'ECE', '12/05/1997');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('5', '309', 'Sam', 'Civil', '09/04/1999');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('6', '345', 'Ram', 'CA', '03/11/1998');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('7', '625', 'Priya', 'CSE', '03/12/1996');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('8', '739', 'Bob', 'MEC', '06/07/1998');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('9', '344', 'Ganesh', 'Mech', '07/09/2024');
insert into student_profile (S_no, Std_id, Std_name, Std_dep, DOB) values ('10', '123', 'Neha', 'ECE', '12/09/1998');

Table 2: Student_Results

Table 2 will contain the results of those students.

Create table student_results (
  S_no int not null,
  Std_id int not null,
  Marks int null,
  Result varchar(45) null,
  Primary key (Std_id));

insert into student_results (S_no, Std_id, Marks, Result) values ('1', '294', '78', 'Pass');
insert into student_results (S_no, Std_id, Marks, Result) values ('2', '232', '56', 'Pass');
insert into student_results (S_no, Std_id, Marks, Result) values ('3', '276', '88', 'Pass');
insert into student_results (S_no, Std_id, Marks, Result) values ('4', '303', '67', 'Pass');
insert into student_results (S_no, Std_id, Marks, Result) values ('5', '309', '38', 'Fail');
insert into student_results (S_no, Std_id, Marks, Result) values ('6', '345', '90', 'Pass');
insert into student_results (S_no, Std_id, Marks, Result) values ('7', '625', '87', 'Pass');
insert into student_results (S_no, Std_id, Marks, Result) values ('8', '739', '45', 'Fail');
insert into student_results (S_no, Std_id, Marks, Result) values ('9', '344', '97', 'Pass');
insert into student_results (S_no, Std_id, Marks, Result) values ('10', '123', '49', 'Fail');
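
Before moving on, it is worth running a quick sanity check to confirm that both loads worked (this assumes the tables were created exactly as shown above):

-- Each of these counts should return 10.
select count(*) from STUDENT.STUDENT_INFO.STUDENT_PROFILE;
select count(*) from STUDENT.STUDENT_INFO.STUDENT_RESULTS;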

Now that you have the training data in your warehouse, you are one step away from starting your DBT testing.

DBT Tutorial for the available Models:

In analytics, modeling is the process of transforming data from its raw state into its final transformed form. Typically, data engineers are responsible for building tables that represent your source data, and on top of that, they also build the tables/views that transform the data step by step.

In your dbt project, models are just SQL select statements. They are created inside the models directory with a .sql extension.

What’s great about dbt is that you don’t need to know DDL/DML to build tables or views.

Note: The name of the file is used as the model or table name in your warehouse.

Creating Models:

Inside the model’s directory, we have created two simple models for better understanding.

Model's Directory - DBT Tutorial

Model 1: Student_profile.sql

{{ config(materialized='table') }}

with source_data as (
    select
        S_No,
        Std_Id,
        Std_Name,
        Std_Dep,
        DOB
    from STUDENT.STUDENT_INFO.STUDENT_PROFILE

)

select *
from source_data
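
Under the hood, because of the {{ config(materialized='table') }} line, dbt wraps this select in a create-table-as statement for us. A simplified sketch of what it executes is shown below (the target schema DBT_DEV is purely illustrative; dbt actually uses the schema from your development credentials):

-- Roughly the DDL dbt generates for a table materialization (schema name is illustrative).
create or replace table STUDENT.DBT_DEV.STUDENT_PROFILE as (
    select S_No, Std_Id, Std_Name, Std_Dep, DOB
    from STUDENT.STUDENT_INFO.STUDENT_PROFILE
);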

Model 2: Student_Results.sql

{{ config(materialized='table') }}

with source_data as (
    select
        S_No,
        Std_Id,
        Marks,
        Result
    from STUDENT.STUDENT_INFO.STUDENT_RESULTS

)

select *
from source_data
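
Models can also build on top of other models through the ref function. As a quick illustration (this third model is our own example and not part of the two above), a model that joins the profile and results tables could look like this:

Model 3 (example): Student_report.sql

{{ config(materialized='table') }}

-- Hypothetical example model: join the two models above using ref()
select
    p.Std_Id,
    p.Std_Name,
    r.Marks,
    r.Result
from {{ ref('student_profile') }} p
join {{ ref('student_results') }} r
  on p.Std_Id = r.Std_Id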

Once you’ve built your models, make sure to save and run them to load them into the warehouse. You can use the ‘dbt run’ command to run all your created models, or if you want to run one particular model, you can use a model selector, e.g. ‘dbt run --models student_profile’.

Successful Completion of DBT function

DBT will show detailed information about the models when you run them, as shown in the above image. So we have successfully completed all the required steps and can move forward to find out how to perform testing in DBT.

DBT Tutorial to perform Testing:

We all know that testing plays a major role in deployment, and tests in analytics are just assertions that you have about your data. What makes these assertions important is that when they hold, they help others trust the data.

The great part of DBT testing is that, by default, it provides four data quality checks. They are the Unique, Not Null, Relationships (Referential Integrity), and Accepted Values checks.

Basically, DBT provides us with two types of testing. So let’s take a look at the steps to be followed to perform the first type, the Schema test.

1. Schema Test:

Create a YAML file in the same models directory with a .yml extension (e.g. Schema.yml).

Configure your tests (Unique, Not Null, Relationships, and Accepted Values) in the YAML file based on your table data and run them by using the command “dbt test”.

Note: YAML is a human-readable data serialization language that can be easily understood even by non-technical people. It is mainly used for configuration purposes, and parsers for it are available in most programming languages.

We have attached one YAML file for your reference below.

Schema.yml:

version: 2

models:
    - name: student_profile
      description: "Table contains all the information about the students"
      columns:
          - name: std_id
            description: "The primary key for this table"
            tests:
                - unique
                - not_null

    - name: student_results
      description: "Table contains the student results info"
      columns:
          - name: std_id
            description: "The primary key for this table"
            tests:
                - unique
                - not_null
                - relationships:
                    to: ref('student_profile')
                    field: std_id
          - name: result
            description: "Result of a student"
            tests:      
                - accepted_values:
                    values:
                      - Pass
                      - Fail

Once you run this test, you will get detailed information on all your tests. If you need to run the schema tests of one particular model only, you can use the command “dbt test -m student_results”.

2. Data Test:

If you are looking to write custom test scripts against the data based on your requirements, that can also be done by using tests of this type.

Start by creating the SQL files with your test name inside the test directory of the DBT project as shown below.

Test Directory

Now, add your test script to your test file.

Duplicate.sql:

select std_id, count(std_id)
  from {{ ref('student_profile') }}
  group by std_id
  having count(std_id) > 1

referential_integrity.sql:

select std_id 
from {{ref('student_profile')}} 
where std_id not in (
    select std_id 
    from {{ref('student_results')}} )

Note:

Here, the ref function is used to reference one or more models in the script; dbt resolves each ref to the corresponding table in the warehouse and uses these references to build the dependency graph. Also, keep in mind that a data test passes when its query returns zero rows and fails if any rows come back.
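
When dbt compiles these tests, each ref call resolves to the fully qualified table built for that model. For example, Duplicate.sql compiles to roughly the following (the DBT_DEV schema is illustrative; dbt substitutes your actual target schema):

-- Roughly the compiled form of Duplicate.sql
select std_id, count(std_id)
  from STUDENT.DBT_DEV.STUDENT_PROFILE
  group by std_id
  having count(std_id) > 1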

Finally, you can run these tests by using the “dbt test --data” command.

Running the test by DBT command

DBT Tutorial for Documentation:

As discussed earlier, documentation is one of the greatest features of DBT. So once your project is completed, all you have to do is use the “dbt docs generate” command to generate the documentation.

DBT Documentation

After running the command, just click on “view docs” at the top left of the DBT interface to view the complete information about your project in the form of a document. It even generates a lineage graph like the one shown below for a crystal clear understanding.

Lineage Graph

DBT also supports filters in the lineage graph, which is a great advantage when it comes to analysis.

Conclusion:

We hope that you have enjoyed reading this DBT Tutorial blog and that DBT will be a great addition to your toolkit. To sum things up, DBT helps carry out the most complex and heavy transformations in a simple and easy manner by using simple select statements. It enables both data engineers and data analysts to work in the same environment and provides a unique experience for them to transform the data. The models in DBT run fast and are easy to test and modify. Finally, it even provides detailed information about the project by generating documentation along with a lineage graph for clear analysis. All thanks to such awesome features, DBT has been a resourceful tool that has helped us deliver the best software testing services to our clients.

Frequently Asked Questions

  • What is dbt tool used for?

DBT (Data Build Tool) is used to perform the Transform part of the ELT (Extract, Load, Transform) process in a simplified manner by writing transformations as select queries and orchestrating them effectively.

  • What is DBT testing?

DBT testing is the process of ensuring that the data you have is reliable by turning the assertions you have about the data into tests. There are two types of dbt testing, namely the Schema test and the Data test.

  • What are the four generic tests that dbt ships with?

The four generic tests that dbt ships with are the Unique, Not Null, Relationships (Referential Integrity), and Accepted Values checks.

  • What does dbt stand for data?

    DBT stands for Data Build Tool and as the name suggests, dbt can be used to transform data in warehouses in an effective manner.

  • Can I use dbt for free?

Yes, you can use dbt Core for free as it is an open-source tool released under an Apache 2.0 License. But dbt Cloud is not a completely free tool, as some features are restricted in the free version.

