Thursday, March 26, 2026

dbt internal components: dbt model using SQL or Python

This diagram presents a high-level architectural blueprint of how dbt Core acts as the orchestration and transformation layer between raw source data and a formalized data warehouse. Although the diagram makes it look like data "flows through" dbt, the fundamental takeaway for an architect is that dbt handles only the logic and orchestration (the control plane); the data processing (the data plane) is pushed down entirely to the target warehouse.

1. Source Data (Raw)

  • The Ingestion Layer: This represents the raw, "bronze-layer" data that has been landed in your cloud data warehouse (e.g., Snowflake, BigQuery) by your ELT tools (like Fivetran, Airbyte, or custom Python scripts).

  • SME Insight: From a dbt perspective, these are entirely external tables that dbt does not manage. We only interact with them through {{ source() }} macros, which allow us to define their location and properties (like freshness) in YAML.

  • Architectural Role: It is the foundational substrate for all downstream transformation. The data must be accessible by the dbt adapter’s credentials.
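A minimal sketch of how such sources are declared in YAML — the source, schema, and table names here are hypothetical, and the freshness thresholds are illustrative:

```yaml
# models/staging/sources.yml -- illustrative example; names are hypothetical
version: 2

sources:
  - name: raw_shop
    database: raw_db
    schema: shopify
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: customers
      - name: orders
```

A model can then read from these tables with `{{ source('raw_shop', 'orders') }}`, and `dbt source freshness` checks the declared thresholds against `_loaded_at`.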

2. dbt Project & Model Definitions (.sql & .py)

  • The Blueprint: This is the developer's workspace—the codebase managed in Git. It contains all the instructions for how the data should be modeled.

  • SQL Model: The primary declarative method. It uses SELECT statements combined with Jinja templating to create modular, reusable logic. A config() macro at the top defines how the database object will be materialized (as a table, view, etc.).

  • Python Model (SME Note): Introduced more recently, these models allow data teams to leverage Python libraries (like pandas or scikit-learn) for operations that are impossible or highly complex in SQL (like machine learning inference or advanced statistical transformations). They are typically executed in the target warehouse's native Python environment (e.g., Snowflake Snowpark, or Dataproc/Serverless Spark for BigQuery).

  • Principal Engineer View: The dbt project is where the principles of software engineering (version control, modularity, DRY code) are applied to data engineering.
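A minimal sketch of such a SQL model, assuming hypothetical upstream staging models stg_customers and stg_orders:

```sql
-- models/marts/dim_customers.sql -- illustrative; model names are hypothetical
{{ config(materialized='table') }}

with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select customer_id, count(*) as order_count
    from {{ ref('stg_orders') }}
    group by customer_id
)

select
    customers.customer_id,
    customers.customer_name,
    coalesce(orders.order_count, 0) as order_count
from customers
left join orders using (customer_id)
```

Note that the model itself is just a SELECT; the surrounding DDL is supplied later by the materialization.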

3. Manifest/Parser

  • The Project Cataloger: The Manifest/Parser is the "brain" that reads the entire project. It scans every .sql, .py, and .yml file in your repository to build an in-memory inventory of your models, tests, seeds, sources, and snapshots.

  • Dependency Resolution (DAG): Its critical output is the creation of the Directed Acyclic Graph (DAG). By reading the {{ ref() }} and {{ source() }} macros, it determines the order in which models must be built. If model_B refs model_A, it knows to build model_A first.

  • Validation: The parser validates the syntax, project structure, and config settings to ensure the project is "runnable." It’s the gatekeeper. If the parser finds an error, dbt stops.
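The dependency resolution described above is essentially a topological sort over the ref() graph. A sketch using Python's stdlib graphlib (model names and dependencies are hypothetical, not dbt's actual parser code):

```python
# Sketch of DAG ordering via topological sort -- illustrative only.
from graphlib import TopologicalSorter

# Map each model to the models it ref()s (its upstream dependencies).
deps = {
    "stg_customers": set(),        # built from a source, no model deps
    "stg_orders": set(),
    "dim_customers": {"stg_customers"},
    "fct_orders": {"stg_orders", "dim_customers"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # every model appears after all of its dependencies
```

Because fct_orders refs dim_customers, which refs stg_customers, the sort guarantees each model is built only after everything it depends on.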

4. Jinja Renderer

  • The Dynamic Engine: The parser identified where {{ ref() }} was used, but the Jinja Renderer is what actually compiles it. This is where the magic of "declarative SQL" comes from.

  • Environment Context: The renderer replaces abstract references like {{ ref('customers') }} with the fully qualified, environment-specific database paths (e.g., prod_db.core.customers in production, or dev_db.alice_schema.customers in development). This single feature makes promoting the same code between environments seamless.

  • Logic Execution: It processes {% if %}, {% for %}, and custom macros to dynamically generate different SQL or configurations based on input variables or environment settings.
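A heavily simplified simulation of the ref() substitution — dbt's real renderer uses full Jinja with a rich context, whereas this sketch uses a bare regex just to show the outcome:

```python
# Simplified simulation of ref() compilation -- NOT dbt's actual renderer.
import re

def render_refs(sql: str, database: str, schema: str) -> str:
    """Replace {{ ref('model') }} with a fully qualified relation name."""
    pattern = r"\{\{\s*ref\(\s*'(\w+)'\s*\)\s*\}\}"
    return re.sub(pattern, lambda m: f"{database}.{schema}.{m.group(1)}", sql)

raw = "select * from {{ ref('customers') }}"
print(render_refs(raw, "prod_db", "core"))         # select * from prod_db.core.customers
print(render_refs(raw, "dev_db", "alice_schema"))  # select * from dev_db.alice_schema.customers
```

The same model file yields different compiled SQL per environment, which is exactly why dev and prod can share one codebase.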

5. Compilation Engine

  • The Transpiler: Now that Jinja has replaced the abstract terms, the Compilation Engine takes over. Its job is to generate the final, pure-dialect SQL that is ready to be executed.

  • Wrapping the Logic: This engine takes your SELECT query and wraps it in the necessary DDL/DML, depending on the materialization setting. For a materialized='table', it will wrap your SELECT in a CREATE OR REPLACE TABLE AS.... For an incremental model, it will generate a complex MERGE or INSERT statement based on the logic you defined.

  • Adapter Awareness: The compilation engine uses the Target DB Adapters (see below) to know exactly what SQL dialect (Snowflake, BigQuery, Postgres) to generate.
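The wrapping step can be sketched as follows — in real dbt, materializations are Jinja macros shipped with each adapter, so this function is only an illustration of the idea:

```python
# Sketch of materialization wrapping -- real dbt materializations are
# adapter-supplied Jinja macros; this only illustrates the wrapping.
def wrap_select(relation: str, select_sql: str, materialized: str = "table") -> str:
    if materialized == "table":
        return f"create or replace table {relation} as (\n{select_sql}\n)"
    if materialized == "view":
        return f"create or replace view {relation} as (\n{select_sql}\n)"
    raise ValueError(f"unsupported materialization: {materialized}")

ddl = wrap_select("prod_db.core.dim_customers",
                  "select * from prod_db.core.stg_customers")
print(ddl.splitlines()[0])  # create or replace table prod_db.core.dim_customers as (
```

Incremental models follow the same pattern but emit MERGE/INSERT logic instead of a full table rebuild.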

6. Artifacts Generator

  • The Persistent Memory: As dbt runs, it saves its state and knowledge into two main files:

    • manifest.json: A full, JSON-readable dump of everything the Manifest/Parser found. This is dbt’s "project-at-rest" state. It is used downstream by other tools and for state-based execution (e.g., the state:modified selector), allowing dbt to build only the models that have changed since the last run.

    • catalog.json: A JSON file generated by the dbt docs generate command. It contains the schema, column definitions, and data types from the target warehouse for the models and sources dbt manages.

  • Metadata Foundation: These artifacts are essential for dbt’s automated documentation website and for integration with modern data catalogs.
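To make the artifact concrete, here is how downstream tooling can read dependencies from a (heavily simplified) excerpt of manifest.json — node keys follow dbt's "model.&lt;project&gt;.&lt;name&gt;" convention, and the project name here is hypothetical:

```python
# Reading model dependencies from a simplified manifest.json excerpt.
import json

manifest_json = """
{
  "nodes": {
    "model.shop.dim_customers": {
      "depends_on": {"nodes": ["model.shop.stg_customers"]}
    },
    "model.shop.stg_customers": {
      "depends_on": {"nodes": ["source.shop.raw_shop.customers"]}
    }
  }
}
"""

manifest = json.loads(manifest_json)
for unique_id, node in manifest["nodes"].items():
    print(unique_id, "<-", node["depends_on"]["nodes"])
```

This is the same structure data catalogs and lineage tools consume to reconstruct the DAG without re-parsing the project.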

7. Execution Engine (dbt-core)

  • The Conductor: The Execution Engine is the final orchestrator. It doesn't perform data work itself, but it knows when and where that work should happen.

  • Job Management: It takes the generated SQL/Python from the Compilation Engine and dispatches it to the cloud data warehouse via the established database connection. It manages the threads (concurrency), sending parallel queries to the warehouse when the DAG allows.

  • SME Insight: The Execution Engine handles the lifecycle of the job. It establishes the connection, starts the query, waits for the result, records performance metrics (start time, end time, status), and handles the commit/rollback logic on completion or error. It is the component that makes dbt a full-fledged deployment coordinator.
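The DAG-aware, threaded dispatch can be sketched with stdlib primitives — a simplification of what dbt's execution engine does; run_model stands in for sending compiled SQL to the warehouse, and the model names are hypothetical:

```python
# Sketch of DAG-aware parallel dispatch -- a simplification, not dbt's code.
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

deps = {
    "stg_customers": set(),
    "stg_orders": set(),
    "dim_customers": {"stg_customers"},
    "fct_orders": {"stg_orders", "dim_customers"},
}

completed = []

def run_model(name: str) -> str:
    completed.append(name)  # here dbt would dispatch SQL and await the result
    return name

sorter = TopologicalSorter(deps)
sorter.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:  # analogous to dbt's "threads"
    while sorter.is_active():
        ready = sorter.get_ready()       # all models whose deps are done
        for name in pool.map(run_model, ready):
            sorter.done(name)            # unblock downstream models

print(completed)
```

Both staging models run in parallel; dim_customers waits for stg_customers, and fct_orders waits for everything upstream — exactly the concurrency behavior the thread setting controls.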

8. Target DB Adapters (Snowflake, BigQuery, etc.)

  • The Translation Layer: This is the abstraction layer that allows dbt Core to remain dialect-agnostic. The adapters (e.g., dbt-snowflake, dbt-bigquery) translate abstract dbt commands into the specific API calls and SQL dialects required by that database.

  • Database-Specific Operations: The adapter knows that a table creation statement in Snowflake differs from one in BigQuery. It handles the nuances of connection authentication, data types, and transactional control.
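The abstraction can be sketched as a base class with dialect-specific overrides — this is not dbt's real adapter API, just an illustration of how one interface can emit different DDL per warehouse:

```python
# Sketch of the adapter abstraction -- illustrative, not dbt's adapter API.
class BaseAdapter:
    def create_table_as(self, relation: str, select_sql: str) -> str:
        return f"create table {relation} as ({select_sql})"

class SnowflakeAdapter(BaseAdapter):
    def create_table_as(self, relation, select_sql):
        # Snowflake supports atomic replacement
        return f"create or replace table {relation} as ({select_sql})"

class BigQueryAdapter(BaseAdapter):
    def create_table_as(self, relation, select_sql):
        # BigQuery quotes relations with backticks
        return f"create or replace table `{relation}` as ({select_sql})"

for adapter in (SnowflakeAdapter(), BigQueryAdapter()):
    print(adapter.create_table_as("core.dim_customers", "select 1 as id"))
```

dbt Core calls the one interface; the installed adapter package decides what SQL actually leaves the process.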

9. Transformed Data Models & Transformation Execution

  • The Destination: This is the finalized, modeled data that is now ready for analytics (the "gold-layer").

  • Pushed-Down Execution: This is the single most critical concept. All data movement and processing occur entirely inside the Target Cloud Data Warehouse.

    • SQL models are executed as CREATE TABLE AS... or INSERT/MERGE SQL commands directly against the warehouse engine.

    • Python models are packaged and sent to the warehouse’s internal runtime (e.g., as Snowpark stored procedures in Snowflake), where they execute in specialized, serverless compute close to the data.

  • Output: The result is high-quality, trusted tables such as dim_customers and fct_orders, query-ready for your BI tools.
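Tying it together, what actually arrives at the warehouse for a table-materialized model is plain dialect SQL like the following (names illustrative; exact DDL varies by adapter):

```sql
-- Pushed-down execution: dbt sends this, the warehouse does the work.
create or replace table prod_db.core.dim_customers as (
    select customer_id, customer_name
    from prod_db.core.stg_customers
);
```

No rows ever pass through the machine running dbt; it only issues the statement and waits for the warehouse to report success.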
