Wednesday, March 25, 2026

End-to-End dbt ecosystem

This diagram visualizes the complete end-to-end dbt ecosystem incorporating all the technologies we discussed (Snowflake, Databricks, Fabric, Python models, Airflow, and advanced CI/CD).

The flow moves from left (Sources) to right (Marts and Delivery). Here is a detailed breakdown of each architectural stage and the data flow.

Data Sources & The Staging Layer (The DLH/W)

This is the intake and initial storage area. Raw data arrives from operational systems via Airflow-orchestrated ingestion tasks (using Fivetran/Airbyte) and lands in the unified Data Lakehouse/Warehouse (marked as a large cube supporting Snowflake Horizon, Databricks Unity, and MS Fabric OneLake).

The diagram illustrates two critical concepts we’ve covered:

  • Layered Storage: The warehouse is partitioned into a Bronze/Raw zone (raw data) and a Silver/Staging zone (initial dbt stg_ clean-up models).

  • The Codegen Loop: A specialized codegen operation is shown running outside the main DAG execution. It queries the data warehouse INFORMATION_SCHEMA (the process we visualized in image_0.png) and automatically generates the YAML source definitions and SQL base models. This workflow accelerates development time and reduces manual errors when onboarding new datasets.
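To make the codegen loop concrete, here is a minimal plain-Python sketch of the kind of output that step produces. The schema and table names are hypothetical, and in the real workflow the dbt-codegen package renders this YAML from a live INFORMATION_SCHEMA query inside the warehouse; this stand-alone version only illustrates the shape of the generated source definitions.

```python
# Hypothetical sketch: given column metadata (as it might come back from
# INFORMATION_SCHEMA.COLUMNS), emit a minimal dbt "sources:" YAML block.
def generate_source_yaml(schema: str, tables: dict[str, list[str]]) -> str:
    """Render a minimal dbt sources YAML block for the given schema."""
    lines = ["version: 2", "", "sources:", f"  - name: {schema}", "    tables:"]
    for table, columns in tables.items():
        lines.append(f"      - name: {table}")
        lines.append("        columns:")
        for col in columns:
            lines.append(f"          - name: {col}")
    return "\n".join(lines)

# Example metadata for two hypothetical raw tables in the Bronze zone
raw_tables = {"ad_spend": ["campaign_id", "spend_usd"],
              "clicks": ["campaign_id", "clicked_at"]}
print(generate_source_yaml("bronze_raw", raw_tables))
```

The emitted YAML is what a developer would otherwise hand-write per table, which is exactly the manual, error-prone step the codegen loop removes.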

The Transformation & Orchestration Layer (The Core DAG)

This is the heart of the modern dbt solution. Execution is segmented into the internal steps that make production runs resilient.

The Internal dbt Lifecycle:

  1. Model Parsing: dbt-core (or dbt Cloud) builds the dependency graph (DAG) by resolving the ref() Jinja macros in every SQL model.

  2. Compile & Manifest: The manifest.json is generated, which is the artifact required by Airflow Cosmos for dynamic task generation.

  3. Model Execution (dbt build): The DAG is executed against the warehouse. The diagram explicitly shows polyglot engineering (SQL mixed with Python):

    • SQL Path: The execution of standard relational models (int_campaign_performance flowing into fct_ad_performance) which are pushed down to the warehouse engine.

    • Python Path: The execution of a complex model (py_ml_customer_segmentation). This triggers a distinct Snowpark Python call (in Snowflake) or a PySpark call (in Databricks/Fabric), computing the segmentation statistics and materializing the result back into the warehouse.

  4. Test & Governance: Once built, data quality assertions are run (dbt-expectations) and YAML constraints (Schema Contracts) are enforced.
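As a concrete illustration of the Python path, here is a plain-Python stand-in for the kind of statistic a model like py_ml_customer_segmentation might compute. The quartile-based spend tiers below are an assumed, illustrative rule; in the real model this logic would run as a Snowpark or PySpark DataFrame transform rather than over a local dict.

```python
# Illustrative segmentation rule (hypothetical): bucket customers into
# low/mid/high tiers by spend quartile. A real dbt Python model would
# express the same idea as a Snowpark/PySpark transform pushed down to
# the warehouse engine.
from statistics import quantiles

def segment_customers(spend_by_customer: dict[str, float]) -> dict[str, str]:
    """Assign each customer a tier based on where their spend falls."""
    cuts = quantiles(spend_by_customer.values(), n=4)  # Q1, Q2, Q3

    def tier(spend: float) -> str:
        if spend >= cuts[2]:
            return "high"
        if spend >= cuts[0]:
            return "mid"
        return "low"

    return {cust: tier(spend) for cust, spend in spend_by_customer.items()}

print(segment_customers({"a": 10.0, "b": 250.0, "c": 40.0, "d": 900.0}))
```

The point of running this inside dbt rather than in a separate notebook is that the segmentation table participates in the same DAG, tests, and lineage as every SQL model.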

Orchestration & Governance Sync:

  • Airflow DAG: The Airflow orchestration (via the Astronomer Cosmos provider) maps directly onto the dbt steps: Parse -> Compile -> Execute (Dynamic Task).

  • Catalog Sync: Successful dbt builds are synchronized with the respective warehouse governance tools, such as Databricks Unity Catalog, ensuring fine-grained access control is applied to the newly created data marts.
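The Parse -> Compile -> Execute mapping ultimately rests on the node graph inside manifest.json. The sketch below uses the stdlib to show that core idea: read each node's depends_on list out of a manifest-shaped dict and produce an execution order that respects ref() dependencies. The node names are hypothetical and the manifest is trimmed to the one field the sketch needs; a real manifest.json carries far more metadata per node.

```python
# Minimal sketch of the idea behind Cosmos-style dynamic task generation:
# turn a manifest.json-shaped node graph into a dependency-respecting
# execution order. Model names below are hypothetical.
from graphlib import TopologicalSorter

manifest = {  # trimmed: a real manifest has many more fields per node
    "nodes": {
        "model.proj.stg_ads": {"depends_on": {"nodes": []}},
        "model.proj.int_campaign_performance":
            {"depends_on": {"nodes": ["model.proj.stg_ads"]}},
        "model.proj.fct_ad_performance":
            {"depends_on": {"nodes": ["model.proj.int_campaign_performance"]}},
    }
}

# Map each model to its upstream dependencies, then order the graph so
# that every model runs after everything it ref()s.
graph = {name: node["depends_on"]["nodes"]
         for name, node in manifest["nodes"].items()}
order = list(TopologicalSorter(graph).static_order())
print(order)  # upstream models come before their dependents
```

Cosmos does essentially this at DAG-parse time, except each ordered node becomes an individual Airflow task, giving per-model retries and observability.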

Production & Delivery

This column represents the finalized, production-grade state. The built, tested, and governed models are available in the Gold/Marts presentation schema. The data is now available to consumption engines for different business use cases: BI Dashboards, ML models (for inference), and Operational Sync (Reverse ETL to push data back to Salesforce, etc.).

Advanced Platform Engineering: CI/CD Pipeline (Bottom Right)

This flow details how a Principal Engineer ensures data platform stability:

  1. Commit Trigger: A developer pushes code (the step referenced in image_1.png).

  2. Slim CI Build: The pipeline runs only the modified models and their downstream dependents (the state:modified+ selector). It runs dbt build with the --defer and --state flags to "defer" to the production schema for unmodified upstream tables, saving execution cost.

  3. Blue/Green Deployment: For the final release, a Zero-Copy Clone (Snowflake-specific) provides an instant, storage-free copy of production; the staging and production environments are then swapped atomically, enabling zero-downtime releases and safe rollbacks.
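The Slim CI selection in step 2 can be sketched in plain Python: diff the model checksums between two manifest snapshots, then walk the lineage to pick up everything downstream of a change. The model names and checksums below are hypothetical stand-ins; real dbt computes this from the production and branch manifest.json artifacts.

```python
# Illustrative stand-in for dbt's state-based selection (state:modified+):
# select models whose checksum changed versus production, plus all of
# their transitive downstream dependents. Names/checksums are made up.
def modified_plus(prod: dict[str, str], branch: dict[str, str],
                  children: dict[str, list[str]]) -> set[str]:
    """Return changed models and every model downstream of a change."""
    changed = {m for m, checksum in branch.items() if prod.get(m) != checksum}
    selected, stack = set(), list(changed)
    while stack:
        model = stack.pop()
        if model not in selected:
            selected.add(model)
            stack.extend(children.get(model, []))
    return selected

prod_state   = {"stg_ads": "aaa", "int_perf": "bbb", "fct_perf": "ccc"}
branch_state = {"stg_ads": "aaa", "int_perf": "NEW", "fct_perf": "ccc"}
downstream   = {"stg_ads": ["int_perf"], "int_perf": ["fct_perf"]}
print(sorted(modified_plus(prod_state, branch_state, downstream)))
# int_perf changed, so it and fct_perf are rebuilt; stg_ads is deferred
```

This is why Slim CI stays cheap on large projects: the unchanged stg_ads is never rebuilt, and --defer resolves its ref() to the production relation instead.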
