Thursday, March 26, 2026

production-level walkthrough

Below is a deep, production-level walkthrough of each step in your dbt pipeline, from raw → staging → incremental → SCD → marts → testing → deployment, including:

  • dbt internal components involved
  • SQL / Python role
  • Why the step exists
  • How Git supports each step

STEP 1: RAW DATA EXTRACT (Sources)

🔹 What happens

  • Data lands in Snowflake raw schema via:
    • COPY INTO / Snowpipe
  • dbt does NOT ingest data; it only registers these tables as sources

🔹 dbt Internal Components

  • Source Config (sources.yml)
  • Parser
  • Adapter (Snowflake adapter)

🔹 SQL / Python Role

SELECT * FROM {{ source('raw','customers_raw') }}

  • SQL references raw tables
  • Python (optional): external ingestion pipelines
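
The `source()` call above resolves against a source definition in YAML. A minimal `sources.yml` sketch, assuming the raw schema and table names used in this walkthrough (the `database` value is an assumption — adjust it to your Snowflake database):

```yaml
version: 2

sources:
  - name: raw
    database: analytics            # assumption: your Snowflake database name
    schema: raw
    tables:
      - name: customers_raw
        loaded_at_field: updated_at   # enables dbt source freshness checks
```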

🔹 Why required

  • Establish trusted entry point
  • Enables lineage tracking
  • Prevents direct table coupling

🔹 Git Role

  • Version control for:
    • sources.yml
  • Tracks schema evolution
  • Enables rollback if source changes break the pipeline

STEP 2: STAGING LAYER

🔹 What happens

  • Clean, standardize, rename columns
  • Apply light transformations

🔹 dbt Internal Components

  • Model Parser
  • Jinja Engine (macros)
  • Compiler
  • Execution Engine

🔹 SQL Example

SELECT
  customer_id,
  LOWER(email) AS email,
  TRIM(name) AS name
FROM {{ source('raw','customers_raw') }}
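
Materialization for staging models is typically configured once per folder in `dbt_project.yml` rather than per file; a sketch assuming a project named `my_project` with `staging` and `marts` subfolders:

```yaml
models:
  my_project:
    staging:
      +materialized: view    # staging stays lightweight
    marts:
      +materialized: table   # marts are persisted for BI performance
```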

🔹 Why required

  • Removes inconsistencies
  • Creates reusable canonical layer

🔹 Git Role

  • Tracks column renaming decisions
  • Enables code review for transformations
  • Collaboration across teams

STEP 3: INCREMENTAL MODELS

🔹 What happens

  • Load only new or changed data
  • Avoid full table rebuilds

🔹 dbt Internal Components

  • Materialization Engine (incremental)
  • Macro Engine (is_incremental())
  • Adapter-specific SQL generation

🔹 SQL Example

{{ config(materialized='incremental') }}

SELECT *
FROM {{ ref('stg_customers') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

🔹 Why required

  • Improves performance
  • Reduces compute cost in Snowflake

🔹 Git Role

  • Tracks logic changes for incremental filters
  • Prevents accidental full refresh bugs
  • Enables safe experimentation via branches

STEP 4: SCD TYPE 2 (HISTORICAL TRACKING)

🔹 What happens

  • Tracks historical changes in dimensions
  • Maintains:
    • surrogate keys
    • effective dates
    • current flag

🔹 dbt Internal Components

  • Custom Macros (SCD logic)
  • Incremental + MERGE strategy
  • Dependency Graph (ref)

🔹 SQL Example (MERGE)

MERGE INTO dim_customers t
USING stg_customers s
  ON t.customer_id = s.customer_id AND t.is_current = TRUE

WHEN MATCHED AND t.email <> s.email THEN
  UPDATE SET is_current = FALSE, effective_end_date = CURRENT_TIMESTAMP

WHEN NOT MATCHED THEN
  INSERT (customer_id, email, effective_start_date, is_current)
  VALUES (s.customer_id, s.email, CURRENT_TIMESTAMP, TRUE)
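
Note that dbt ships this SCD Type 2 pattern as a built-in feature: snapshots. Instead of hand-writing MERGE logic, a snapshot block generates the surrogate bookkeeping (dbt adds `dbt_valid_from` / `dbt_valid_to` columns automatically). A minimal sketch, assuming `stg_customers` exposes a reliable `updated_at` column:

```sql
{% snapshot customers_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

SELECT * FROM {{ ref('stg_customers') }}

{% endsnapshot %}
```

Run with `dbt snapshot`; the hand-rolled MERGE remains useful when you need custom expiry rules.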

🔹 Why required

  • Enables:
    • historical reporting
    • audit tracking
    • slowly changing dimensions

🔹 Git Role

  • Critical for:
    • tracking SCD logic changes
    • auditing business rule evolution
  • Enables peer review for complex logic

STEP 5: DATA MART (GOLD LAYER)

🔹 What happens

  • Build fact and dimension tables
  • Optimize for BI queries

🔹 dbt Internal Components

  • Ref Graph (DAG builder)
  • Materializations (table/view)
  • Query Planner

🔹 SQL Example

SELECT
  o.order_id,
  c.customer_sk,
  o.order_amount
FROM {{ ref('fact_orders') }} o
JOIN {{ ref('dim_customers') }} c
  ON o.customer_sk = c.customer_sk
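
The `customer_sk` surrogate key used here is commonly generated with the `dbt_utils` package; a sketch assuming `dbt_utils` is installed via `packages.yml` and that `effective_start_date` distinguishes customer versions:

```sql
SELECT
  {{ dbt_utils.generate_surrogate_key(['customer_id', 'effective_start_date']) }} AS customer_sk,
  customer_id,
  email
FROM {{ ref('stg_customers') }}
```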

🔹 Why required

  • Serves:
    • dashboards
    • analytics
  • Improves performance via denormalization

🔹 Git Role

  • Tracks business logic
  • Maintains consistency across metrics
  • Enables versioned analytics definitions

STEP 6: TESTING & VALIDATION

🔹 What happens

  • Validate data quality:
    • uniqueness
    • null checks
    • referential integrity

🔹 dbt Internal Components

  • Test Runner
  • Schema YAML parser
  • Assertion engine

🔹 YAML Example

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
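
Referential integrity (the third check above) is expressed with dbt's built-in `relationships` test; a sketch of an entry that slots under a model's `columns:`, assuming an `orders` model whose `customer_id` must exist in `stg_customers`:

```yaml
- name: customer_id
  tests:
    - relationships:
        to: ref('stg_customers')
        field: customer_id
```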

🔹 Why required

  • Prevents bad data propagation
  • Ensures trust in analytics

🔹 Git Role

  • Tracks test coverage
  • Enforces quality via CI/CD
  • Prevents bad merges

STEP 7: DOCUMENTATION & DEPLOYMENT

🔹 What happens

  • Generate lineage docs
  • Deploy models via jobs

🔹 dbt Internal Components

  • Docs Generator
  • Manifest.json
  • Run Artifacts
  • DAG Renderer

🔹 Commands

dbt docs generate
dbt docs serve
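
The docs site pulls column-level documentation from `description` fields in the same schema YAML used for tests; a sketch:

```yaml
models:
  - name: dim_customers
    description: "Customer dimension with SCD Type 2 history."
    columns:
      - name: customer_sk
        description: "Surrogate key; one row per customer version."
```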

🔹 Why required

  • Provides:
    • lineage graph
    • column-level documentation
  • Helps onboarding & governance

🔹 Git Role

  • Stores:
    • documentation YAML
  • Enables:
    • versioned lineage
    • auditability

Internal Flow inside dbt:

  1. Parser
    • Reads SQL + YAML
  2. Jinja/Macro Engine
    • Expands dynamic logic
  3. Compiler
    • Converts to raw SQL
  4. Adapter (Snowflake)
    • Optimizes SQL
  5. Execution Engine
    • Runs queries
  6. Test Engine
    • Validates output
  7. Docs Generator
    • Builds lineage + metadata
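
These internal stages map roughly onto the CLI commands you run day to day (a simplified mapping — several commands trigger multiple stages):

```
dbt parse          # 1. parser: reads SQL + YAML, builds the manifest
dbt compile        # 2-3. Jinja expansion + compilation to raw SQL
dbt run            # 4-5. adapter + execution engine
dbt test           # 6. test engine
dbt docs generate  # 7. docs generator: lineage + metadata artifacts
```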

RAW → SOURCE → STAGING → INCREMENTAL → SCD → MART → TEST → DOCS

🚀 Key Takeaways

  • dbt is a compiler + orchestrator, not an ingestion tool
  • SQL is the primary transformation layer
  • Python is optional (advanced models)
  • Git enables:
    • collaboration
    • versioning
    • CI/CD
  • Each step builds trust, performance, and scalability
