Below is a deep, production-level walkthrough of each step in your dbt pipeline (raw → staging → incremental → SCD → marts → testing → deployment), including:
- dbt internal components involved
- SQL / Python role
- Why the step exists
- How Git supports each step
STEP 1: RAW DATA EXTRACT (Sources)
🔹 What happens
- Data lands in the Snowflake raw schema via COPY INTO / Snowpipe
- dbt does NOT ingest data; it only registers sources
🔹 dbt Internal Components
- Source config (sources.yml)
- Parser
- Adapter (Snowflake adapter)
🔹 SQL / Python Role
SELECT * FROM {{ source('raw','customers_raw') }}
- SQL references raw tables
- Python (optional): external ingestion pipelines
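The source referenced above must first be declared in YAML before `{{ source() }}` can resolve it. A minimal sketch of a `sources.yml` entry (the database name and `_loaded_at` column are illustrative):

```yaml
version: 2

sources:
  - name: raw                       # logical name used in {{ source('raw', ...) }}
    database: analytics             # illustrative database name
    schema: raw                     # Snowflake schema where data lands
    tables:
      - name: customers_raw         # exposed as source('raw', 'customers_raw')
        loaded_at_field: _loaded_at # optional: enables source freshness checks
```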
🔹 Why required
- Establish trusted entry point
- Enables lineage tracking
- Prevents direct table coupling
🔹 Git Role
- Version control for sources.yml
- Tracks schema evolution
- Enables rollback if source changes break the pipeline
STEP 2: STAGING LAYER
🔹 What happens
- Clean, standardize, rename columns
- Apply light transformations
🔹 dbt Internal Components
- Model Parser
- Jinja Engine (macros)
- Compiler
- Execution Engine
🔹 SQL Example
SELECT
customer_id,
LOWER(email) AS email,
TRIM(name) AS name
FROM {{ source('raw','customers_raw') }}
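Since the Jinja engine is listed as a component here, repeated cleanup rules like the ones above can be factored into a macro. A minimal sketch (the macro name is illustrative):

```sql
-- macros/clean_string.sql: trims and lowercases a column (illustrative helper)
{% macro clean_string(column_name) %}
    LOWER(TRIM({{ column_name }}))
{% endmacro %}
```

The staging model could then write `{{ clean_string('email') }} AS email`, so the same rule is applied consistently across every staging model.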
🔹 Why required
- Removes inconsistencies
- Creates reusable canonical layer
🔹 Git Role
- Tracks column renaming decisions
- Enables code review for transformations
- Collaboration across teams
STEP 3: INCREMENTAL MODELS
🔹 What happens
- Load only new or changed data
- Avoid full table rebuilds
🔹 dbt Internal Components
- Materialization engine (incremental)
- Macro engine (is_incremental())
- Adapter-specific SQL generation
🔹 SQL Example
SELECT *
FROM {{ ref('stg_customers') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
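For `is_incremental()` to return TRUE, the model must be configured with the incremental materialization. A minimal config sketch placed at the top of the model file (the `unique_key` value is illustrative):

```sql
{{ config(
    materialized='incremental',
    unique_key='customer_id',     -- illustrative; drives the MERGE match condition
    incremental_strategy='merge'  -- the default strategy on Snowflake
) }}
```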
🔹 Why required
- Improves performance
- Reduces compute cost in Snowflake
🔹 Git Role
- Tracks logic changes for incremental filters
- Prevents accidental full refresh bugs
- Enables safe experimentation via branches
STEP 4: SCD TYPE 2 (HISTORICAL TRACKING)
🔹 What happens
- Tracks historical changes in dimensions
- Maintains:
- surrogate keys
- effective dates
- current flag
🔹 dbt Internal Components
- Custom Macros (SCD logic)
- Incremental + MERGE strategy
- Dependency Graph (ref)
🔹 SQL Example (MERGE)
MERGE INTO dim_customers t
USING stg_customers s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND t.email <> s.email THEN
UPDATE SET is_current = FALSE, effective_end_date = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (...)
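Note that dbt also ships a built-in SCD Type 2 mechanism, snapshots, which generates the validity columns (`dbt_valid_from` / `dbt_valid_to`) for you instead of a hand-written MERGE. A minimal sketch, assuming `stg_customers` exposes an `updated_at` column:

```sql
{% snapshot customers_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}
SELECT * FROM {{ ref('stg_customers') }}
{% endsnapshot %}
```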
🔹 Why required
- Enables:
- historical reporting
- audit tracking
- slowly changing dimensions
🔹 Git Role
- Critical for:
- tracking SCD logic changes
- auditing business rule evolution
- Enables peer review for complex logic
STEP 5: DATA MART (GOLD LAYER)
🔹 What happens
- Build fact and dimension tables
- Optimize for BI queries
🔹 dbt Internal Components
- Ref Graph (DAG builder)
- Materializations (table/view)
- Adapter (query planning is delegated to Snowflake)
🔹 SQL Example
SELECT
o.order_id,
c.customer_sk,
o.order_amount
FROM {{ ref('fact_orders') }} o
JOIN {{ ref('dim_customers') }} c
  ON o.customer_sk = c.customer_sk
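Mart models are typically materialized as tables and tuned for BI access patterns. A hedged config sketch (the clustering column is illustrative):

```sql
{{ config(
    materialized='table',
    cluster_by=['order_date']   -- illustrative Snowflake clustering key
) }}
```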
🔹 Why required
- Serves:
- dashboards
- analytics
- Improves performance via denormalization
🔹 Git Role
- Tracks business logic
- Maintains consistency across metrics
- Enables versioned analytics definitions
STEP 6: TESTING & VALIDATION
🔹 What happens
- Validates data quality:
- uniqueness
- null checks
- referential integrity
🔹 dbt Internal Components
- Test Runner
- Schema YAML parser
- Assertion engine
🔹 YAML Example
columns:
- name: customer_id
tests:
- not_null
- unique
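Referential integrity, mentioned above, is covered by the built-in `relationships` test. A sketch extending the same YAML:

```yaml
columns:
  - name: customer_id
    tests:
      - not_null
      - unique
      - relationships:
          to: ref('dim_customers')
          field: customer_id
```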
🔹 Why required
- Prevents bad data propagation
- Ensures trust in analytics
🔹 Git Role
- Tracks test coverage
- Enforces quality via CI/CD
- Prevents bad merges
STEP 7: DOCUMENTATION & DEPLOYMENT
🔹 What happens
- Generate lineage docs
- Deploy models via jobs
🔹 dbt Internal Components
- Docs Generator
- Manifest.json
- Run Artifacts
- DAG Renderer
🔹 Commands
dbt docs generate
dbt docs serve
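A deployment job typically chains these commands; a minimal sketch (the `prod` target name is an assumption about your profiles.yml):

```shell
dbt deps                  # install package dependencies
dbt build --target prod   # run models, tests, snapshots, seeds ('prod' is illustrative)
dbt docs generate         # write manifest.json + catalog.json artifacts
```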
🔹 Why required
- Provides:
- lineage graph
- column-level documentation
- Helps onboarding & governance
🔹 Git Role
- Stores documentation YAML
- Enables versioned lineage and auditability
Internal Flow inside dbt:
- Parser: reads SQL + YAML
- Jinja/Macro engine: expands dynamic logic
- Compiler: converts models to raw SQL
- Adapter (Snowflake): generates dialect-specific SQL
- Execution engine: runs queries
- Test engine: validates output
- Docs generator: builds lineage + metadata
RAW → SOURCE → STAGING → INCREMENTAL → SCD → MART → TEST → DOCS
🚀 Key Takeaways
- dbt is a compiler + orchestrator, not an ingestion tool
- SQL is the primary transformation layer
- Python is optional (advanced models)
- Git enables:
- collaboration
- versioning
- CI/CD
- Each step builds trust, performance, and scalability