Wednesday, March 25, 2026

dbt_codegen package

 What is dbt_codegen (Quick Context)

dbt_codegen is a macro package that auto-generates repetitive dbt code (SQL + YAML) by introspecting your warehouse metadata.

👉 It reduces:

  • Manual boilerplate
  • Human error in YAML/docs/tests
  • Time-to-model for new datasets

Core Use Cases (with Architecture Thinking)

1) 🧱 Generate Source YAML from Raw Tables

Why it matters

  • Standardizes ingestion layer
  • Enables lineage + freshness + testing
Example:

{{ codegen.generate_source(
    schema_name='raw',
    database_name='analytics_db'
) }}

Output (auto-generated)

YAML
version: 2

sources:
  - name: raw
    tables:
      - name: customers
      - name: orders

Architect Insight

  • Use this when onboarding new schemas from ingestion tools (Fivetran, Airbyte)
  • Combine with:
    • freshness tests
    • source-level SLAs

2) Generate Base Models (Staging Layer)

Why it matters

  • Enforces consistent staging patterns
  • Accelerates Medallion architecture (Bronze → Silver)
Example

SQL
{{ codegen.generate_base_model(
    source_name='raw',
    table_name='orders'
) }}

Output

SQL
select
    id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}

Architect Insight
  • Pair with naming convention: stg_<source>__<table>
  • Add:
    • column casting
    • null handling
    • deduplication logic
3) Generate Model YAML (Docs + Tests)

Why it matters
  • Ensures every model is documented and testable
  • Critical for governance + data contracts
Example

SQL
{{ codegen.generate_model_yaml(
    model_names=['stg_orders']
) }}

Output

YAML
version: 2

models:
  - name: stg_orders
    columns:
      - name: id
        tests:
          - not_null
          - unique

Architect Insight

  • Use this to enforce:
    • Data quality SLAs
    • Contract-first modeling
  • Extend with:
    • accepted_values
    • relationships

4) Generate Column Lists Dynamically

Why it matters

  • Avoids SELECT *
  • Prevents breakage when schema evolves
Example
SQL
select
    {{ dbt_utils.star(from=ref('stg_orders')) }}
from {{ ref('stg_orders') }}

(Note: dbt_codegen itself does not ship a get_columns_in_relation macro — dynamic column expansion comes from dbt_utils.star() or the adapter's adapter.get_columns_in_relation() method.)

Output

SQL
select id, customer_id, order_date, amount

Architect Insight

  • Combine with:
    • incremental models
    • snapshot logic
  • Helps with schema drift handling
5) 🧪 Generate Generic Tests Quickly

Why it matters

  • Accelerates data quality adoption
Example

SQL
{{ codegen.generate_model_yaml(
    model_names=['fct_orders']
) }}

(Note: generate_model_yaml has no include_tests argument. It scaffolds the YAML structure with a columns: block; tests such as not_null or unique are then added to the generated YAML by hand or via automation.)

Architect Insight

  • Use for:
    • onboarding new teams
    • enforcing platform-wide testing standards

6) 🔄 Automate Documentation at Scale

Why it matters

  • Documentation is often neglected → dbt_codegen fixes that

Example Workflow

  1. Run macro
  2. Paste into .yml
  3. Add descriptions later

Architect Insight

  • Integrate into CI/CD:
    • auto-generate → PR → review

7) 🧩 Metadata-Driven Development

Why it matters

  • Moves toward declarative analytics engineering

Example Pattern

  • Introspect schema → generate models → enrich logic

Architect Insight

  • Combine with:
    • information_schema queries
    • data catalogs
    • AI-assisted model generation

🔷 How dbt_codegen Works in dbt Cloud (Internals)

Step-by-step Flow

1) Metadata Introspection

  • Queries warehouse system tables
    (information_schema.columns)

2) Macro Execution

  • Jinja macros build:
    • SQL
    • YAML
    • Tests

3) Output Rendering

  • Output appears in:
    • dbt Cloud IDE
    • CLI logs

4) Manual/Automated Integration

  • Developer copies OR pipelines inject into repo
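The flow above can be sketched in Python. This is an illustrative simulation of what a generate_source-style macro does with information_schema rows — not the package's actual Jinja code — and the table/column names are made up:

```python
# Illustrative sketch: turn information_schema.columns-style rows into a
# dbt sources YAML block, the way a codegen macro would.

def generate_source_yaml(schema_name: str, rows: list[tuple[str, str]]) -> str:
    """rows: (table_name, column_name) pairs, ordered by table."""
    tables: dict[str, list[str]] = {}
    for table, column in rows:
        tables.setdefault(table, []).append(column)

    lines = ["version: 2", "", "sources:",
             f"  - name: {schema_name}", "    tables:"]
    for table, columns in tables.items():
        lines.append(f"      - name: {table}")
        lines.append("        columns:")
        for col in columns:
            lines.append(f"          - name: {col}")
    return "\n".join(lines)

# Hypothetical metadata rows, as a warehouse query might return them
rows = [
    ("customers", "id"), ("customers", "email"),
    ("orders", "id"), ("orders", "customer_id"),
]
print(generate_source_yaml("raw", rows))
```

In the real package this YAML string is what appears in the dbt Cloud IDE or CLI logs for you to copy into a .yml file.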



End-to-End dbt ecosystem

 





This diagram visualizes the complete end-to-end dbt ecosystem incorporating all the technologies we discussed (Snowflake, Databricks, Fabric, Python models, Airflow, and advanced CI/CD).

The flow moves from left (Sources) to right (Marts and Delivery). Here is a detailed breakdown of each architectural stage and the data flow.

Data Sources & The Staging Layer (The DLH/W)

This is the intake and initial storage area. Raw data arrives from operational systems via Airflow-orchestrated ingestion tasks (using Fivetran/Airbyte) and lands in the unified Data Lakehouse/Warehouse (marked as a large cube supporting Snowflake Horizon, Databricks Unity, and MS Fabric OneLake).

The diagram illustrates two critical concepts we’ve covered:

  • Layered Storage: The warehouse is partitioned into a Bronze/Raw zone (raw data) and a Silver/Staging zone (initial dbt stg_ clean-up models).

  • The Codegen Loop: A specialized codegen operation is shown running outside the main DAG execution. It queries the data warehouse INFORMATION_SCHEMA (the process we visualized in image_0.png) and automatically generates the YAML source definitions and SQL base models. This workflow accelerates development time and reduces manual errors when onboarding new datasets.

The Transformation & Orchestration Layer (The Core DAG)

This is the heart of the modern dbt solution. The execution is segmented into the internal steps crucial for production resilience.

The Internal dbt Lifecycle:

  1. Model Parsing: dbt Core (or dbt Cloud) builds the dependency graph (DAG) by interpreting all SQL ref() macros.

  2. Compile & Manifest: The manifest.json is generated, which is the artifact required by Airflow Cosmos for dynamic task generation.

  3. Model Execution (dbt Build): The DAG is executed against the warehouse. The diagram explicitly displays polyglot engineering (SQL mixed with Python):

    • SQL Path: The execution of standard relational models (int_campaign_performance flowing into fct_ad_performance) which are pushed down to the warehouse engine.

    • Python Path: The execution of a complex model (py_ml_customer_segmentation). This triggers a distinct Snowpark Python Call (in Snowflake) or a PySpark Call (in Databricks/Fabric), processing the statistics and saving the result.

  4. Test & Governance: Once built, data quality assertions are run (dbt-expectations) and YAML constraints (Schema Contracts) are enforced.

Orchestration & Governance Sync:

  • Airflow DAG: The Airflow orchestration (via Astronomer Cosmos provider) maps directly to the dbt steps: Parse -> Compile -> Execute (Dynamic Task).

  • Catalog Sync: Successful dbt builds are synchronized with the respective warehouse governance tools, such as Databricks Unity Catalog, ensuring fine-grained access control is applied to the newly created data marts.

Production & Delivery

This column represents the finalized, production-grade state. The built, tested, and governed models are available in the Gold/Marts presentation schema. The data is now available to consumption engines for different business use cases: BI Dashboards, ML models (for inference), and Operational Sync (Reverse ETL to push data back to Salesforce, etc.).

Advanced Platform Engineering: CI/CD Pipeline (Bottom Right)

This flow details how a Principal Engineer ensures data platform stability:

  1. Commit Trigger: A developer pushes code (image_1.png references).

  2. Slim CI Build: The pipeline runs only the modified models and their downstream dependents. It uses dbt build with the --defer and --state flags to "defer" to the production schema for unmodified upstream tables, saving execution cost.

  3. Blue/Green Deployment: For the final release, a Zero-Copy Clone (Snowflake specific) is created to instantly swap production and staging environments, allowing zero-downtime, safe rollbacks.

Monday, March 23, 2026

dbt Manifest File

 1. Overview: The Brain of dbt

The manifest.json file is fundamentally the "brain" or the "central nervous system" of every dbt project. You won't find it in your source code directory (/models, /seeds, /snapshots). Instead, it is dynamically generated and stored in the /target directory every time dbt compiles or runs your project (e.g., via dbt compile, dbt run, dbt docs generate).

While dbt reads your human-readable YAML and SQL files, it does not execute them directly. dbt transforms your source code into this machine-readable JSON object. This unified structure allows dbt to understand the entire universe of your project, perform dependency resolution, validate configurations, and ultimately generate the executable SQL required by your data warehouse.

2. How the Manifest File is Generated

The creation of the manifest is a multi-stage compilation process where dbt translates your intentional code into executable instructions. Referencing the infographic, this process flows from left to right:

Step A: Raw Inputs (Your Project)

The process begins with the raw ingredients provided by the analytics engineer. The dbt parser reads these diverse inputs from your project directory:

  • Models: All .sql files containing CTEs and {{ config() }} blocks.

  • YAML Configs: All schema.yml, dbt_project.yml, and property files defining tests, descriptions, and sources.

  • Sources & Seeds: Definitions of external data (Sources) and CSV files (Seeds).

  • Macros & Packages: Custom reusable functions (Macros) and imported library code (Packages).

Step B: The Compilation/Parsing Engine

This is where the magic happens. When you run a command like dbt compile, dbt initializes its internal engine. This engine doesn't execute SQL yet; instead, it performs the following:

  1. Parsing: It reads every file, resolving all {{ ref() }} and {{ source() }} Jinja functions. It builds a map of which models depend on which other objects.

  2. Configuration Merging: It takes configurations defined at different levels (e.g., in dbt_project.yml vs. inside the model file itself) and merges them, following dbt's hierarchy rules to determine the final configuration for every node.

  3. Context Building: dbt prepares the full execution context (variables, environment variables, target connection details).

Step C: Manifest Assembly (The Output)

The result of this intensive parsing and linking is the manifest.json. It is a complete snapshot of the project at that specific moment in time. The dbt engine then uses this exact manifest to generate the optimized, executable SQL for your specific target warehouse (Snowflake, BigQuery, Redshift, etc.).

3. Deep Dive into Manifest Information

The infographic highlights the key structural sections within the massive manifest.json file. Each node (like a model, seed, or test) contains hundreds of lines of metadata.

A. Metadata Block

This section provides high-level context about the dbt execution that generated the file. It’s crucial for auditing and tracking changes over time.

  • dbt Version: The exact version of dbt Core or dbt Cloud used.

  • Project Name: The identity of the dbt project.

  • Target: The specific profile target executed (e.g., dev, prod).

  • Generated At: A precise timestamp (ISO 8601) of when the compilation finished.

B. Nodes Block (The Core Components)

This is the heart of the manifest. Every resource type within dbt—models, seeds, snapshots, and tests—is cataloged as a unique "node." A node for a specific model (model.my_project.my_first_model) contains exhaustive details:

  • SQL (Raw & Compiled): It stores both the original raw SQL (containing Jinja) and the final compiled SQL that is ready to be sent to the warehouse. (These keys are raw_sql/compiled_sql in dbt versions before 1.3, renamed to raw_code/compiled_code from dbt 1.3 onward.)

  • Materialization Details: Specifies how the model is built (e.g., table, view, incremental, ephemeral).

  • Config: A resolved dictionary of all configurations applied to this node, including tags, schema, database, and custom meta configs.

  • Patch Path: The path to the YAML ("patch") file that supplies this node's descriptions and properties.

C. Sources & Seeds Blocks

These are special node types that define the inputs to your transformation pipeline.

  • Sources: Defines raw data outside dbt’s control. The manifest tracks details like loader, database, schema, tables, and freshness constraints.

  • Seeds: Details about CSV files loaded into the warehouse by dbt. This includes column data types and the hashed content to detect changes.

D. Macros Block

Every custom macro and standard dbt macro utilized in the project is cataloged here. This allows dbt to validate macro calls during parsing. It stores the macro name, arguments, and the raw Jinja code.

4. Dependency Mapping: The DAG Visualized

The most powerful function of the manifest.json is that it contains all the information necessary to construct the Directed Acyclic Graph (DAG) of your project. This linkage is managed within each node's metadata:

  1. depends_on (Input Arrows): Every node contains an array of unique node IDs that it depends upon. For example, model_B depends on model_A.

  2. Ref IDs (The Edges): dbt resolves the {{ ref('model_A') }} in model_B into a specific unique ID (e.g., model.my_project.model_A).

When dbt runs, it reads the manifest, builds the DAG from these depends_on relationships, and uses topological sorting to determine the correct execution order. This ensures model_A finishes successfully before model_B starts.
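A minimal Python sketch of this mechanism, using the standard library's graphlib; the node IDs mirror the example above and are hypothetical:

```python
# Sketch of what dbt does with the manifest's depends_on data: build a DAG
# and topologically sort it to get a safe execution order.
from graphlib import TopologicalSorter

# Simplified stand-in for manifest.json's node -> depends_on mapping
depends_on = {
    "model.my_project.model_A": [],
    "model.my_project.model_B": ["model.my_project.model_A"],
    "model.my_project.model_C": ["model.my_project.model_A",
                                 "model.my_project.model_B"],
}

# static_order() yields each node only after all of its predecessors
order = list(TopologicalSorter(depends_on).static_order())
print(order)  # model_A comes before model_B, which comes before model_C
```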

5. Why the Manifest File Matters

Beyond just running your project, the manifest.json is foundational for advanced dbt workflows:

  • State Comparison (Slim CI): The manifest is the key to Slim CI. By comparing the manifest.json from a production run with the manifest of a development run, dbt can identify only the models or tests that have changed (e.g., dbt run --select state:modified --state path/to/prod_artifacts, where --state points to the directory containing the production manifest.json). This slashes CI run times.

  • dbt Documentation: The interactive documentation website generated by dbt docs generate is entirely powered by the data within manifest.json and catalog.json.

  • Project Audit & Observability: Third-party tools or custom scripts can parse the manifest to audit project complexity, check test coverage, enforce coding standards (linting), or generate operational dashboards.
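As a sketch of such an audit script — the manifest here is a tiny inlined stand-in for what you would load from target/manifest.json with json.load(), and the node names are hypothetical:

```python
# Illustrative audit: find models with no tests attached by walking the
# manifest's nodes and each test's depends_on list.

manifest = {
    "nodes": {
        "model.proj.stg_orders": {"resource_type": "model"},
        "model.proj.stg_payments": {"resource_type": "model"},
        "test.proj.not_null_stg_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.proj.stg_orders"]},
        },
    }
}

models = {k for k, n in manifest["nodes"].items()
          if n["resource_type"] == "model"}

tested = set()
for n in manifest["nodes"].values():
    if n["resource_type"] == "test":
        tested.update(n.get("depends_on", {}).get("nodes", []))

untested = models - tested
print(sorted(untested))  # models with zero tests attached
```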



dbt (data build tool) Deep Dive

 dbt (data build tool) manages analytics engineering by transforming raw data in a warehouse into clean, reliable datasets. Understanding its internals helps senior engineers optimize performance and debug complex issues.

Part 1: dbt Internals — The Compilation and Execution Engine

Understanding how dbt moves from code to execution is crucial for optimization and debugging at scale. The process is a structured pipeline that transforms your project definition into sequential database operations.

1. Project Parsing and Manifest Generation

dbt first reads your dbt_project.yml and scans your /models, /macros, and /snapshots directories. It loads all configurations and code into an internal memory structure. The output of this phase is the Manifest (manifest.json), which acts as a static representation of every node in your project and their initial configurations.

2. DAG Construction and Jinja Rendering

This is where dbt resolves the logic of your models. Using the data from the manifest, dbt constructs the Directed Acyclic Graph (DAG) by analyzing the dependency chain established by ref() and source() functions.

Simultaneously, for each node in the DAG, dbt traverses the code and renders the Jinja. This transforms procedural logic, macro logic (like date spine generation), and abstraction into the final, hard-coded SQL statement tailored for your target warehouse (e.g., Snowflake, BigQuery).

3. Execution, Deferral, and Materialization

The final phase is the physical execution. dbt connects to your warehouse and runs the compiled SQL. In a development environment, dbt maximizes efficiency by using Deferral.

As shown in the diagram, dbt identifies which models in your branch differ from production (using state:modified). When executing the new orders model, dbt 'defers' the upstream dependency: it runs against the existing users table already in the Production namespace, rather than rebuilding it in your development schema. dbt then applies the Materialization (e.g., CREATE TABLE AS... or MERGE) to build only the modified model in your environment.

Part 2: SQL vs. Python Models — The Hybrid DAG

At a principal level, you must know when to pivot from SQL to Python. While SQL excels at set-based transformations and massive joins, Python (via Snowpark or Databricks) is necessary for procedural logic, utilizing PyData libraries, or specialized formatting that is complex in SQL.

The following architecture demonstrates a hybrid DAG:

Example Walkthrough:

  1. SQL Heavy Lifting: Data ingestion and initial joining occur in blue SQL nodes (stg_orders, stg_payments), leveraging the warehouse's compute optimization.

  2. Python Transformation: The intermediate data (int_order_payments) is passed to an orange Python model (int_calculate_features). This model might use pandas to apply complex procedural logic or data formatting that is impossible or highly inefficient in pure SQL.

  3. Final SQL Mart: The refined data is passed back to a blue SQL node (fct_order_features) for final modeling and exposure to BI tools.
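A hedged sketch of what such a Python model might look like. The model(dbt, session) entry point and dbt.ref() follow dbt's Python-model convention, but the upstream names and the feature itself are invented for illustration; the logic lives in a plain function so it runs without a warehouse session:

```python
# Feature logic that is awkward in pure SQL: a running share of each order
# against the customer's trailing total, computed procedurally.

def add_features(rows: list[dict]) -> list[dict]:
    totals: dict = {}
    out = []
    for r in rows:
        cust = r["customer_id"]
        totals[cust] = totals.get(cust, 0) + r["amount"]
        out.append({**r,
                    "share_of_customer_total": r["amount"] / totals[cust]})
    return out

# Hypothetical dbt Python model wrapping the logic above (not runnable
# outside dbt, so shown as comments):
# def model(dbt, session):
#     df = dbt.ref("int_order_payments")   # upstream SQL model
#     records = df.to_pandas().to_dict("records")
#     return pandas.DataFrame(add_features(records))

print(add_features([
    {"customer_id": 1, "amount": 50.0},
    {"customer_id": 1, "amount": 50.0},
]))
```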

Part 3: Git Workflow & Environment Strategy — Slim CI

The most complex challenge in maintaining large dbt installations is implementing an efficient CI/CD pipeline. To prevent hour-long integration tests and massive warehouse costs, Senior Engineers implement the Slim CI pattern.

Slim CI Workflow and State Deferral

This diagram illustrates how Slim CI optimizes the standard GitFlow process using dbt's state and defer capabilities:

Workflow Summary:

  1. Trigger: A Pull Request triggers the CI job on the feature branch.

  2. State Loading: The CI job fetches the manifest.json from the last successful production run (main branch) and the new manifest from the PR.

  3. Modified Models: dbt uses state:modified to identify that the green int_calculate_features.py model is the only change.

  4. Deferral & CI Run: This is the key optimization. The CI job only builds the green model, but it defers references for all unchanged models (stg_orders, stg_payments) to the Production Environment/Schema. This allows dbt to test the modified code against existing production data, rather than building the entire DAG into a temp schema.

  5. Merge & Deploy: After testing, the PR is merged, and the production manifest.json is updated, making this the new baseline for subsequent runs.
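Under the hood, state:modified boils down to comparing node checksums between the two manifests. A simplified stand-alone illustration (node IDs and checksum values are invented):

```python
# Sketch of state comparison: a node is "modified" if its checksum in the
# PR manifest differs from the production manifest (or it is new).

prod_manifest = {
    "nodes": {
        "model.proj.stg_orders": {"checksum": "aaa"},
        "model.proj.int_calculate_features": {"checksum": "bbb"},
    }
}
pr_manifest = {
    "nodes": {
        "model.proj.stg_orders": {"checksum": "aaa"},              # unchanged
        "model.proj.int_calculate_features": {"checksum": "ccc"},  # edited
    }
}

modified = [
    node_id
    for node_id, node in pr_manifest["nodes"].items()
    if prod_manifest["nodes"].get(node_id, {}).get("checksum") != node["checksum"]
]
print(modified)  # only this subset is built; the rest defer to prod
```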











Data Build Tool (dbt)

 

dbt (data build tool) is a metadata-driven transformation framework that functions as a DAG-based SQL compiler and execution orchestrator for cloud data warehouses. Internally, it parses project files to construct a dependency graph using ref() and source(), then compiles Jinja-templated models into optimized SQL via its macro engine. Execution is delegated to the warehouse, with parallelization governed by graph topology. Core artifacts like manifest.json encode full lineage, configurations, and compiled nodes, while run_results.json captures execution telemetry. This architecture positions dbt as a control plane that unifies transformation logic, lineage, testing, and observability within modern data platforms.





What dbt Really Is (Architect Perspective)

At its core, dbt is a:

👉 Metadata-driven transformation framework
👉 SQL compiler + DAG execution engine
👉 Control plane over warehouse compute

Inside dbt Internals

  • DAG
  • manifest.json
  • Execution Engine
dbt is NOT a processing engine
  • SQL Compiler + DAG Execution Framework
DAG Parsing
  • dbt scans project files
  • Builds dependency graph using ref()
  • Creates Directed Acyclic Graph
Graph Structure

Each node =

  • Model
  • Test
  • Seed
Each edge = dependency
👉 This drives execution order

manifest.json
The Brain of dbt

Contains:

  • DAG structure
  • Model metadata
  • Compiled SQL
  • Lineage
Why manifest.json Matters
  • Powers dbt docs
  • Enables lineage tools
  • Integrates with DataHub / OpenLineage
Compilation Engine

Jinja SQL → Compiled SQL

Includes:

  • Macros
  • Variables
  • Environment configs
Execution Model
dbt:
❌ Does NOT process data
✅ Pushes SQL to warehouse
Parallel execution based on DAG

run_results.json

Tracks:

  • Execution status
  • Runtime metrics
  • Failures

👉 Used for observability

Architect Insight

If you understand:
✔ DAG
✔ manifest.json

👉 You understand dbt at scale

dbt = Metadata-driven transformation layer


Core vs Cloud vs Fusion — Strategic Comparison




Saturday, January 31, 2026

 

How do you decide the number of executors, cores, and memory?


Rule-of-thumb (for a node with N cores, M GB RAM):

  • Leave 1 core + ~1–2 GB for OS/overhead.
  • Target 4–5 cores per executor (to limit GC overhead).
  • Memory per executor: (node_memory - OS_reserve) / num_executors_per_node.
  • Total executors = (#nodes × executors_per_node).
  • Fine-tune by monitoring the Spark UI: adjust if tasks are slow (need more cores/executors) or hit OOM (need more memory / fewer cores per executor).
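The rule of thumb above, expressed as a small calculator. The defaults are heuristics, not hard limits, and the example node shape (10 nodes × 16 cores × 64 GB) is arbitrary:

```python
# Executor sizing from the rule of thumb: reserve a core and ~1-2 GB per
# node for the OS, cap cores per executor, and split the remaining memory.

def size_executors(nodes: int, cores_per_node: int, mem_gb_per_node: int,
                   cores_per_executor: int = 5, os_reserve_gb: int = 2):
    usable_cores = cores_per_node - 1             # leave 1 core for OS/daemons
    usable_mem = mem_gb_per_node - os_reserve_gb  # leave ~1-2 GB for overhead
    execs_per_node = usable_cores // cores_per_executor
    mem_per_exec = usable_mem // execs_per_node
    return {
        "executors_per_node": execs_per_node,
        "total_executors": nodes * execs_per_node,
        "cores_per_executor": cores_per_executor,
        "memory_per_executor_gb": mem_per_exec,
    }

print(size_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64))
```

For 16-core/64 GB nodes this yields 3 executors per node with 5 cores and 20 GB each; treat that as a starting point and iterate against the Spark UI.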

 

What are the main components of a Spark cluster and how do they interact?

  • Driver: Runs your main program, builds logical plans, coordinates tasks, holds metadata, sometimes collects results.
  • Executors: JVM processes on worker nodes that run tasks, store cached data, and write shuffle files.
  • Cluster manager (YARN / Kubernetes / Databricks / Standalone): Allocates resources (containers/pods/VMs) for driver and executors.
  • Flow: Driver requests resources from cluster manager → cluster manager starts executors → driver sends tasks to executors and tracks progress.