Saturday, January 2, 2010

Data Warehousing Terms

Bill Inmon vs. Ralph Kimball

In the data warehousing field, we often hear discussions about whether a person's or an organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe the difference between the two below.

Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouses in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as data marts. Only when more data marts are built later do they evolve into a data warehouse.



Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can use it run faster.
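
As a minimal sketch (the table and column names here are hypothetical), aggregation amounts to grouping fact rows by a subset of their dimension keys and summing the measures:

from collections import defaultdict

# Hypothetical daily sales facts: (date_key, store_key, product_key, sales_amount)
fact_sales = [
    ("2010-01-01", "S1", "P1", 100.0),
    ("2010-01-01", "S1", "P2", 50.0),
    ("2010-01-02", "S1", "P1", 75.0),
    ("2010-01-01", "S2", "P1", 20.0),
]

# Aggregate away the product dimension: sum sales by date and store.
agg_sales_by_store_day = defaultdict(float)
for date_key, store_key, _product_key, amount in fact_sales:
    agg_sales_by_store_day[(date_key, store_key)] += amount

# The aggregate table has fewer rows than the original fact table, so queries
# at the date/store grain can read it instead and return results faster.
print(dict(agg_sales_by_store_day))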

Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute in the Time dimension.

Conformed Dimension: A dimension that has exactly the same meaning and content when referenced from different fact tables.

Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a more limited audience and/or data content.

Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process (as defined by Bill Inmon).

Data Warehousing: The process of designing, building, and maintaining a data warehouse system.

Dimension: A category of information. For example, year, month, day, and week are all part of the Time Dimension.

Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimension tables and fact tables. Dimension tables record information on each dimension, and fact tables record the facts, or measures.

Dimensional Table: A dimension table stores records related to a particular dimension. No facts are stored in a dimension table.

Drill Across: Data analysis across dimensions.

Drill Down: Data analysis to a child attribute.

Drill Through: Data analysis that goes from an OLAP cube into the relational database.

Drill Up: Data analysis to a parent attribute.

ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one area to another.

Fact Table: A type of table in the dimensional model. A fact table typically includes two types of columns: fact columns and foreign keys to the dimensions.

Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All attributes in a hierarchy belong to the same dimension.

Metadata: Data about data. For example, the number of tables in the database is a type of metadata.

Metric: A measured value. For example, total sales is a metric.

MOLAP: Multidimensional OLAP. MOLAP systems store data in multidimensional cubes.

OLAP: On-Line Analytical Processing. OLAP systems should be designed to provide end users with a quick way of slicing and dicing the data.

ROLAP: Relational OLAP. ROLAP systems store data in a relational database.

Snowflake Schema: A common form of dimensional model. In a snowflake schema, different hierarchies in a dimension can be extended into their own dimensional tables. Therefore, a dimension can have more than a single dimension table.

Star Schema: A common form of dimensional model. In a star schema, each dimension is represented by a single dimension table.
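
As a minimal sketch (all table and column names here are hypothetical), a star schema can be pictured as dimension tables keyed by surrogate keys and a fact table whose rows carry foreign keys into those dimensions plus the measures:

# Hypothetical dimension tables, keyed by surrogate keys.
dim_store = {
    1: {"store_name": "Downtown", "city": "Chicago"},
    2: {"store_name": "Airport", "city": "Los Angeles"},
}
dim_date = {
    20100101: {"day": 1, "month": 1, "quarter": "Q1", "year": 2010},
}

# Fact table: foreign keys to the dimensions plus the measure column.
fact_sales = [
    {"date_key": 20100101, "store_key": 1, "sales_amount": 150.0},
    {"date_key": 20100101, "store_key": 2, "sales_amount": 80.0},
]

# A "sales by city" query joins the fact table to the store dimension.
sales_by_city = {}
for row in fact_sales:
    city = dim_store[row["store_key"]]["city"]
    sales_by_city[city] = sales_by_city.get(city, 0.0) + row["sales_amount"]

print(sales_by_city)  # {'Chicago': 150.0, 'Los Angeles': 80.0}

In a snowflake schema, an attribute such as city would instead be normalized out of dim_store into its own table, joined to dim_store rather than directly to the fact table.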

Bus, Star, and Snowflake Schemas in Data Warehousing
A bus schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

A star schema is a way of organizing tables so that results can be retrieved from the database quickly in a data warehouse environment.

In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.

SCD is an abbreviation for Slowly Changing Dimensions in data warehousing.

SCD applies to cases where the attribute for a record varies over time. There are three different types of SCD.

a) SCD1: The new record replaces the original record.
Only one record exists in the database - the current data.

b) SCD2: A new record is added into the customer dimension table.
Two records exist in the database - the current data and the previous history data.

c) SCD3: The original record is modified to include the new data.
Only one record exists in the database - the new information is attached to the old information in the same row.
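
A minimal sketch of the three approaches, using a hypothetical customer dimension row whose state attribute changes (the column names and surrogate keys are made up for illustration):

import copy

# Original dimension row; the customer's state attribute will change over time.
original = {"customer_key": 1, "name": "Christina", "state": "Illinois"}

# SCD Type 1: overwrite the attribute in place; no history is kept.
scd1 = [copy.deepcopy(original)]
scd1[0]["state"] = "California"

# SCD Type 2: add a new row with a new surrogate key; both rows are kept.
scd2 = [copy.deepcopy(original),
        {"customer_key": 2, "name": "Christina", "state": "California"}]

# SCD Type 3: keep one row, but add a column holding the previous value.
scd3 = [copy.deepcopy(original)]
scd3[0]["original_state"] = scd3[0].pop("state")
scd3[0]["current_state"] = "California"

print(scd1)  # one row, current value only
print(scd2)  # two rows, history kept as separate records
print(scd3)  # one row, old and new values side by side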

Hybrid Slowly Changing Dimension in Data Warehousing:
Hybrid SCDs are a combination of SCD Type 1 and SCD Type 2.

It may happen that in a table, some columns are important and we need to track changes to them, i.e., capture their historical data, whereas for other columns we do not care even if the data changes.

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

However, they are not considered useless. If there are changes in the dimensions, the same facts can still be useful.
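
A small sketch of why such a fact cannot simply be summed (the margin-percentage column here is a hypothetical example of a non-additive fact):

# Hypothetical fact rows with additive facts (sales, cost) and a
# non-additive fact (margin percentage).
facts = [
    {"sales": 100.0, "cost": 80.0, "margin_pct": 20.0},
    {"sales": 300.0, "cost": 150.0, "margin_pct": 50.0},
]

# Summing the non-additive fact gives a meaningless number (70.0).
wrong_total_margin = sum(row["margin_pct"] for row in facts)

# The correct approach is to re-derive the margin from the additive facts.
total_sales = sum(row["sales"] for row in facts)
total_cost = sum(row["cost"] for row in facts)
correct_margin = (total_sales - total_cost) / total_sales * 100  # 42.5

print(wrong_total_margin, correct_margin)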
==================================================================================
Dimensional Data Model

Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
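
As a rough sketch (the attribute names and sample rows are hypothetical), a hierarchy can be represented as an ordered list of attributes, and drilling up or down corresponds to aggregating the facts at a coarser or finer attribute along that path:

from collections import defaultdict

# Time hierarchy, listed from the highest level to the lowest.
time_hierarchy = ["year", "quarter", "month", "day"]

# Hypothetical fact rows tagged with the attributes of the Time dimension.
facts = [
    {"year": 2009, "quarter": "Q1", "month": 1, "day": 15, "sales": 100.0},
    {"year": 2009, "quarter": "Q1", "month": 2, "day": 3,  "sales": 40.0},
    {"year": 2009, "quarter": "Q2", "month": 4, "day": 9,  "sales": 70.0},
]

def roll_up(rows, attribute):
    """Aggregate sales at the given attribute of the hierarchy."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[attribute]] += row["sales"]
    return dict(totals)

# Drill up from month to quarter, or drill down from quarter to month.
print(roll_up(facts, "quarter"))  # {'Q1': 140.0, 'Q2': 70.0}
print(roll_up(facts, "month"))    # {1: 100.0, 2: 40.0, 4: 70.0}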

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2009 may be represented as "Q1 2009" or "2009 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes, when there is a business case to analyze the information at that particular level.


The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

Customer Key    Name         State
1001            Christina    Illinois

In January 2003, she moved to Los Angeles, California. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.

Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and how the data model and the data look for each of them. Finally, we compare and contrast the three alternatives.
------------------------------------------------------------------------------
Conceptual Data Model

A conceptual data model identifies the highest-level relationships between the different entities. Features of a conceptual data model include:

Includes the important entities and the relationships among them.
No attribute is specified.
No primary key is specified.

The figure below is an example of a conceptual data model.


From the figure above, we can see that the only information shown via the conceptual data model is the entities that describe the data and the relationships between those entities. No other information is shown through the conceptual data model.
------------------------------------------------------------------------------------

Logical Data Model

A logical data model describes the data in as much detail as possible, without regard to how it will be physically implemented in the database. Features of a logical data model include:

Includes all entities and relationships among them.
All attributes for each entity are specified.
The primary key for each entity is specified.
Foreign keys (keys identifying the relationship between different entities) are specified.
Normalization occurs at this level.

The steps for designing the logical data model are as follows:

Specify primary keys for all entities.
Find the relationships between different entities.
Find all attributes for each entity.
Resolve many-to-many relationships.
Normalize the data.

The figure below is an example of a logical data model.

Logical Data Model


Comparing the logical data model shown above with the conceptual data model diagram, we see the main differences between the two:

In a logical data model, primary keys are present, whereas in a conceptual data model, no primary key is present.
In a logical data model, all attributes are specified within an entity. No attributes are specified in a conceptual data model.
Relationships between entities are specified using primary keys and foreign keys in a logical data model. In a conceptual data model, the relationships are simply stated, not specified, so we simply know that two entities are related, but we do not specify what attributes are used for this relationship.
----------------------------------------------------------------------------
Physical Data Model

A physical data model represents how the model will be built in the database. A physical database model shows all table structures, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables. Features of a physical data model include:

Specification of all tables and columns.
Foreign keys are used to identify relationships between tables.
Denormalization may occur based on user requirements.
Physical considerations may cause the physical data model to be quite different from the logical data model.
The physical data model will be different for different RDBMSs. For example, the data type for a column may be different between MySQL and SQL Server.

The steps for physical data model design are as follows:

Convert entities into tables.
Convert relationships into foreign keys.
Convert attributes into columns.
Modify the physical data model based on physical constraints / requirements.

The figure below is an example of a physical data model.


Comparing the physical data model shown above with the logical data model diagram, we see the main differences between the two:

Entity names are now table names.
Attributes are now column names.
Data type for each column is specified. Data types can be different depending on the actual database being used.
--------------------------------------------------------------------------
Conceptual, Logical, and Physical Data Models

The three levels of data modeling, the conceptual data model, the logical data model, and the physical data model, were discussed in prior sections. Here we compare these three types of data models. The table below compares the different features:

Feature                  Conceptual   Logical   Physical
Entity Names                 ✓            ✓
Entity Relationships         ✓            ✓
Attributes                                ✓
Primary Keys                              ✓          ✓
Foreign Keys                              ✓          ✓
Table Names                                          ✓
Column Names                                         ✓
Column Data Types                                    ✓

We can see that the complexity increases from conceptual to logical to physical. This is why we always first start with the conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how they will actually be implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a data warehousing project, sometimes the conceptual data model and the logical data model are considered as a single deliverable.
----------------------------------------------------------------------------------
Data Integrity

Data integrity refers to the validity of data, meaning data is consistent and correct. In the data warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no data integrity in the data warehouse, any resulting report and analysis will not be useful.

In a data warehouse or a data mart, there are three areas where data integrity needs to be enforced:

Database level
We can enforce data integrity at the database level. Common ways of enforcing data integrity include:

Referential integrity
The relationship between the primary key of one table and the foreign key of another table must always be maintained. For example, a primary key cannot be deleted if there is still a foreign key that refers to this primary key.

Primary key / Unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a table can be uniquely identified.

NOT NULL vs NULL-able
Columns identified as NOT NULL may not contain a NULL value.

Valid Values
Only allowed values are permitted in the database. For example, if a column can only contain positive integers, a value of '-1' is not allowed.

ETL process
For each step of the ETL process, data integrity checks should be put in place to ensure that the source data is the same as the data in the destination. The most common checks include record counts or record sums.
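
A minimal sketch of such a check, assuming hypothetical source and destination row sets that have already been read into memory:

# Hypothetical extracted source rows and loaded destination rows.
source_rows = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 250.0},
]
destination_rows = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 250.0},
]

# Record-count check: the number of rows must match after the load.
assert len(source_rows) == len(destination_rows), "record count mismatch"

# Record-sum check: a control total over a numeric column must also match.
source_sum = sum(row["amount"] for row in source_rows)
destination_sum = sum(row["amount"] for row in destination_rows)
assert abs(source_sum - destination_sum) < 1e-9, "record sum mismatch"

print("ETL integrity checks passed")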

Access level
We need to ensure that data is not altered by any unauthorized means either during the ETL process or in the data warehouse. To do this, there need to be safeguards against unauthorized access to data (including physical access to the servers), as well as logging of all data access history. Data integrity can only be ensured if there is no unauthorized access to the data.
