Tuesday, October 13, 2009

The Dangers of Denormalizing in Dimensional Modeling

When designing data warehouse models, the designer is freed from many of the constraints faced in the past when using the normalization rules of entity relationship modeling. In particular, we are no longer forced to avoid data replication. This article discusses some of the unexpected side effects of denormalization, along with alternative design solutions.

We begin with a brief recent history of modeling, from the entity relationship perspective, through the early data warehouse “models,” to the more structured solutions devised by Bill Inmon and Ralph Kimball.

Since denormalizing in dimensional models has side effects, we review those side effects and present multiple alternatives for intelligently creating denormalized designs. These alternatives include continuing to replicate attributes, keeping dimension tables normalized, or separating attributes into distinct dimension tables and selectively recombining them.

Since our designs are typically very flexible, no one solution is the “right” one, but we discuss the advantages and disadvantages of each.

Chief among the benefits of data warehouse dimensional models is their inherent simplicity. Using the same tools as traditional entity relationship modeling, the business user can now intuitively understand the structure of the model and find, with relative ease, the database attributes required for reporting and analysis. The dimensional model, or star schema, also gives the designer a simple mechanism for training and validation with even the most non-technical staff, since the model is based on business needs and subject areas. Because the model is built from a series of two-table joins, the business can both understand the model and provide feedback early in the design process if there are gaps or inconsistencies. Technically, the dimensional model is still based largely on relational database technology, so the advantages of the major RDBMS packages, and the skill sets required to maintain the physical models, carry over into the new realm.

With all these advantages, it would appear that dimensional modeling has no downside. However, one side effect of dimensional models becoming more commonplace is the runaway replication of some attributes. While this appears benign during the design phase, it can cause significant maintenance and reporting problems by complicating the model. As denormalized data models become more widely accepted, we must be careful to add structure to them or face a new round of challenges in maintaining them.

Traditional entity relationship modeling followed the tenets of normalization introduced by Dr. E.F. Codd in the 1970s and widely adopted since then. E-R modeling breaks entities (tables in the physical database) into small sets of distinct attributes (columns), joining them on unique identifiers (keys). The final model includes a number of entities which, theoretically, have no repeating or redundant values. The biggest value of such a model was that it made the most efficient use of a scarce and expensive resource (disk storage) and provided the best technique for inserting, updating, and retrieving data in large online applications such as banking, retail, or Web-based transactions. Normalization also brought structure to earlier data models, which were largely built around a particular application. Figure 1 shows a purchase order model designed using standard normalization techniques.

Figure 1.
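Since Figure 1 itself is not reproduced here, the following is a minimal sketch of what a normalized purchase order model of this kind typically looks like. The table and column names are illustrative assumptions rather than the actual contents of the figure, and an in-memory SQLite database stands in for the warehouse RDBMS.

```python
import sqlite3

# A minimal, hypothetical sketch of a normalized purchase order model
# (the actual tables and columns in Figure 1 may differ).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE supplier (
    supplier_id   INTEGER PRIMARY KEY,
    supplier_name TEXT
);
CREATE TABLE item (
    item_id     INTEGER PRIMARY KEY,
    item_number TEXT,
    description TEXT
);
CREATE TABLE purchase_order (
    po_id       INTEGER PRIMARY KEY,
    po_number   TEXT,
    order_date  TEXT,
    supplier_id INTEGER REFERENCES supplier(supplier_id)
);
CREATE TABLE purchase_order_line (
    po_line_id INTEGER PRIMARY KEY,
    po_id      INTEGER REFERENCES purchase_order(po_id),
    item_id    INTEGER REFERENCES item(item_id),
    quantity   INTEGER,
    unit_price REAL
);
""")

# Reporting ordered quantity by supplier and item already needs a
# four-table join -- the querying cost discussed below.
query = """
SELECT s.supplier_name, i.item_number, SUM(l.quantity) AS qty_ordered
FROM purchase_order_line l
JOIN purchase_order po ON po.po_id = l.po_id
JOIN supplier s        ON s.supplier_id = po.supplier_id
JOIN item i            ON i.item_id = l.item_id
GROUP BY s.supplier_name, i.item_number;
"""
print(con.execute(query).fetchall())
```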

The disadvantage of E-R modeling, however, is that it does not lend itself to querying and reporting. Having many small tables means that reports or ad hoc queries for a particular subject area will most likely require joins across several of them. In general, each successive table join translates into additional disk reads, and since disk reads are orders of magnitude slower than in-memory reads, a normalized model holding millions, or even hundreds of thousands, of records can slow response and report run times by minutes or hours.

Also contributing to the appearance of non-normalized models was the drop in disk prices. It was no longer imperative that the model and underlying data be fully normalized, since hardware costs fell significantly with the arrival of more powerful PCs in the early 1990s. This opened the door to new modeling techniques, particularly those geared directly to reporting. Early reporting, or executive information system, applications tended to throw out modeling techniques altogether. Since the goal was to speed reporting, early data warehouse models were simply a few large tables with all of the reporting attributes replicated in a single table; if the same attribute was required in another report, it was simply replicated in a second table. An example of the purchase order subject area is shown in Figure 2. It soon became apparent, however, that not normalizing at all was cumbersome, disk intensive, and inflexible from a maintenance and data loading perspective. It was also clearly impossible to add every attribute required for all reporting and ad hoc querying into a single table.


Figure 2.

The next generation of practitioners took two distinct routes. Dr. Bill Inmon took the approach that the data warehouse consists of a repository of historical, clean data drawn from multiple source systems. This repository, or information store, could be used directly for some detailed reporting, but it also served as a store from which future data models geared specifically to reporting could be identified and modeled. This approach took the application system, or normalized model, as the source; identified the subject areas important for reporting; designed the subject area report tables along with their important measures; and left the standard descriptive attributes modeled as they appeared in the initial normalized model. The results of this technique appear in Figure 3.


Figure 3.

This technique allowed the warehouse design to remain true to its initial source while placing the design emphasis on getting the correct measures for the business. However, it left the final reporting and analysis database design open ended. The result was that many of the early dimensional models retained their initial normalized flavor, which allowed for little or no reporting performance improvement, since much of the design emphasis remained on the store of data in the normalized model. As a result, many businesses continued to maintain a highly normalized, clean data model separate from, but similar to, the initial source application model. Performance issues persisted, since reporting was still done against a series of table joins.

Dr. Ralph Kimball’s design method allowed for intelligently and selectively denormalizing the data model while still retaining the measurements inherent in Dr. Inmon’s approach. Central to Dr. Kimball’s design is the observation that the biggest performance bottleneck of the normalized model is the additional table joins and their associated disk reads, so eliminating those joins speeds up reporting. Dr. Kimball’s method also brought structure to the large denormalized tables. Since it is impossible to predict all necessary reporting attributes and replicate them in every table, this approach separates the reporting, or descriptive, attributes into their own tables and joins each back to the measurement table through a simple two-table join. This keeps the disk-read bottleneck to a minimum while structuring the large tables’ attribute sets into specific entities. In Dr. Kimball’s star schema, there are no longer multiple normalized versions of each descriptive table but simply a single, intuitive, denormalized version (Figure 4).


Figure 4.
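As a rough sketch of the structure Figure 4 implies, the example below builds a small star: one fact table of measures joined to denormalized dimension tables through simple two-table joins. The table names and the flattened hierarchy columns are hypothetical, not taken from the figure.

```python
import sqlite3

# A minimal, illustrative star schema in the spirit of Figure 4
# (names are hypothetical).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item_dim (
    item_key    INTEGER PRIMARY KEY,   -- surrogate key
    item_number TEXT,
    description TEXT,
    item_group  TEXT,                  -- hierarchy levels flattened
    item_family TEXT                   -- into the one dimension table
);
CREATE TABLE supplier_dim (
    supplier_key     INTEGER PRIMARY KEY,
    supplier_name    TEXT,
    supplier_city    TEXT,
    supplier_country TEXT
);
CREATE TABLE date_dim (
    date_key   INTEGER PRIMARY KEY,
    full_date  TEXT,
    month_name TEXT,
    year       INTEGER
);
CREATE TABLE purchase_order_fact (
    date_key     INTEGER REFERENCES date_dim(date_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key),
    quantity     INTEGER,
    po_amount    REAL
);
""")

# Each reporting question resolves through simple two-table joins
# from the fact to the dimensions -- no chain of normalized lookups.
query = """
SELECT d.year, i.item_family, SUM(f.po_amount) AS total_amount
FROM purchase_order_fact f
JOIN date_dim d ON d.date_key = f.date_key
JOIN item_dim i ON i.item_key = f.item_key
GROUP BY d.year, i.item_family;
"""
print(con.execute(query).fetchall())
```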

Dr. Kimball’s approach has been widely adopted and has proven again and again to be a solid design methodology. Since this article is not intended to be a Kimball versus Inmon discussion, the remainder of the article is related specifically to denormalizing using Dr. Kimball’s star schema design.

The danger when using the star schema design technique, however, has been the widely adopted practice of designers simply ignoring the rules of normalization. Although the star schema simplifies the initial model and allows reporting through a series of two-table joins, it is not intended to create free-for-all, unstructured models. Too often, designers have created models that simply replicate attributes across multiple dimension tables. The danger in doing so is that it reduces the simplicity of the model and increases the cost and complexity of maintaining it. For example, as we can see in Figure 5, the item dimension table includes attributes from the people dimension (planner name) as well as from the organization dimension (organization name). While this may be necessary from a reporting perspective, particularly if the item is related specifically to one or more organizations, it complicates the model by adding redundancy back into it. If the source application system changes and the extract, transform, and load (ETL) process is not fully documented, the complexity of modifying the load for multiple organization codes or multiple employee names increases. In a large-scale data warehouse implementation we may face this or similar issues across 20 to 40 dimension tables with hundreds of attributes.

These large dimension tables are especially critical in the warehouse because they are what are known as the “conformed dimensions.” Conformed dimensions are the dimension tables that contain the biggest and most critical elements and that appear in several different stars within the data warehouse or across several departmental data marts. So not only should we be concerned about replicating data across several dimension tables, we should also be aware that the tables affected by this replication are usually the conformed dimensions, which are the most critical to the business.

The typical characteristics of a conformed dimension are as follows:

The table is wide with many attributes

One of the benefits of designing large dimension tables is the ability of the user to query across several attributes that may not have been apparent during the initial design of the report or reports using the dimension table. For this reason, while the design challenge is harnessing repeating attributes, we should not respond by shying away from including a significant number of attributes in the dimension table.
The table appears across several or all subject areas or stars

Typically, businesses place a great deal of time and commitment into ensuring that items, customers, and organizational hierarchies are reported on in the same way corporatewide. The resulting conformed dimension tables should be the same for the marketing departmental data mart as they are for the finance and manufacturing departmental data marts.
The table contains attributes which are critical for reporting and which usually must be reported on in a standard way

The conformed dimensions are usually those whose attributes are stored under a corporatewide, agreed-upon definition and which include the standard corporate hierarchies. Reporting is usually done on one or several attributes in the conformed dimensions, rolling up or down the hierarchies contained in the table. For example, against the item dimension, the company usually reports individual orders by item number while summarizing total sales by item group or family.
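A small, hypothetical illustration of that roll-up behavior: the same conformed item dimension answers both the detail query (by item number) and the summary query (by item family), with nothing more than the grouping level changed.

```python
import sqlite3

# Detail vs. roll-up on a conformed item dimension (hypothetical names):
# one dimension serves order-level detail and family-level summaries.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY,
                       item_number TEXT, item_group TEXT, item_family TEXT);
CREATE TABLE sales_fact (item_key INTEGER REFERENCES item_dim(item_key),
                         sales_amount REAL);
""")

# Detail: individual sales by item number.
detail = """
SELECT i.item_number, SUM(f.sales_amount)
FROM sales_fact f JOIN item_dim i ON i.item_key = f.item_key
GROUP BY i.item_number;
"""

# Roll-up: the same join, summarized one hierarchy level higher.
rollup = """
SELECT i.item_family, SUM(f.sales_amount)
FROM sales_fact f JOIN item_dim i ON i.item_key = f.item_key
GROUP BY i.item_family;
"""
print(con.execute(detail).fetchall(), con.execute(rollup).fetchall())
```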

All of these characteristics compound the replication problem. Replication also adds complexity to the front-end reporting tool, whether by increasing the number of common elements in a front-end universe design, or by requiring the report developer to understand every copy of a common attribute in the model and decide which one to report from. Worse, if the ETL process updates an attribute in one dimension table while leaving its copy unchanged in a second dimension table, two report developers can easily write conflicting reports.

Although the flexibility of dimensional modeling frees us from the restrictions of traditional ER modeling, in order to gain the full benefit of the simpler model we still need to follow techniques to apply denormalization in a consistent manner. Essentially, we can do one of the following:

Selectively ignore the data replication issue and continue to design large dimensions with repetitive attributes.
Maintain a unique normalized staging area/operational data store for use as a single source.
Maintain normalized or snowflaked dimension tables and join these tables where applicable.
Design models which denormalize attributes into a single denormalized table.

Continue to design with repetitive attributes. Despite the disadvantages stated above, this technique has the benefit of giving the business user a wealth of reporting options without having to select from multiple dimension tables. In our example, one specific and potentially critical ad hoc query that is answerable without multiple table joins is “What are the item groupings/families by item planner?” Another may be planner and product counts, used to gauge the business impact of a planner transfer. Also critical in this particular example is that the same item number genuinely exists as a distinct item under various organizations. By separating the organization into a separate dimension table, the ad hoc query user may understate the number of distinct items, since the combined item number/organization ID is not considered (see the sketch following Figure 5). This is a good argument in favor of using this method. If both the ETL process and the front-end metadata layer are well documented, the inconsistencies involved in replicating attributes should also be minimal. However, this design method should be used selectively.

Figure 5.
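The distinct-count concern raised above can be made concrete with a small, hypothetical example: when the same item number legitimately exists under two organizations, counting distinct item numbers alone understates the number of item/organization instances the business actually manages.

```python
import sqlite3

# Hypothetical illustration of the distinct-count concern: the same item
# number exists under two organizations, so counting distinct item numbers
# alone understates the number of item instances.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY,
    item_number TEXT,
    organization_name TEXT,   -- replicated from the organization dimension
    planner_name TEXT         -- replicated from the people dimension
)""")
con.executemany(
    "INSERT INTO item_dim VALUES (?, ?, ?, ?)",
    [(1, "A100", "Plant East", "Jones"),
     (2, "A100", "Plant West", "Smith"),
     (3, "B200", "Plant East", "Jones")],
)

# 2 distinct item numbers ...
print(con.execute("SELECT COUNT(DISTINCT item_number) FROM item_dim").fetchone())
# ... but 3 distinct item/organization instances -- the figure the business
# usually means when items are organization-specific.
print(con.execute(
    "SELECT COUNT(DISTINCT item_number || '/' || organization_name) FROM item_dim"
).fetchone())
```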

Maintain a unique normalized staging area/operational data store as a single source. This is similar to Dr. Inmon’s approach. The staging schema would remain normalized but may differ slightly from the initial source system model(s), both because the staging area typically combines multiple source systems and because the “fact” table measures would still be rolled into specific source staging tables.

The biggest benefit of this approach is that when a change is made to the underlying source system, the ETL process need only modify a single attribute from the source to the staging area, rather than managing multiple changes across multiple dimension tables in the data warehouse ETL process. As the single source for the dimensional data warehouse, the staging area would always hold the new, correct value. The dimensional model could conceivably remain as in Figure 5, with the attributes replicated across one or more dimension tables.
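A minimal sketch of that maintenance argument, with hypothetical tables: a planner’s name is corrected once in the normalized staging table, and the copies replicated into the item dimension are refreshed from that single source during the load.

```python
import sqlite3

# Hypothetical sketch of the staging-area argument: a source change is applied
# once in the normalized staging table, and every replicated copy of the
# attribute is refreshed from that single source during the load.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stg_employee (employee_id INTEGER PRIMARY KEY, planner_name TEXT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_number TEXT,
                           employee_id INTEGER, planner_name TEXT);
INSERT INTO stg_employee VALUES (7, 'Jones');
INSERT INTO item_dim     VALUES (1, 'A100', 7, 'Jones'), (2, 'B200', 7, 'Jones');
""")

# One correction in staging ...
con.execute("UPDATE stg_employee SET planner_name = 'Jones-Smith' WHERE employee_id = 7")

# ... and the replicated copies are refreshed from it in one load step,
# instead of hand-maintaining the change in every dimension table.
con.execute("""
UPDATE item_dim
SET planner_name = (SELECT planner_name FROM stg_employee
                    WHERE stg_employee.employee_id = item_dim.employee_id)
""")
print(con.execute("SELECT item_number, planner_name FROM item_dim").fetchall())
```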

The biggest disadvantage of this method is that businesses rarely consider the additional storage and hardware required for a staging area or operational data store important enough to incorporate into the data warehouse design. Too often, a clean operational data store is left out of the overall design because of this restriction, and data continues to be loaded into the dimensional model directly from the source system or systems. A second disadvantage is that this approach does not resolve the reporting complexity that replicated dimensional attributes create for the business community.

Figure 6.

Maintain normalized or snowflaked dimension tables. This was a widely used approach in earlier data warehouse models and is still in use today. Since an item dimension in the source system is based on several related tables, many early dimensional models simply retained this association. By keeping the dimension tables normalized within the dimensional model, we solve the repetitive attributes issue. The wrinkle we add is that we still need to join the item dimension to the people dimension and the organization dimension. This appears benign until we realize that, as in the model in Figure 5, the people dimension and organization dimension are also joined directly to the fact table in different roles. The resulting star schema appears in Figure 6.

While this approach solves our ETL maintenance issue, it can significantly complicate the reporting and ad hoc view of the data, depending on the reporting tool being used. An additional side effect is that performance may continue to suffer because of the multiple table joins from the fact table through to the outermost dimension table. Because of the added reporting complexity and the performance risk, this approach is not recommended.
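A hypothetical sketch of the snowflaked layout in Figure 6 shows the longer join path: because the item dimension stays normalized, a report that touches planner or organization attributes has to walk extra joins beyond the single fact-to-dimension join of a pure star.

```python
import sqlite3

# Hypothetical sketch of a snowflaked item dimension: the item table keeps
# foreign keys to organization and people instead of flattened attributes.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE organization_dim (org_key INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE people_dim       (person_key INTEGER PRIMARY KEY, planner_name TEXT);
CREATE TABLE item_dim         (item_key INTEGER PRIMARY KEY, item_number TEXT,
                               org_key INTEGER REFERENCES organization_dim(org_key),
                               planner_key INTEGER REFERENCES people_dim(person_key));
CREATE TABLE order_fact       (item_key INTEGER REFERENCES item_dim(item_key),
                               quantity INTEGER);
""")

# Quantity by organization and planner: fact -> item -> organization/people,
# i.e. three joins instead of the single two-table join of a pure star.
query = """
SELECT o.org_name, p.planner_name, SUM(f.quantity)
FROM order_fact f
JOIN item_dim i         ON i.item_key = f.item_key
JOIN organization_dim o ON o.org_key = i.org_key
JOIN people_dim p       ON p.person_key = i.planner_key
GROUP BY o.org_name, p.planner_name;
"""
print(con.execute(query).fetchall())
```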

Denormalize attributes into a single denormalized dimension table. The design flaw in Figure 5 is that the dimension tables have not been denormalized; they are simply non-normalized. The distinction between the two is subtle but important. When denormalizing, we take the normalized model and deliberately reverse the normalization into a single table. Too often, however, we replicate attributes indiscriminately, effectively producing a non-normalized model. A denormalized version of Figure 5 appears in Figure 7.
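The load logic below illustrates the distinction, using hypothetical source tables: denormalizing collapses the item’s own normalized sources (item, item group, item family) into one dimension row, without pulling in attributes that belong to other dimensions such as planner or organization.

```python
import sqlite3

# Hypothetical sketch: denormalizing means flattening the item's own
# normalized source tables into one dimension row, not copying in
# attributes that belong to other dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
-- normalized source shape
CREATE TABLE src_item_family (family_id INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE src_item_group  (group_id INTEGER PRIMARY KEY, group_name TEXT,
                              family_id INTEGER REFERENCES src_item_family(family_id));
CREATE TABLE src_item        (item_id INTEGER PRIMARY KEY, item_number TEXT,
                              group_id INTEGER REFERENCES src_item_group(group_id));

-- denormalized dimension: the hierarchy is flattened, nothing foreign added
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_number TEXT,
                       item_group TEXT, item_family TEXT);
""")

# The dimension load reverses the normalization in a single statement.
con.execute("""
INSERT INTO item_dim (item_key, item_number, item_group, item_family)
SELECT i.item_id, i.item_number, g.group_name, f.family_name
FROM src_item i
JOIN src_item_group g  ON g.group_id = i.group_id
JOIN src_item_family f ON f.family_id = g.family_id
""")
print(con.execute("SELECT * FROM item_dim").fetchall())
```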

This denormalized design does, however, create an interesting training issue, or reporting requirement. Since, in our example, items are related to item planners and are specific to organizations, if we do not replicate these attributes in the item dimension we risk misreporting the item information. We are left with the following choices:

Creating a relationship between the product dimension and the other dimensions through a factless fact table, or
Replicating the planner and organization information by joining the organization and employee dimensions to each applicable fact table.

Creating a relationship between the product dimension and other dimensions through a factless fact table. This option appears in Figure 8. To satisfy reporting questions such as “Who is the current planner for the item?” or “Which organizations does item xyz belong to?”, we create the association through a second, factless fact table. Doing so allows us to maintain the denormalized model while still letting the business report on dimensional relationships. It also, at least theoretically, allows us to capture these relationship changes over time, so that the business can track costly trends in planner or organization changes.
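A minimal, hypothetical sketch of this option: the factless fact table carries only the dimension keys (plus effective dating, an assumption added here to show how changes could be tracked over time), and the relationship questions are answered by joining through it.

```python
import sqlite3

# Hypothetical sketch of the factless fact table option: a keys-only table
# records which planner and organization an item belongs to, with effective
# dates so relationship changes can be tracked over time.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item_dim         (item_key INTEGER PRIMARY KEY, item_number TEXT);
CREATE TABLE people_dim       (person_key INTEGER PRIMARY KEY, planner_name TEXT);
CREATE TABLE organization_dim (org_key INTEGER PRIMARY KEY, org_name TEXT);

-- factless fact: only foreign keys (plus effective dating), no measures
CREATE TABLE item_assignment_fact (
    item_key       INTEGER REFERENCES item_dim(item_key),
    person_key     INTEGER REFERENCES people_dim(person_key),
    org_key        INTEGER REFERENCES organization_dim(org_key),
    effective_date TEXT,
    end_date       TEXT
);
""")

# "Who is the current planner for item A100, and in which organizations?"
query = """
SELECT p.planner_name, o.org_name
FROM item_assignment_fact a
JOIN item_dim i         ON i.item_key = a.item_key
JOIN people_dim p       ON p.person_key = a.person_key
JOIN organization_dim o ON o.org_key = a.org_key
WHERE i.item_number = 'A100' AND a.end_date IS NULL;
"""
print(con.execute(query).fetchall())
```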

Figure 7.

The biggest drawback of this design is that it requires the business to understand a separate relationship table and to know how to use it. It also does not solve our intrinsic problem, which was to report on the combined item, organization, and planner within a subject area. We address this next.

Replicating the planner and organization information by joining dimensions to each applicable fact table. The design in Figure 9 allows us to report on all of the significant attributes in the three dimensions relative to this subject area. In this design, we include the organization and planner as additional foreign keys on the existing fact table. This has the advantage of ensuring that the relationships are defined clearly enough for the business community to use them appropriately. The biggest disadvantage of this method is the additional foreign key joins, but because the foreign keys are numeric and typically indexed, the performance impact should be negligible.
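A hypothetical sketch of the Figure 9 design: the planner and organization surrogate keys ride on the fact table itself, so each dimension is still reachable through a single two-table join, and the numeric foreign keys can be indexed to keep the extra joins inexpensive.

```python
import sqlite3

# Hypothetical sketch: planner and organization surrogate keys are carried
# on the fact table, so all three dimensions are reachable through single
# two-table joins.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item_dim         (item_key INTEGER PRIMARY KEY, item_number TEXT);
CREATE TABLE people_dim       (person_key INTEGER PRIMARY KEY, planner_name TEXT);
CREATE TABLE organization_dim (org_key INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE order_fact (
    item_key   INTEGER REFERENCES item_dim(item_key),
    person_key INTEGER REFERENCES people_dim(person_key),   -- planner role
    org_key    INTEGER REFERENCES organization_dim(org_key),
    quantity   INTEGER,
    amount     REAL
);
-- the numeric foreign keys are indexed, keeping the extra joins cheap
CREATE INDEX ix_fact_person ON order_fact (person_key);
CREATE INDEX ix_fact_org    ON order_fact (org_key);
""")

query = """
SELECT i.item_number, p.planner_name, o.org_name, SUM(f.amount)
FROM order_fact f
JOIN item_dim i         ON i.item_key = f.item_key
JOIN people_dim p       ON p.person_key = f.person_key
JOIN organization_dim o ON o.org_key = f.org_key
GROUP BY i.item_number, p.planner_name, o.org_name;
"""
print(con.execute(query).fetchall())
```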

Figure 8.

There are, then, many options available when designing dimensional models. Most designs have benefits and hidden inconsistencies, and it is rare that a single solution will work for every design. Possibly the most important part of the final design is understanding the advantages and drawbacks of each option described above, and the consequences of a particular design decision over the long-term life of the warehouse.

Figure 9.
