`

Dimension Table Features(原创)

阅读更多

Rich Set of Dimensions
Dimensions provide context for facts. Without context, facts are impossible to interpret. For example, I might tell you, “Order dollars are $40,000.” This statement is of no use to you. It presents a fact (order dollars) without any explanation. Is this orders for one product or all products? Is it one day’s worth of orders or one year’s? To understand what “$40,000” means, you need more information. “Order dollars were $40,000 for electronic products in January 2009.” By adding dimensional context—a product category, a month, and a year—the fact has been made useful.
Generally, dimensions and their values add meaning in many ways:

  • They are used to filter queries or reports.
  • They are used to control the scope of aggregation for facts.
  • They are used to order or sort information.
  • They accompany facts to provide context on reports.
  • They are used to define master–detail organization, grouping, subtotaling, and summarization.

Put to these uses, dimensions unlock the value of facts. Dimensions can be added to queries in different combinations to answer a wide variety of questions. The larger the set of dimension attributes, the more ways that facts can be analyzed.

Tip: Provide a rich and comprehensive set of dimension attributes. Each new attribute dramatically increases the number of analytic possibilities.
Dimension tables with a large number of attributes maximize analytic value. They can be thought of as wide. If you listed their contents, each row would be quite long. In addition to storing common attributes, dimension tables store commonly used combinations of attributes. Codes may be supplemented with corresponding description values. Flags are translated from Boolean values into descriptive text, and multi-part fields are both preserved and broken down into constituent pieces. It is also important not to overlook numeric attributes that can serve as dimensions.

Common Combinations
In operational systems, it is common practice to break data elements down to constituent parts whenever possible. From these components, it is possible to construct any combinations that may be needed. For example, customer name may be broken down and stored as a first name, middle initial, and last name. These attributes can be combined to produce a full name, if needed.
These components have analytic value and, of course, will be included in a dimensional design. Unlike the operational schema, however, the dimensional schema should also include dimensions that represent common combinations of these elements. For example, if a large number of reports group information by full name, or in a last-name-first format, then these common usages should also appear as discrete dimension columns. This principle can be seen at work in Figure 3-2.
The customer dimension in Figure 3-2 shows how the three components of a customer name, such as John P. Smith, may be used to construct five attributes in the customer dimension table. Because each of these attributes is made available in the dimensional design, it is easy to use them to organize data, sort reports, order data, and so forth. Since they are not calculated in queries, it is possible for database administrators to index these columns, providing for efficient query performance.

Codes and Descriptions
In operational systems, it is common for the list of appropriate values in a domain to be described using codes. Elsewhere, a separate table is used to provide the corresponding descriptions. Often called reference values or lookup values, these descriptions may be more useful than the codes themselves. For example, a source table that stores order information might capture the type of customer. Rather than store various values, such as “Direct,” “Indirect,” or “Other,” the table only stores codes such as 001, 002, or 003. A separate reference table maps these codes to the description values. This facilitates maintenance of the appropriate list of values and streamlines storage.
From an analytic perspective, both the code and description are useful dimensions. For example, Figure 3-2 shows a type_code of 002 transformed into a pair of attributes in the dimensional design: one for the code and one for the description. Because the dimension table carries both, users are able to filter, access, and organize information in whatever way they see fit.

Flags and Their Values
Columns whose values are Boolean in nature are usually referred to as flags. In an operational system, these values may be stored in several ways. One method uses a column with a Boolean data type. Another method uses an integer, which will contain only the values 0 or 1, or a character, which will contain only the values “Y” or “N.” Some systems employ a special case of a code with two possible values: one indicating “True” and the other indicating “False.” In Figure 3-2, the source column credit_order_flag contains a “Y” for credit orders and an “N” for noncredit orders.
In a dimensional design, these flags may be used to filter queries or group facts. By storing a descriptive value for the flag, we make using the flag easier. For example, a report can break up orders into “Credit Order” and “Not Credit Order” categories. These descriptors are far more useful than 0/1 or Y/N, and can also be used less ambiguously when defining a query predicate or filter.

Multiple-Part Columns
Operational systems often contain attributes that have multiple parts, each part bearing some sort of significance. Account codes are a common example, made up of parts such as a company identifier, account code, subaccount code, and so forth. In a dimensional design, the entire attribute may be stored, along with additional attributes that isolate its constituent parts. If these subcomponents are codes, they may also be accompanied by corresponding description values.
In Figure 3-2, the operational system records a region code in the format XX-YYY. The first part of this code designates a country, and the second part designates a territory within that country. The value 07-701, for example, contains country code 07 and territory code 701, which correspond to the United States and East, respectively. The dimensional design contains the full code, as well as the constituent codes and their corresponding descriptions.

Dimensions with Numeric Values
While the majority of dimensions contain data that is textual, sometimes dimensions contain numeric data. Given that facts tend to be numeric, this can occasionally lead to confusion. You have already seen examples of dimensions that contain numeric data. In Figure 3-2, for example, numeric content may be found in customer_type_code, country_code, and territory_code. Other common examples of numeric data elements are sizes, telephone numbers, and Zip codes. All of these examples are clearly dimensions. They will be used to provide context for facts, to order data, to control aggregation, or to filter query results.
Some numeric attributes are less easy to identify as dimensions. For example, the unit price associated with an order is numeric. If 100 widgets are sold at $10 a piece, is the $10 unit price a fact or a dimension? If an attribute is commonly aggregated or summarized, it is a fact. If it is used to drive aggregations or summarizations, however, it is a dimension. In the case of a unit price, it is not useful to sum unit prices across multiple orders. On the other hand, it is useful to group orders by unit price, perhaps to answer the question, “How many did I sell at $10 each versus $12 each?” The unit price is, therefore, behaving as a dimension.

Tip It is not always clear whether a numeric data element is a fact or a dimension. When in doubt, pay close attention to how it will be used. If the element values are used to filter queries, order data, control aggregation, or drive master–detail relationships, it is most likely a dimension.

While unit amounts are dimensions, extended amounts are facts. As you will see later, multiplying a unit amount by the transaction quantity produces a value that can be aggregated or summarized. The unit amount is a useful dimension, and the extended amount is a useful fact. Both have their place in the dimensional design.

Benefits of Redundancy
Newcomers to dimensional design are often repulsed by redundant storage of multiple part codes and their components, storage of common combinations of elements such as names, or the expansion of flag values from Boolean indicators into descriptive text. These techniques can seem wasteful. Why not simply perform these computations “on the fly” at query time? The precomputation and storage of these redundant data elements have three advantages in an analytic environment:
performance, usability, and consistency.
Recall that dimensions are used to specify query predicates, define the scope of aggregation, govern the ordering of data, and establish master–detail structure in reports. If these elements must be computed at query time, performance will be significantly impacted. Precomputing and storing these columns reduces the burden on the DBMS at query time, allows the database administrator to optimize performance through the use of indexes and other techniques, and opens the door to more efficient query execution.
Second, this level of redundancy makes it much easier for users to interact with the analytic database. It is far easier for the end user to select data where the country is “United States” rather than look for records where the first two digits of region_code are 07. If a report calls for customer names to appear in a last-name-first format, it is much easier to select that in a single column than it is to concatenate multiple fields together, omitting the space after first name if there is no middle initial, and so forth. If a user wants to break out credit orders, it is much easier to have a flag that reads “Credit Approved” or “Credit Not Approved” than it is to translate a Boolean value. A business intelligence tool may provide some of this functionality but does not offer the simultaneous performance benefits.
Last, explicit storage of all dimensions guarantees that they are consistent, regardless of the application being used. A business intelligence tool can be used to make it easier to provide full names, isolate substrings, or decode flags, but these shortcuts are only available to users who are interacting with the analytic database through that tool. If there are other tools in use, or if report developers are writing their own SQL, there is the danger that these dimensions are not being constructed consistently.
Tip Embrace redundancy in the dimensional model. Explicit storage of each dimension attribute maximizes performance, aids usability, and guarantees consistency across applications. The rules of normalization are best suited to the needs of operational systems, not analytic systems.
None of this should be interpreted as an argument against the use of business intelligence tools. In fact, these tools are extremely valuable components of the data warehouse architecture. Using these tools in an effort to save space, however, is misguided. Storing dimensions explicitly ensures performance, usability, and consistency. Computing dimensions on the fly saves a few dollars worth of disk space.

 

参考至:《Star Schema The Complete Reference》

本文原创,转载请注明出处,作者

如有错误,欢迎指正

邮箱:czmcj@163.com

0
1
分享到:
评论

相关推荐

    Informatica PowerCenter8 Lab 中文版 实验手册

    Unit 9 Lab A: Load Promotions Dimension Table (Lookup and Persistent Cache) 105 Unit 9 Lab B: Features and Techniques II 113 Unit 10 Lab: Reload the Employee Staging Table 120 Unit 11 Lab: Load ...

    Big.Data.NoSQL.Architecting.MongoDB.epub

    Big Data and NoSQL technologies are taking the industry to a newer dimension. Big Data, MongoDB not only enables the user in understanding the buzz words “Big Data” and “NoSQL”, it also delves in...

    PostGIS cookbook

    Key Features Import and export geographic data from the PostGIS database using the available tools Maintain, optimize, and fine-tune spatial data for long-term viability Utilize the parallel support ...

    DevExpress VCL 12.2.6(v2012vol2.6) 源码-例子-帮助-part1

    Q487923 - OLAP mode - Empty data cells are displayed when a dimension includes a large number of grouping values Q487520 - OLAP mode - The "List index out of bounds" exception when expanding a ...

    DevExpress VCL 12.2.6(v2012vol2.6) 源码-例子-帮助-part2

    Q487923 - OLAP mode - Empty data cells are displayed when a dimension includes a large number of grouping values Q487520 - OLAP mode - The "List index out of bounds" exception when expanding a ...

    数位板压力测试

    The availability of drivers that support the features of the specification will simplify the process of developing Windows appli¬cation programs that in-corporate absolute coordinate input, and ...

    The Princeton Companion to Mathematics

    * Features nearly 200 entries, organized thematically and written by an international team of distinguished contributors * Presents major ideas and branches of pure mathematics in a clear, accessible ...

    postGIS 用户手册

    6.3 Avoiding dimension conversion . . . . . . . . . . . . . . . . . 50 6.4 Tuning your configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.4.1 Startup . . . . . . . . . . . . . . . ...

    LTE from A to Z (part1 of 3)

    Table of Content Principles and Motivation of LTE............................................1 1.1 Mobile Radio: Comparison between 3G and 4G..................2 1.1.1 Performance and Mobility ...

    LTE from A-Z 培训教材 (part2 of 3)

    Table of Content Principles and Motivation of LTE............................................1 1.1 Mobile Radio: Comparison between 3G and 4G..................2 1.1.1 Performance and Mobility ...

    LTE from A to Z

    Table of Content Principles and Motivation of LTE............................................1 1.1 Mobile Radio: Comparison between 3G and 4G..................2 1.1.1 Performance and Mobility ...

Global site tag (gtag.js) - Google Analytics