`

Concept of Key in Data Warehouse

阅读更多

Keys and history
In a star schema, each dimension table is given a surrogate key. This column is a unique identifier, created exclusively for the data warehouse. Surrogate keys are assigned and maintained as part of the process that loads the star schema. The surrogate key has no intrinsic meaning; it is typically an integer. Surrogate keys are sometimes referred to as warehouse keys. The surrogate key is the primary key of the dimension table.
Figure 1-5 A simple star schema for the orders process

In this article, surrogate keys will be easily identifiable by the suffix “_key” in the column name. In Figure 1-5, for example, the surrogate key for the customer dimension is called customer_key, the surrogate key for the salesperson dimension is called salesperson_key, and so forth. Illustrations in this book will always list the surrogate key for a dimension table as its first attribute.
Dimension tables also contain key columns that uniquely identify something in an operational system. Examples in Figure 1-5 include customer_id, sku, and salesperson_id. In the operational systems, these columns identify specific customers, products, and salespeople, respectively. These key columns are referred to as natural keys.
The separation of surrogate keys and natural keys allows the data warehouse to track changes, even if the originating operational system does not. For example, suppose that customer ABC Wholesalers is identified by customer_id 10711 in an operational system.
If the customer changes its headquarters location, the operational system may simply overwrite the address for customer_id 10711. For analytic purposes, however, it may be useful to track the history of ABC Wholesalers. Since the star schema does not rely on customer_id to identify a unique row in the customer dimension, it is possible to store multiple versions of ABC Wholesalers, even though both have the same customer_id—10711. The two versions can be distinguished by different surrogate key values. While it would also be possible to support change tracking by supplementing a natural key with a sequence number, the surrogate key allows fact and dimension tables to be joined based on a single column.
Surrogate Keys and Natural Keys
In the star schema in Figure, surrogate keys for each dimension table are labeled “SK.” Separate and distinct from surrogate keys, one or more natural keys will also be present in most dimension tables.

The natural keys are identifiers carried over from source systems. They may not uniquely identify a row in the data warehouse, but they do identify a corresponding entity in the source system. In above Figure, natural key columns are designated “NK.”
Unlike surrogate key values, the values in natural key columns may have meaning to users of the data warehouse. Even when they do not carry significant meaning, their presence is necessary for the ETL routines that load fact tables.
Sometimes, the natural key for a dimension table consists of more than one column.This occurs when the source system uses a multi-part key to identify the entity. For example,a purchasing contract may be identified by a type code and sequence number. When more than one system can be the source for a dimension, the natural key may be composed of the identifier from the source system and an additional identifier that indicates which source it came from. For example, a bank may have more than one system for deposit accounts after acquiring another bank. The natural key for a customer dimension might,therefore, consist of an identifier used in a source system, in conjunction with a column that indicates the system from which the identifier came.
The use of surrogate keys as unique identifiers allows the data warehouse to respond to changes in source data in whatever manner best fits analytic requirements. Because the dimension table does not rely on the natural key to identify unique rows, it can maintain history even if the source system does not. For example, an order entry system might contain a record for customer_id 404777, which includes the customer’s address. If the system overwrites the address when a customer moves, it is not tracking history. Were the
customer dimension table to use the same customer_id to identify unique rows, it would be able to store only one row for customer_id 404777. It would be unable to maintain the history of the address. By using a surrogate key, it becomes possible to maintain two versions of customer_id 404777. This technique is known as a type 2 slow change.
A possible alternative to the use of a surrogate key is to supplement the natural key with a sequence number. For example, the primary key of the customer dimension table might consist of the customer_id and a version_number column that contains a sequence number. Like the use of surrogate keys, this technique permits the data warehouse to track history independently of the source system, allowing the table to store multiple versions of a customer. This approach provides no value in simplifying the schema design or load
process, however, which must still identify and maintain version history. More importantly,this technique requires multi-part foreign keys to be maintained in the fact table. If customers are identified by customer_id and sequence_no, this pair of columns must be present in order_facts as well. This multi-column foreign key complicates the join process, makes SQL more difficult to read, and in some cases may befuddle efforts to optimize the RDBMS for star join query execution.
Another theoretical alternative to the use of a surrogate key is to supplement a natural key with time stamps. While time stamps may be useful, a multi-part foreign key would still be required in fact tables, potentially leading to the same difficulties as the sequence number. In addition, a designer may be tempted to eliminate multi-column joins by storing only the natural key in the fact table; however, this severely complicates queries and risks error. For example, assume that customer_id 404777 has moved, and therefore has two
rows in the dimension table. Each fact table row contains only the natural key 404777. To identify which version of the customer corresponds to each fact table row, it is necessary to compare order date with the time stamps in the dimension table. This process can be onerous, particularly if one is constructing a report that aggregates a large number of orders. It also becomes difficult for database administrators to tune the system, preventing them, for example, from declaring a foreign key in the fact table and potentially leading to
poor query optimization. Worse, if the date qualification is omitted, facts associated with customer_id 404777 will be double-counted.
It is not common practice to use version numbers or time stamps as part of a unique identifier. Surrogate keys simplify the schema design and allow for clean, single-column joins. Time stamps are frequently included in dimension tables, but not as part of the unique identifier.

 

参考至:《Star Schema The Complete Reference》

如有错误,欢迎指正

邮箱:czmcj@163.com

分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics