`

《Pro Oracle SQL》--Chapter 5--5.5 Questions about Data

阅读更多

Questions about Data  关于数据的问题 (page 134)
    I hope at this point you agree that you do need to concern yourself with how data is stored and how it should be accessed.  Where do you find this information?  The database can give you most of the answers you need by executing a few simple queries.  Once you have this data, you then need to determine how data should be accessed.  This comes from understanding how the various access and join methods work and when it is appropriate to use each.  I’ve already covered access and join methods, so you’ve got the information you need to help you there.  But how do you discover how the data is stored?  Let’s walk through the questions you need to ask and queries you can execute to get the answers.
    我希望现在起你开始关心数据是如何存储及访问了。你能从哪获得这些信息?你只要执行几个简单的查询数据库就能给你大多数答案。一旦你有了数据,你就判断数 据是如何访问的。这需要理解各种访问和链接方法是如何工作的以及何时适当的使用它们。我已经讲解过访问和连接方法,你可以在那(第四章)获得这些(知识) 信息。但是你如何发现数据是怎么存储的?让我们一起过一下你需要问的问题以及演示运行能获得答案的查询。
    As a first step, try to think like the optimizer would.  The optimizer needs statistics and instance
parameter values to be able to compute a plan.  Therefore, it’s a good idea for you to put yourself in the optimizer’s place and gather the information that will help to formulate the execution plan.  You should always seek out the answers to the following questions about the data:
•  Which tables will be needed to gather all the data required?
•  Are any of the tables partitioned, and if so, how are the partitions defined?
•  What columns are in each table?
•  What indexes are available in each table?
•  What are the statistics for each table, column, and index?
•  Are there histograms on any of the columns?
     第一步,像优化器一样思考。优化器需要统计信息和实例参数值来计算(生成)一个(执行)计划。因此,你设身处地的站在优化器的立场上收集能计算执行计划信息是聪明的想法。你应当总是查找下面关于数据的问题的答案:
• 为了收集所有需要的数据要用到哪些表?
• 有哪些表分区了?如果是这样,分区是如何定义的?
• 每张表有哪些列?
• 每张表,每个列,每个索引的统计信息有哪些?
• 有哪些列分簇了?
    Statistics help the optimizer paint a picture of how the various ways of accessing and joining data
will perform.  You can know what the optimizer knows.  All you need to be able to do is query the
information from the data dictionary.  One thing to keep in mind when you’re reviewing statistics is that statistics may or may not accurately represent your data.  If the statistics are stale, missing, or poorly collected, it’s possible that they paint the wrong picture.  The optimizer can only know what the
statistics tell it.  You, on the other hand, have the ability to determine if the statistics make sense.  For
example, if a date column in one of your tables has a high value of six months ago, you can quickly see
that and know that rows exist with current date values.  That visual inspection can help you determine if statistics need to be updated.  But you can’t know these kinds of things unless you look.  A key question you must always ask is whether or not the statistics accurately represent your data.  Listing 5-3 uses a single script named st-all.sql (previously used in Chapter 2) to answer each of the questions listed above in one simple script. It gives you a single source to review to verify how representative the available statistics really are.
    统计信息帮助优化器绘制出一幅如何访问和连接数据的运行图。你能知道优化器知道什么。你都能查询数据字典获得这些信息。请记住一件事:你查看的统计信息可能,也可能不代表你的数据。如果统计信息过时了,遗漏了,或者收集不良。有可能就会绘制错误的(运行)图。优化器只知道统计所告诉它的信息。而你,从另一 反面,有能力判断统计是否合理。例如,如果你的表的一个数据列有一个6个月前的高位值,你能快速的看出存在当前日期的行。目视检查能帮你确定是否统计需要 更新。除非你检查才知道这种事。一个关键的你需要不断问的问题是:统计是否准确的代表你当前的数据。列表5-3用一个名叫st-all.sql的简单脚本 (之前在第二章使用过)来解答所有上面列出的问题。它作为单一来源让你去复习,确认有效的统计实际上是如何具有代表性的。
SQL> @st-all
Enter the owner name: sh
Enter the table name: sales
======================================================================================
  TABLE STATISTICS
======================================================================================
Owner         : sh
Table name    : sales
Tablespace    : EXAMPLE
Partitioned   : yes
Last analyzed : 09/03/2010 20:17:03
Sample size   : 918843
Degree        : 1
# Rows        : 918843
# Blocks      : 1769
Empty Blocks  : 0

Avg Space     : 0
Avg Row Length: 29
Monitoring?   : yes
======================================================================================
  PARTITION INFORMATION
======================================================================================
 Part# Partition Name      Sample Size          # Rows        # Blocks 
------ --------------- --------------- --------------- --------------- 
     1 SALES_1995      .                             0               0 
     2 SALES_1996      .                             0               0 
     3 SALES_H1_1997   .                             0               0 
     4 SALES_H2_1997   .                             0               0 
     5 SALES_Q1_1998             43687           43687              90 
...
    28 SALES_Q4_2003   .                             0               0 
 
 Part# Partition Name  Partition Bound                                                
------ --------------- -------------------------------------------------------------
     1 SALES_1995      TO_DATE(' 1996-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ... 
     2 SALES_1996      TO_DATE(' 1997-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ... 
     3 SALES_H1_1997   TO_DATE(' 1997-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ... 
     4 SALES_H2_1997   TO_DATE(' 1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ... 
     5 SALES_Q1_1998   TO_DATE(' 1998-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ... 
...

    28 SALES_Q4_2003   TO_DATE(' 2004-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', ...     
======================================================================================   
  COLUMN STATISTICS
======================================================================================
 Name           Null?  NDV      Density  # Nulls   # Bkts  AvgLen  Lo-Hi Values
======================================================================================
amount_sold     N      3586     .000279  0         1       5       6.4 | 1782.72
channel_id      N      4        .250000  0         1       3       2 | 9
cust_id         N      7059      .000142  0         1       5       2 | 101000
prod_id         N      72       .000001  0         72      4       13 | 148
promo_id        N      4        .000001  0         4       4       33 | 999
quantity_sold   N      1       1.000000  0         1       3       1 | 1
time_id         N      1460     .000685  0         1       8       01/01/1998 00:00:00 | 
                                                                   12/31/2001 00:00:00

======================================================================================
  HISTOGRAM STATISTICS     Note: Only columns with buckets containing > 5% are shown.
======================================================================================
 
PROMO_ID (4 buckets)
1 97%
 
======================================================================================
  INDEX INFORMATION
======================================================================================
 
Index Name                                Dstnct  Lf/Blks Dt/Blks Cluf Unq? Type Part? 
                    BLevel Lf Blks # Rows   Keys  /Key    /Key
------------------ ------- ------- ------ ------ ------- ------- ----- ---- ---- ----- 
SALES_CHANNEL_BIX        1      47     92      4      11      23    92 NO   BITM YES   
SALES_CUST_BIX            1     475  35808   7059       1       5 35808 NO   BITM YES   
SALES_PROD_BIX           1      32   1074     72       1      14  1074 NO   BITM YES   
SALES_PROMO_BIX          1      30     54      4       7      13    54 NO   BITM YES   
SALES_TIME_BIX           1      59   1460   1460       1       1  1460 NO   BITM YES   
 
Index Name                           Pos# Order Column Name
------------------------------ ---------- ----- ------------------------------
sales_channel_bix                       1 ASC   channel_id
sales_cust_bix                          1 ASC   cust_id
 
sales_prod_bix                          1 ASC   prod_id
 
sales_promo_bix                         1 ASC   promo_id
 
sales_time_bix                          1 ASC   time_id
 
======================================================================================    
  PARTITIONED INDEX INFORMATION
======================================================================================    
 
Index: SALES_CHANNEL_BIX
                                           Dst LfBlk DtBlk 
Part# Partition Name  BLevel LfBlks # Rows Keys /Key  /Key CluF Partition Bound
----- --------------- ------ ------ ------ ---- ----  ---- ---- ----------------------
    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        1      2      5    4    1     1    5 TO_DATE('1998-04-01...
...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...
                                                                                  
Index: SALES_CUST_BIX                                                              
                                                                                  
    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        1     28   3203 3203    1     1 3203 TO_DATE('1998-04-01...
...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...
                                                                                  
Index: SALES_PROD_BIX                                                             
    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        1      2     60   60    1     1   60 TO_DATE('1998-04-01...
...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...
                                                                                  
Index: SALES_PROMO_BIX                                                            
                                                                                  
    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        0      1      3    2    1     1    3 TO_DATE('1998-04-01...
...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...
                                                                                  
Index: SALES_TIME_BIX                                                             
                                                                                  
    1 SALES_1995           0      0      0    0    0     0    0 TO_DATE('1996-01-01...
    2 SALES_1996           0      0      0    0    0     0    0 TO_DATE('1997-01-01...
    3 SALES_H1_1997        0      0      0    0    0     0    0 TO_DATE('1997-07-01...
    4 SALES_H2_1997        0      0      0    0    0     0    0 TO_DATE('1998-01-01...
    5 SALES_Q1_1998        1      3     90   90    1     1   90 TO_DATE('1998-04-01...
...
   27 SALES_Q3_2003        0      0      0    0    0     0    0 TO_DATE('2003-10-01...
   28 SALES_Q4_2003        0      0      0    0    0     0    0 TO_DATE('2004-01-01...
    With this information, you can answer almost any question about the data.  It is best if these
statistics are from your production database where the SQL you are writing will be executed.  If your
development database doesn’t have a copy of the production statistics, it’s a good idea to request that the production stats be imported into the development database so that the optimizer is formulating plans based on information that is as close to production as possible.  Even if the data doesn’t match, remember that it’s the statistics that the optimizer uses to determine the plan.
   通过上面的信息,你可以解答几乎任何关于数据的问题。如果这些统计来自于你所写SQL执行的生产数据库那将是最好的。如果开发数据库没有复制生产数据库的统计。最好是请求将生产数据库统计导出到开发数据库中这样优化器就能基于尽可能接近生产数据库的信息来计算生成计划。即使数据不匹配,也要记住优化器使 用统计去决定计划。
    Now that you’ve obtained the statistics, you can use the information to ask, and answer, questions
about what you’d expect the optimizer to do with your SQL.  For example, if you were writing a query
that needed to return all sales data for a specified customer (cust_id), you might want to know how
many rows the optimizer will estimate the query to return.  With the statistics information you have
queried, you could compute the number of rows estimated to be returned by the query to be 130
(918,843 total rows x 1/7,059 distinct values).  You can see that there is an index on cust_id, so the
proper access operation to use to satisfy the query should be the SALES_CUST_BIX index.  When you
execute the query, you can verify this operation is selected by checking the execution plan.
    既然你获得了统计信息,你能用这些信息,思考优化器对你的SQL做什么优化。例如,如果你写一个查询需要返回针对某一顾客(customer)所有销售记 录,你可能想要知道优化器估计查询查询返回多少行记录。有了刚查出的统计,你就可以算出(优化器)估计查询返回的条数是130行(总数918,843条 /distinct值7059)。你可以看到这里有一个索引在cust_id上,所以满足查询的适当访问操作应该(使用)SALES_CUST_BIX索 引。当你执行查询,你就能通过检查执行计划确认这个操作被选择了。
     In Chapter 3, I discussed the index statistic called clustering factor.  This statistic helps the optimizer compute how many blocks of data will be accessed.  Basically, the closer the clustering factor is to the number of blocks in the table, the fewer the estimated number of blocks to be accessed when using the index will be.  The closer the clustering factor is to the number of rows in the table, the greater the estimated number of blocks will be.  The fewer blocks to be accessed, the lower the cost of using that index and the more likely it is that the optimizer will choose that index for the plan.  Therefore, you can check this statistic to determine how favorable the index will appear.  Listing 5-4 shows the clustering factor statistics for the SALES table.
   在第三章,我讨论了索引统计被称为簇因子(clustering factor). 这个统计帮助优化器计算有多少个块将被访问。基本上,簇因子越接近表的块数,估计的访问块数就越小,将使用索引。簇因子越接近表的行数,估计的访问块数就 越大。访问的块越少,使用索引访问的成本就越低,就越有可能优化器将在执行计划中使用索引。因此,你能检查这个统计信息来判断使用索引的可能性有多大。列 表5-4显示了SALES表的簇因子统计信息。
Listing 5-4. Index clustering_factor
SQL> select t.table_name||'.'||i.index_name idx_name,
 2          i.clustering_factor, t.blocks, t.num_rows
 3     from user_indexes i, user_tables t
 4    where i.table_name = t.table_name
 5      and t.table_name = 'SALES'
 6    order by t.table_name, i.index_name;
 
IDX_NAME                 Clustering Factor        # Blocks          # Rows
------------------------ ----------------- --------------- ---------------
SALES.SALES_CHANNEL_BIX                 92            1769          918843
SALES.SALES_CUST_BIX                 35808            1769          918843
SALES.SALES_PROD_BIX                  1074            1769          918843
SALES.SALES_PROMO_BIX                   54            1769          918843
SALES.SALES_TIME_BIX                  1460            1769          918843
 
5 rows selected.

     In this case, the clustering factors for all of the indexes for the SALES table have a low value (i.e.
closer to the number of blocks in the table).  That is a good indication that when the optimizer computes the cost of using these indexes, they will not be weighted too heavily based on the estimated number of blocks they will return if used.
    在这个例子中,SALES表的所有索引的簇因子都是低位值(例如,接近表的块数)。当优化器计算使用这些索引的成本时,这是一个好的指标。如果使用索引,基于估计的它们返回的块数,成本不会评估的很高。
    In addition to using statistics, you can execute actual queries against the tables to get an idea of the data and number of rows that will be accessed or returned from a single table.  Regardless of how
complex a statement is, you can do just what the optimizer would do and break the statement down into single table accesses.  For each table involved, simply execute one or more queries to count and review the data that would be returned using the filter conditions your SQL will use.  As discussed previously, always think “divide and conquer.”  Breaking a statement down into small increments will help you understand how best to put it together in the most efficient way to arrive at the final result.
    除了使用统计,你能通过实际查询表来获得从单表访问或返回的数据和行数的印象。无论多复杂的语句,你可以做的像优化器一样,分解它成一个个单表访问语句。 对于涉及的每张表,简单的执行一个或多个查询去count和检查被你的SQL使用的过滤条件返回的数据。正如之前讨论的,不断的考虑“分而治之”。将一个 语句拆分成小块,将帮助你理解如何用最有效的方式组合它们成为最佳结果。

1
2
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics