`
hideto
  • 浏览: 2651883 次
  • 性别: Icon_minigender_1
  • 来自: 北京
社区版块
存档分类
最新评论

HPM Note4, Schema Optimization and Indexing

阅读更多
Choosing Optimal Data Types
Smaller is usually better
Simple is good
Avoid NULL if possible

Indexing Basics
Indexes are implemented in the storage engine layer, not the server layer

B-Tree indexes is the default index type of MySQL
B-Trees store the indexed columns in order, they're usefull for searching for ranges of data
B-Tree indexes work well for lookups by the full key value, a key range, or a key prefix
B-Tree limitations:
1. They are not usefull if the lookup does not start from the leftmost side of the indexed columns
2. You can't skip columns in the index
3. The storage engine can't optimize accesses with any columns to the right of the first range condition

Hash indexes
Hash index is built on a hash table and is useful only for exact lookups that use every column in the index
For each row, the storage engine computes a hash code of the indexed columns, which is a small value that will probably differ from the hash codes computed for other rows with different key values
It stores the hash codes in the index and stores a pointer to each row in a hash table
Hash indexes are the default index type for Memory tables in MySQL
Hash index limitations:
1. Index contains only hash code and row pointers, MySQL can't use the values in the index to avoid reading the rows
2. MySQL can't use has indexes for sorting bc they don't store rows in sorted order
3. Hash indexes don't support partial key matching bc they compute the hash from the entire indexed value
4. Hash indexes support only equality comparisons that use the =, IN(), and <=> operators, they don't support range queries
5. Accessing data in a hash index is very quick, unless there are many collisions(When there are multiple values with the same hash, the storage engine must follow each row pointer in the linked list and compare their values to the lookup value to find the right rows)
6. Some index maintenance operations can be slow if there are many hash collisions
Hash index are useful for lookup tables in a classic "star" schema data-warehousing
The InnoDB storage engine has a special feature called adaptive hash indexes
When InndoDB notices that some index values are being accessed very frequently, it builds a hash index for them in memory on top of B-Tree index
This process is completely automatic, and you can't control or configure it
You can build your own hash index if storage engine does not support it

Spatial(R-Tree) indexes
MyISAM supports spatial indexes, which you can use with geospatial types such as GEOMETRY

Full-text indexes
FULLTEXT is a special type of index for MyISAM tables
Full-text indexes are for MATCH AGAINST operations, not ordinary WHERE clause operations

Indexing Strategies for High performance
Isolate the Column
If you don't isolate the indexed columns in a query, MySQL generally can't use indexes on columns unless the columns are isolated in the query
// these queries can't use index
SELECT actor_id FROM sakila.actor WHERE actor_id + 1 = 5;
SELECT ... WHERE TO_DAYS(CURRENT_DATE) - TO_DAYS(date_col) <= 10;


Prefix Inndexes and Index Selectivity
For long character columns such as BLOB or TEXT, we can index the prefix and make index selectivity close to full column's selectivity

Clustered Indexes
From wikipedia:
Index architectures can be classified as clustered or non-clustered.
Unclustered
An Unclustered index is structured in a way that doesn't correspond to the order of the actual data records. It resembles the words index at the back of a book. For this reason, it will perform worse than clustered indexes on ranged searches where the result set is large, since each result could cost an additional I/O-operation to get the actual data record. One can have any number of these kinds of indexes, since all that is required is the space to store the index itself -- one does not copy the actual data to create a new index.
Clustered
Clustering alters the data block into a certain distinct order to match the index, hence it is also an operation on the data storage blocks as well as on the index. An address book ordered by first name resembles a clustered index in its structure and purpose. The exact operation of database systems vary, but because storing data is very redundant the row data can only be stored in one order. Therefore, only one clustered index can be created on a given database table. Clustered indexes can greatly increase overall speed of retrieval, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index, or when a range of items are selected.
Since the physical records are in this sort order on disk, the next row item in the sequence is immediately before or after the last one, and so fewer data block reads are required. The primary feature of a clustered index is therefore the ordering of the physical data rows in accordance with the index blocks that point to them. Some databases separate the data and index blocks into separate files, others put two completely different data blocks within the same physical file(s).Clusturing is a great example of Indexing and is far more sophisticated than an unclustered index.

Now only solidDB and InnoDB support clustered indexes
InnoDB clusters the data by the primary key
If you don't define a primary key, InnoDB will try to use a unique nonnullable index instead
If there's no such index, InnoDB will define a hidden primary key for you and the cluster on that

Clustered indexes advantages:
1. You can keep related data close together
2. Data access is fast
3. Queries that use covering indexes can use the primary key values contained at the leaf node

Clustered indexes disadvantages:
1. Clustering gives the largest improvement for I/O-bound workfloads. If the data fits in memory the roder in which it's accessed doesn't really matter, so clustering doesn't give much benefit
2. Insert speeds depend heavily on insertion order
3. Updating the clustered index columns is expensive, because it forces InnoDB to move each updated row to a new location
4. Tables built upon clustered indexes are subject to page splits when new rows are inserted, or when a row's primary key is updated such that the row must be moved
5. Clustered tables can be slower for full table scans
6. Secondary(nonclustered) indexes can be larger than you might expect, because their leaf nodes contain the primary key columns of the referenced rows
7. Secondary index accesses require two index lookups instead of one. To find a row from a secondary index, the storage engine first finds the leaf node in the secondary index and then uses the primary key values stores there to navigate the primary key and find the row

An index that contains or covers all the data needed to satisfy a query is called a covering index

An Indexing Case Study
You should put range column to the end of index, because MySQL uses only the leftmost prefix, up to and including the first condition that specifies a range of values
Avoid multiple range conditions
The query can be slow if the user interface is paginated and someone requests a page that's not near the beginning
Denormalizing, precomputing, and caching are likely to be the only strategies that work for queries like this
An even better strategy is to limit the number of pages you let the user view
This is unlikely to impact the user's experience, because no one really cares about the 10,000th page of search results

Index and Table Maintenance
Finding and Repairing Table Corruption
CHECK TABLE
REPAIR TABLE
ALTER TABLE innodb_tbl ENGINE=INNODB;
myisamchk
backup and recovery

Updating Index Statistics
ANALYZE TABLE
SHOW INDEX FROM
INFORMATION_SCHEMA.STATICTICS

Reducing Index and Data Fragmentation
OPTIMIZE TABLE
dump and reload the data


Normalization and Denormalization
Database Normalization

Advantages of Normalized Schema:
1, Normalized upodates are usually faster than denormalized updates
2, When the data is well normalized, there's little or no duplicated data, so there's less data to change
3, Normalized tables are usually smaller, so they fit better in memory and perform better
4, The lack of redundant data means there's less need for DISTINCT or GROUP BY queries when retrieving lists of values

Drawbacks of Normalized Schema:
Any non-trivial query on a well-normalized schema will probably require at least one join, and perhaps several
This is not only expensive, but it can make some indexing strategies impossible

Advantages of Normalized Schema:
1, Everyhing is in the same table, which avoids joins. This can be much faster than a join, because it avoids random I/O
2, A single table can allow more efficient indexing strategies

In the real world, you often need to mix the approaches, possibly using a partially normalized schema, cache tables, and other techniques
The most common way to denormalized data is to duplicate or cache selected columns from one table in anotehr table, you can use triggers to udpate the cached values which makes the implementation easier
You'll often need extra idnexes, redundant fields, or even cache and summary tables to speed up read queries

Speeding Up ALTER TABLE
MySQL's ALTER TABLE performance can become a problem with very large tables, MySQL performs most alterations by making an empty table with the desired new structure, inserting all the data from the old table into the new one, and deleting the old table
ALTER TABLE sakila.film MODIFY COLUMN rental_duration TINYINT(3) NOT NULL DEFAULT 5; // slow
ALTER TABLE sakila.film ALTER COLUMN rental_duration SET DEFAULT 5; // fast. This statement modifies the .frm file and leaves the table alone

Modifying Only the .frm file to convince MySQL not to rebuild the table
CREATE TABLE sakila.film_new LIKE sakila.film
ALTER TABLE sakila.film_new MODIFY COLUMN rating ENUM('G', 'PG', 'PG-13', 'R', 'NC-17', 'PG-14') DEFAULT 'G'
FLUSH TABLES WITH READ LOCK;
// swap the .frm files of table film and film_new
mv film.frm film_tmp.frm
mv film_new.frm film.frm
mv film_tmp.frm film_new.frm
UNLOCK TABLES
DROP TABLE sakila.film_new


Builing MyISAM Indexes Quickly
The usual trick for loading MyISAM tables efficiently is to disable keys, load the data, and reenable the keys:
ALTER TABLE test.load_data DISABLE KEYS
-- load the data
ALTER TABLE test.load_data ENABLE KEYS

This only works for nonunique indexes.
For unique indexes, the faster way:
1, create a table of the desied structure, without any indexes
2, load the data into the table to build the .MYD file
3, create another empty table with the desired structure, including indexes, this will create .frm and .MYI files
4, FLush the tables with a read lock
5, Rename the second tables's .frm and .MYI files, so MySQL uses them for the first table
6, Release the read lock
7, Use REPAIR TABLE to build the table's indexes

Notes on Storage Engines
MyISAM
Table locks
MyISAM tables have table-level locks, be careful this doesn't become a bottleneck
No automated data recovery
If the MySQL server crashes or power goes down, you should check and possibly repair your MyISAM tables before using them. If you have large tables, this could take hours
No transactions
MyISAM table don't support transactions
Only indexes are cached in memory
MyISAM caches only the index inside the MySQL process, in the key buffer. The operating system caches the table's data
Compact storage
Rows are stored jam-packed one after anotehr, so you get a small disk footprint and fast full table scans for on-disk data

Memory
Table locks
Memory tables have table locks but isn't usually a problem because queries on Memory tables are normally fast
No dynamic rows
Memory tables don't support dynamic(variable-length) rows, so they don't support BLOB and TEXT fields at all. Even a VARCHAR(5000) turns into a CHAR(5000) -- a huge memory waste if most values are small
Hash indexes are the default index type
No index statistics
You may get bad execution palns for some complex queries
Content is lost on restart
Memory tables don't persist any data to disk, so the data is lost when the server restarts, even though the table's definitions remain

InnoDB
Transactional
InnoDB supports transactions adn four transaction isolation levels
Foreign keys
As of MySQL 5.0, InnoDB is the only storage engine that supports foreign keys
Row-level locks
Locks are set at the row level, with no escalation and nonblocking selects--standard selects don't set any locks at all, which gives very good concurrency
Multiversioning
InnoDb uses multiversion concurrecncy control, so by default your selects may read stale data
Clustering by primary key
All InnoDB tables are clustered by the primary key, which you can use to your advantage in schema design
All indexes contain the primary key columns
Indexes refer to the rows by the primary key, so if you don't keep your primary key short, the indexes will grow very large
Optimized caching
InnoDB caches both data and memory in the buffer pool, it also automatically builds hash indexes to speed up row retrieval
Unpacked indexes
Indexes are not packed with prefix compression, so they can be much larger than for MyISAM tables
Slow data load
As of MySQL 5.0, InnoDb does not specially optimize data load operations. It builds indexes a row at a time, instead of building them by sorting. This may result in significantly slower data loads
Blocking AUTO_INCREMENT
In versions earlier than MySQL 5.1, InnoDB uses a table-level lock to generate each new AUTO_INCREMENT value
No cached COUNT(*) value
Unlick MyISAM or Memory tables, InnoDB tables don't store the number of rows in the table, which means COUNT(*) queries without a WHERE clause can't be optimized away and require full table or index scans
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics