Audiogalaxy high-performance MySQL search engine


As I mentioned before, search was one of the most interesting problems I worked on at Audiogalaxy. It was one of the core functions of the site, and somewhere between 50 and 70 million searches were performed every day. At peak times, the search engine needed to handle 1500-2000 searches every second against a MySQL database with about 200 million rows. It was frequently hard to design for more than 10x or 100x our current traffic (and our growth was so dramatic that there wasn’t really ever time to spend more than a few weeks on the problem), so it wasn’t until the 4th major iteration on the cluster design that I really considered the scale problems to be solved.


Here is a diagram of the architecture I ended up with (component explanations are below):

[Figure: Audiogalaxy Search Architecture]

Arrows indicate the direction the request flows for any request/response pairs. What you see here is the result of a few years of evolutionary improvements, not a single top-down design. Generally, things changed when I had a few weeks to focus on search. It was enough time to make improvements but rarely enough time to scrap everything and start over.

 

Satellite Servers (300+ servers, written in C)
The file sharing clients (the satellites) kept connections open to the satellite servers, and these servers were the ultimate authority about the availability of files. Information about availability was used to sort results immediately before they were displayed to the user.

Gateways (4 servers, written in C)
Gateways were necessary whenever a cache server or a web server needed to query anything in the satellite server cluster. Because there were so many satellite servers, it wasn’t feasible to query all of them in a reasonable amount of time. To address this problem, the gateways helped aggregate information and route requests.

For processing searches, the gateways were the source of the “current availability” for every search result. The availability was used to help sort the results — we wanted to show files that were currently available higher in the search results than files that were not. Gateways could respond to these queries without querying any other servers, so it was feasible to query for large numbers of results at once.
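
To give a concrete picture of the kind of batch lookup this implies, here is a minimal sketch of what a gateway availability query could have looked like. The post does not describe the actual wire format, so the struct layout and field names below are purely illustrative.

    /* Hypothetical request/response pair for a batch availability lookup.
     * A cache server sends a list of file IDs; the gateway answers with one
     * availability flag per ID, in the same order, without consulting any
     * other servers. */
    #include <stdint.h>

    struct availability_request {
        uint32_t count;        /* number of file IDs that follow */
        uint32_t file_ids[];   /* IDs of the search results to check */
    };

    struct availability_response {
        uint32_t count;        /* echoed back from the request */
        uint8_t  available[];  /* 1 if a satellite currently shares the file */
    };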

Caches (4 servers, written in C)
I called them caching servers, but they ultimately did most of the heavy lifting for each search in addition to caching the results. They accepted queries and paging offsets from the web servers, cached results from the index servers to disk, queried the databases and gateways for sorting information, and returned a set of results. In earlier versions of the search, results were not cached by a dedicated service, and most of the processing was done in PHP on the web servers. Manipulating large arrays of results in PHP was a huge perf issue, but it was initially nice to be able to tweak the search algorithm in a scripting language without dealing with recompiling and redeploying.

These servers were also used to sort results by relevance. Results were sorted twice. The first sort was by historical availability, and it occurred when a result set (which could have been many thousands of rows) was initially cached to disk. The historical availability was stored in a database and was updated daily by adding the current availability to 1/2 the previous historical availability. The second sort, based on current availability information from the gateways, occurred whenever a result set was accessed from disk. This wasn’t a true sort; it simply made sure that regardless of the historical availability, any currently unavailable song did not show up in the results before a currently available song. This ensured that new songs which could be downloaded immediately got bumped up ahead of songs which used to be popular but no one was currently sharing.
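
A small sketch of those two passes, assuming a result record with my own made-up fields (historical, available): pass one sorts by the decayed historical score, where each daily update is historical = current + historical / 2, and pass two stably moves currently unavailable results behind available ones without otherwise reordering them.

    #include <stdlib.h>
    #include <string.h>

    struct result {
        unsigned id;
        unsigned historical;   /* decayed popularity score, updated daily */
        int      available;    /* 1 if some satellite shares the file right now */
    };

    /* Daily update of the historical availability for one row. */
    static unsigned update_historical(unsigned historical, unsigned current)
    {
        return current + historical / 2;
    }

    static int by_historical_desc(const void *a, const void *b)
    {
        const struct result *ra = a, *rb = b;
        if (rb->historical > ra->historical) return 1;
        if (rb->historical < ra->historical) return -1;
        return 0;
    }

    /* First sort: when the result set is initially written to disk. */
    static void sort_by_historical(struct result *r, size_t n)
    {
        qsort(r, n, sizeof *r, by_historical_desc);
    }

    /* Second "sort": stable partition on every read, available files first. */
    static void bump_available(struct result *r, size_t n)
    {
        struct result *tmp = malloc(n * sizeof *tmp);
        size_t out = 0;
        if (!tmp)
            return;
        for (size_t i = 0; i < n; i++)
            if (r[i].available)
                tmp[out++] = r[i];
        for (size_t i = 0; i < n; i++)
            if (!r[i].available)
                tmp[out++] = r[i];
        memcpy(r, tmp, n * sizeof *tmp);
        free(tmp);
    }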

The cache servers were partitioned such that each unique search had its results cached on a single server. Each search string was hashed into a unique ID, and that number was used to determine which cache server handled the results. To improve the efficiency of the cache layer, I took advantage of the fact that our search engine ignored word ordering and duplicates, so I stripped out duplicates and sorted the words before hashing. Thus, searches for “A A B”, “B A”, or “A B B” all hashed to the same ID and therefore shared the same cached results.
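
Roughly, the normalization looked like the sketch below: tokenize, sort, drop duplicate words, then hash. The post does not say which hash function was used; FNV-1a stands in for it here, and the resulting value would then be taken modulo the number of cache servers.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    static int cmp_str(const void *a, const void *b)
    {
        return strcmp(*(const char * const *)a, *(const char * const *)b);
    }

    uint32_t search_hash(const char *query)
    {
        char buf[1024];
        char *words[64];
        int n = 0;

        strncpy(buf, query, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';

        /* Tokenize on whitespace. */
        for (char *w = strtok(buf, " \t"); w && n < 64; w = strtok(NULL, " \t"))
            words[n++] = w;

        /* Sort so word order no longer matters. */
        qsort(words, n, sizeof words[0], cmp_str);

        /* FNV-1a over the de-duplicated, sorted words. */
        uint32_t h = 2166136261u;
        for (int i = 0; i < n; i++) {
            if (i > 0 && strcmp(words[i], words[i - 1]) == 0)
                continue;                  /* skip duplicate words */
            for (const char *p = words[i]; *p; p++) {
                h ^= (uint8_t)*p;
                h *= 16777619u;
            }
            h ^= ' ';                      /* separator between words */
            h *= 16777619u;
        }
        return h;
    }

    int main(void)
    {
        /* "A A B", "B A", and "A B B" all normalize to the same hash. */
        printf("%08x %08x %08x\n",
               search_hash("A A B"), search_hash("B A"), search_hash("A B B"));
        return 0;
    }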

For efficiency, the cache servers would cache potentially large numbers of results for new searches. As the user paged through search results, each page could be fulfilled by the cache file from the initial batch.

Each server ended up with many thousands of cached results stored on disk. To avoid overwhelming the file system, we had 65K folders in two levels of 256 folders each, named 00 to FF. Each results file was identified by the hash of the query string with the first 4 hex digits controlling which folder it ended up in.
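
For example, building the on-disk path from the hash might have looked like this (the root directory name is made up):

    /* Sketch of mapping a query hash to a cache file path: the first two hex
     * bytes pick the two directory levels (00-FF each), giving 65,536 folders. */
    #include <stdio.h>
    #include <stdint.h>

    void cache_path(uint32_t hash, char *out, size_t outlen)
    {
        unsigned level1 = (hash >> 24) & 0xFF;   /* first two hex digits */
        unsigned level2 = (hash >> 16) & 0xFF;   /* next two hex digits */
        snprintf(out, outlen, "/cache/%02x/%02x/%08x.results",
                 level1, level2, hash);
    }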

Index Servers (4 servers, written in C)
These servers implemented an inverted index for all the searchable content in our MySQL databases. The index had 2 components: an enormous in-memory hashtable of words and disk-based storage of the rowIDs that matched each word. Obviously, index servers processed queries, but they also monitored the database for two types of changes. First, each server looked for new rows in the primary table. Second, it looked for new rows in an Updates table that was used to tell the search engine to re-index existing rows. Because there may have been several hundred million indexed rows, it wasn’t feasible for the search engine to continually spider the whole table. Therefore, I used the Updates table to trigger changes when I deleted or edited a row that was already indexed.
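
The polling logic boils down to a high-water mark on the primary table plus a scan of the Updates table. The sketch below illustrates that shape; the table and column names (Songs, SearchUpdates) and the db_query_ids/index_row helpers are hypothetical stand-ins, not the actual schema.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical helper: run a query and fill `ids` with matching row IDs. */
    size_t db_query_ids(const char *sql, uint64_t *ids, size_t max);
    /* Hypothetical helper: (re)index a single row. */
    void index_row(uint64_t row_id);

    static uint64_t last_indexed_id;   /* high-water mark for the primary table */

    void poll_for_changes(void)
    {
        char sql[256];
        uint64_t ids[1000];
        size_t n;

        /* 1. New rows: anything past the high-water mark gets indexed. */
        snprintf(sql, sizeof sql,
                 "SELECT id FROM Songs WHERE id > %llu ORDER BY id LIMIT 1000",
                 (unsigned long long)last_indexed_id);
        n = db_query_ids(sql, ids, 1000);
        for (size_t i = 0; i < n; i++) {
            index_row(ids[i]);
            last_indexed_id = ids[i];
        }

        /* 2. Edited/deleted rows: the Updates table says what to re-index, so
         *    the engine never has to spider the whole multi-hundred-million-row
         *    table. Processed rows would then be deleted (not shown). */
        n = db_query_ids("SELECT row_id FROM SearchUpdates LIMIT 1000", ids, 1000);
        for (size_t i = 0; i < n; i++)
            index_row(ids[i]);
    }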

To process queries, I used the inverted index algorithm straight out of Managing Gigabytes. Each query was broken into words, and each word was used as a key into the in-memory hashtable. The hashtable record contained the count of how many rows matched that word and an offset to the disk to read the full ID list. The service would then iterate through the words from the smallest number of matching rows to the largest and intersect the word lists. To efficiently intersect the lists, it would walk the smaller list and do binary searches over the next larger one. Each intersection produced a new list that was intersected with the next larger list.
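
A minimal sketch of that intersection step (sorted, duplicate-free ID lists assumed):

    #include <stddef.h>
    #include <stdint.h>

    /* Returns 1 if `id` appears in the sorted array `list` of length `n`. */
    static int contains(const uint32_t *list, size_t n, uint32_t id)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (list[mid] < id)
                lo = mid + 1;
            else if (list[mid] > id)
                hi = mid;
            else
                return 1;
        }
        return 0;
    }

    /* Walks the smaller list and binary-searches the larger one, writing the
     * intersection into `out` and returning its length. `out` needs room for
     * ns IDs; the result stays sorted, ready for the next word's list. */
    size_t intersect(const uint32_t *small, size_t ns,
                     const uint32_t *large, size_t nl,
                     uint32_t *out)
    {
        size_t count = 0;
        for (size_t i = 0; i < ns; i++)
            if (contains(large, nl, small[i]))
                out[count++] = small[i];
        return count;
    }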

The algorithm itself is simple and powerful, but the first tricky part was managing disk IO, which is always the bottleneck for a service like this. I didn’t want a search for “The UncommonWords” to pull all the IDs that used the word “the” off the disk. The second tricky part dealt with managing updates to the index. If the service indexed a new row with the word “the” in it, I didn’t want to risk having to write the entire word list back to disk. So, the service kept a small ID list in memory and combined it with the disk-backed lists as necessary. Periodically, the in-memory lists were combined with the main ones and flushed to disk.
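
The query-time combination is then just a merge of two sorted lists, something like this sketch (names are my own):

    #include <stddef.h>
    #include <stdint.h>

    /* Merges a word's small in-memory ID list (recently indexed rows) with
     * its disk-backed list. Both inputs are assumed sorted and duplicate-free;
     * `out` needs room for nd + nm IDs. Returns the merged length. */
    size_t merge_lists(const uint32_t *disk, size_t nd,
                       const uint32_t *mem, size_t nm,
                       uint32_t *out)
    {
        size_t i = 0, j = 0, n = 0;
        while (i < nd && j < nm) {
            if (disk[i] < mem[j])      out[n++] = disk[i++];
            else if (disk[i] > mem[j]) out[n++] = mem[j++];
            else { out[n++] = disk[i++]; j++; }   /* same ID in both lists */
        }
        while (i < nd) out[n++] = disk[i++];
        while (j < nm) out[n++] = mem[j++];
        return n;
    }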

We had several servers running this process, and each one kept a duplicate index. Fortunately, we never had to deal with sharding the index across multiple servers or a hash table of words that wouldn’t fit in memory. Speaking of which, I’d love to read some papers on how internet-scale search engines actually do that. Does anyone have any recommendations?

Web Servers (Apache, running PHP scripts)
After everything was designed and implemented, the web servers didn’t have much work to do in the final version of the search. They established a TCP connection to the appropriate cache server, sent in the search string and how many results they wanted along with a paging offset, and then read the matching row IDs and popularity back. Then, they looked up the matching rows in the MySQL database to get the original strings and rendered the page.

Communication Protocols
All of the communication between my services used a custom protocol, not SOAP or anything HTTP based. Most of the bandwidth between the services was consumed sending long lists of integer IDs back and forth, and a custom protocol made that easy to optimize. I think it also simplified using long-lived connections between the cache and the index layer.
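
The actual wire format isn’t described, but a protocol optimized for shipping ID lists usually comes down to simple length-prefixed framing, along these lines:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <arpa/inet.h>   /* htonl */

    /* Serializes a list of row IDs into `buf`; returns bytes written, or 0 if
     * the buffer is too small. Layout: [u32 count][u32 id]..., all integers in
     * network byte order. The receiving side reverses this with ntohl. */
    size_t encode_id_list(const uint32_t *ids, uint32_t count,
                          uint8_t *buf, size_t buflen)
    {
        size_t need = 4 + (size_t)count * 4;
        if (buflen < need)
            return 0;

        uint32_t n = htonl(count);
        memcpy(buf, &n, 4);
        for (uint32_t i = 0; i < count; i++) {
            uint32_t v = htonl(ids[i]);
            memcpy(buf + 4 + i * 4, &v, 4);
        }
        return need;
    }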

The Life Of A Query
So, tying it all together, here is a run through of what happened when a user ran a search:

  1. A user hits “Search” and an HTTP POST is sent to one of the load-balanced web servers.
  2. The web server strips the duplicates from the search term, sorts the words, hashes the result and uses that value to pick a cache server. It establishes a TCP connection to the cache server and sends the query, blocking while it waits for the results.
  3. The cache server accepts the request from the web server. If a file of search results does not exist for the query, it randomly picks one of the index servers and uses a pooled connection to send it a request.
  4. The index server receives the request and builds a list of row IDs using the inverted index algorithm described above. It sends the list of IDs back to the cache server.
  5. Assuming it didn’t have the results cached, the cache server reads the IDs back from the index server. It then queries a database to get the historical availability. Using that, it sorts the results and writes them out to disk.
  6. Using the sorted results (either from disk or from memory if it just got them from the index server), the cache server sends a list of IDs to a random Gateway server, which responds with the current availability of each file.
  7. The cache layer adjusts the order of the results based on the current availability. Then, it indexes into the results based on the requested offset from the web server and writes a few dozen IDs back to the web server.
  8. The web server process unblocks and reads the matching IDs back from the cache server. It hits the database to get anything necessary to render the page, formats the results, and gzips them out to the web browser.
  9. The user gets the response and smiles (hopefully).

So, in a few hundred milliseconds a simple search could easily touch 4 servers plus half a dozen different databases.

My favorite part of all of this was running tail -f against the logs on an index server. We had an excellent hit rate for our caching layer, so the only searches I really saw there were serious misspellings, and usually humorous ones at that.
