[Repost] Tokyo Cabinet Observations

Read more: http://parand.com/say/index.php/2009/04/09/tokyo-cabinet-observations/



I'm using Tokyo Cabinet with Python tc for a decent-sized amount of data (~19G in a single hash table) on OS X. A few observations and oddities:

    * Writes slow down significantly as the database size grows. I'm writing 97 roughly equal-sized batches to the tch table. The first batch takes ~40 seconds, and processing time seems to increase fairly linearly, with the last batch taking ~14 minutes. Not sure why this would be the case, but it's discouraging. I'll probably write a simple partitioning scheme to split the data into multiple databases and keep each one small, but it seems like this should be handled out of the box for me.
    * [Update] I implemented a simple partitioning scheme, and sure enough it makes a big difference. Apparently keeping the file size small (where small is < 500G) is important. Surprising: why doesn't TC implement partitioning itself if it's susceptible to performance problems at larger file sizes? Is this a Python tc issue or a Tokyo Cabinet issue? (A sketch of one such scheme follows this list.)
    * [Also] Seems I can only open 53-54 tc.HDB()’s before I get an ‘mmap error’, limiting how much I can partition.
    * Records that have already been read from the tch come back much faster on the second access (roughly an order of magnitude faster). I suspect this is the disk cache at work, but if anyone has more info on this please enlighten me.
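
A minimal partitioning sketch, not the author's actual code: it assumes the tc binding exposes HDB.open(), put() and get() plus the HDBOWRITER/HDBOCREAT flags mirroring the C API, and the class and file names are made up for illustration. Each key is hashed to one of N shard files so that no single .tch file grows large:

    import hashlib
    import tc

    class PartitionedHDB(object):
        def __init__(self, path_prefix, num_shards=16):
            # One small .tch file per shard instead of one huge database.
            self.shards = []
            for i in range(num_shards):
                db = tc.HDB()
                db.open('%s-%02d.tch' % (path_prefix, i),
                        tc.HDBOWRITER | tc.HDBOCREAT)
                self.shards.append(db)

        def _shard_for(self, key):
            # Hash the key so records spread evenly across the shard files.
            digest = hashlib.md5(key).hexdigest()
            return self.shards[int(digest, 16) % len(self.shards)]

        def put(self, key, value):
            self._shard_for(key).put(key, value)

        def get(self, key):
            return self._shard_for(key).get(key)

    db = PartitionedHDB('/tmp/mydata', num_shards=16)
    db.put('user:42', 'some value')
    print(db.get('user:42'))

Note that the mmap limit mentioned above (~53-54 open tc.HDB() handles) caps how many shards a single process can keep open.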

Another somewhat surprising aspect: using the tc library you're essentially embedding Tokyo Cabinet in your app; I had assumed access would be network-based, but it's not. You can get network access either via the memcached protocol or via pytyrant, as sketched below.
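
For the network route, a minimal sketch, assuming a Tokyo Tyrant server (ttserver) is already running on its default port 1978; the dict-style interface shown is the one the pytyrant package documents, and the key/value names are illustrative only:

    import pytyrant

    # Connect to a running ttserver instance (default port 1978); the data
    # now lives in the remote server process rather than embedded in the app.
    t = pytyrant.PyTyrant.open('127.0.0.1', 1978)
    t['user:42'] = 'some value'
    print(t['user:42'])

    # The same server also speaks the memcached protocol, so a standard
    # memcached client pointed at the same host/port works as well.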


Comments
#5 iunknown 2009-06-01
http://tokyocabinet.sourceforge.net/spex-en.html
improves robustness: the database file is not corrupted even in catastrophic situations.
#4 iunknown 2009-06-01
http://torum.net/2009/05/tokyo-cabinet-protected-database-iteration/

If all goes well, the counter variable will be set to the number of records in the database. This function is slightly more complex than using tchdbiternext() but you are guaranteed to iterate atomically, which is pretty important for a table scanner.
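
As a rough Python rendering of the idea in that post (a sketch, assuming the binding exposes tranbegin()/tranabort() and iterinit()/iternext() mirroring the C functions tchdbtranbegin, tchdbiterinit and tchdbiternext; how the iterator signals exhaustion varies by binding):

    import tc

    db = tc.HDB()
    db.open('/tmp/mydata.tch', tc.HDBOWRITER)

    counter = 0
    db.tranbegin()                  # take the lock so the scan sees a consistent view
    try:
        db.iterinit()
        while True:
            try:
                key = db.iternext()
            except Exception:       # some bindings raise when the iterator is exhausted
                break
            if key is None:         # others return None instead
                break
            counter += 1
    finally:
        db.tranabort()              # read-only scan, so abort rather than commit

    print('records: %d' % counter)
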
#3 iunknown 2009-06-01
http://www.dmo.ca/blog/benchmarking-hash-databases-on-large-data/

As can be seen from those results, CDB kills all comers in this simulation of our normal workload. Perhaps there are ways to tune Tokyo Cabinet to perform better on large data sets?
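
On the tuning question, one knob worth trying (a sketch, assuming the Python binding exposes tune() with the same arguments as the C function tchdbtune): the Tokyo Cabinet docs suggest setting the bucket number to roughly 0.5-4x the expected record count, and it has to be set before the database file is created:

    import tc

    db = tc.HDB()
    # Expecting ~10 million records: oversize the bucket array up front.
    # apow/fpow of -1 keep the library defaults; HDBTLARGE allows files > 2GB.
    db.tune(20000000, -1, -1, tc.HDBTLARGE)
    db.open('/tmp/bigdata.tch', tc.HDBOWRITER | tc.HDBOCREAT)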
