`

关系数据库之外的选择

阅读更多

Perhaps you’re considering using a dedicated key-value or document store instead of a traditional relational database. Reasons for this might include:

  1. You’re suffering from Cloud-computing Mania.
  2. You need an excuse to ‘get your Erlang on’
  3. You heard CouchDB was cool.
  4. You hate MySQL, and although PostgreSQL is much better, it still doesn’t have decent replication. There’s no chance you’re buying Oracle licenses.
  5. Your data is stored and retrieved mainly by primary key, without complex joins.
  6. You have a non-trivial amount of data, and the thought of managing lots of RDBMS shards and replication failure scenarios gives you the fear.

Whatever your reasons, there are a lot of options to chose from. At Last.fm we do a lot of batch computation in Hadoop, then dump it out to other machines where it’s indexed and served up over HTTP and Thrift as an internal service (stuff like ‘most popular songs in London, UK this week’ etc). Presently we’re using a home-grown index format which points into large files containing lots of data spanning many keys, similar to the Haystack approach mentioned in this article about Facebook photo storage . It works, but rather than build our own replication and partitioning system on top of this, we are looking to potentially replace it with a distributed, resilient key-value store for reasons 4, 5 and 6 above.

This article represents my notes and research to date on distributed key-value stores (and some other stuff) that might be suitable as RDBMS replacements under the right conditions. I’m expecting to try some of these out and investigate further in the coming months.

Glossary and Background Reading

The Shortlist

Here is a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren’t suitable for low-latency data serving, but are interesting none-the-less.

#matrix td{ font-size:90%; vertical-align:top; padding: 3px; } #matrix tr { background: #f0f0f0; } #matrix tr.odd { background: #ddd; } #matrix td.bigger {font-size:100%;}
Name Language Fault-tolerance Persistence Client Protocol Data model Docs Community
Project Voldemort Java partitioned, replicated, read-repair Pluggable: BerkleyDB, Mysql Java API Structured / blob / text A Linkedin, no
Ringo Erlang partitioned, replicated, immutable Custom on-disk (append only log) HTTP blob B Nokia, no
Scalaris Erlang partitioned, replicated, paxos In-memory only Erlang, Java, HTTP blob B OnScale, no
Kai Erlang partitioned, replicated? On-disk Dets file Memcached blob C no
Dynomite Erlang partitioned, replicated Pluggable: couch, dets Custom ascii, Thrift blob D+ Powerset, no
MemcacheDB C replication BerkleyDB Memcached blob B some
ThruDB C++ Replication Pluggable: BerkleyDB, Custom, Mysql, S3 Thrift Document oriented C+ Third rail, unsure
CouchDB Erlang Replication, partitioning? Custom on-disk HTTP, json Document oriented (json) A Apache, yes
Cassandra Java Replication, partitioning Custom on-disk Thrift Bigtable meets Dynamo F Facebook, no
HBase Java Replication, partitioning Custom on-disk Custom API, Thrift, Rest Bigtable A Apache, yes
Hypertable C++ Replication, partitioning Custom on-disk Thrift, other Bigtable A Zvents, Baidu, yes

 

Why 5 of these aren’t suitable

What I’m really looking for is a low latency, replicated, distributed key-value store. Something that scales well as you feed it more machines, and doesn’t require much setup or maintenance - it should just work. The API should be that of a simple hashtable: set(key, val), get(key), delete(key). This would dispense with the hassle of managing a sharded / replicated database setup, and hopefully be capable of serving up data by primary key efficiently.

Five of the projects on the list are far from being simple key-value stores, and as such don’t meet the requirements - but they are definitely worth a mention.

1) We’re already heavy users of Hadoop, and have been experimenting with Hbase for a while. It’s much more than a KV store, but latency is too great to serve data to the website. We will probably use Hbase internally for other stuff though - we already have stacks of data in HDFS.

2) Hypertable provides a similar feature set to Hbase (both are inspired by Google’s Bigtable). They recently announced a new sponsor, Baidu - the biggest Chinese search engine. Definitely one to watch, but doesn’t fit the low-latency KV store bill either.

3) Cassandra sounded very promising when the source was released by Facebook last year. They use it for inbox search. It’s Bigtable-esque, but uses a DHT so doesn’t need a central server (one of the Cassandra developers previously worked at Amazon on Dynamo). Unfortunately it’s languished in relative obscurity since release, because Facebook never really seemed interested in it as an open-source project. From what I can tell there isn’t much in the way of documentation or a community around the project at present.

4) CouchDB is an interesting one - it’s a “distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. Data is stored in ‘documents’, which are essentially key-value maps themselves, using the data types you see in JSON. Read the CouchDB Technical Overview if you are curious how the web’s trendiest document database works under the hood. This article on the Rules of Database App Aging goes some way to explaining why document-oriented databases make sense. CouchDB can do full text indexing of your documents, and lets you express views over your data in Javascript. I could imagine using CouchDB to store lots of data on users: name, age, sex, address, IM name and lots of other fields, many of which could be null, and each site update adds or changes the available fields. In situations like that it quickly gets unwieldly adding and changing columns in a database, and updating versions of your application code to match. Although many people are using CouchDB in production, their FAQ points out they may still make backwards-incompatible changes to the storage format and API before version 1.0.

5) ThruDB is a document storage and indexing system made up for four components: a document storage service, indexing service, message queue and proxy. It uses Thrift for communication, and has a pluggable storage subsystem, including an Amazon S3 option. It’s designed to scale well horizontally, and might be a better option that CouchDB if you are running on EC2. I’ve heard a lot more about CouchDB than Thrudb recently, but it’s definitely worth a look if you need a document database. It’s not suitable for our needs for the same reasons as CouchDB.

Distributed key-value stores

The rest are much closer to being ’simple’ key-value stores with low enough latency to be used for serving data used to build dynamic pages. Latency will be dependent on the environment, and whether or not the dataset fits in memory. If it does, I’d expect sub-10ms response time, and if not, it all depends on how much money you spent on spinning rust.

MemcacheDB is essentially just memcached that saves stuff to disk using a Berkeley database. As useful as this may be for some situations, it doesn’t deal with replication and partitioning (sharding), so it would still require a lot of work to make it scale horizontally and be tolerant of machine failure. Other memcached derivatives such as repcached go some way to addressing this by giving you the ability to replicate entire memcache servers (async master-slave setup), but without partitioning it’s still going to be a pain to manage.

Project Voldemort looks awesome . Go and read the rather splendid website , which explains how it works, and includes pretty diagrams and a good description of how consistent hashing is used in the Design section. (If consistent hashing butters your muffin, check out libketama - a consistent hashing library and the Erlang libketama driver ). Project-Voldemort handles replication and partitioning of data, and appears to be well written and designed. It’s reassuring to read in the docs how easy it is to swap out and mock different components for testing. It’s non-trivial to add nodes to a running cluster, but according to the mailing-list this is being worked on. It sounds like this would fit the bill if we ran it with a Java load-balancer service (see their Physical Architecture Options diagram) that exposed a Thrift API so all our non-Java clients could use it.

Scalaris is probably the most face-meltingly awesome thing you could build in Erlang. CouchDB, Ejabberd and RabbitMQ are cool, but Scalaris packs by far the most impressive collection of sexy technologies. Scalaris is a key-value store - it uses a modified version of the Chord algorithm to form a DHT, and stores the keys in lexicographical order, so range queries are possible. Although I didn’t see this explicitly mentioned, this should open up all sorts of interesting options for batch processing - map-reduce for example. On top of the DHT they use an improved version of Paxos to guarantee ACID properties when dealing with multiple concurrent transactions. So it’s a key-value store, but it can guarantee the ACID properties and do proper distributed transactions over multiple keys.

Oh, and to demonstrate how you can scale a webservice based on such a system, the Scalaris folk implemented their own version of Wikipedia on Scalaris, loaded in the Wikipedia data, and benchmarked their setup to prove it can do more transactions/sec on equal hardware than the classic PHP/MySQL combo that Wikipedia use. Yikes.

From what I can tell, Scalaris is only memory-resident at the moment and doesn’t persist data to disk. This makes it entirely impractical to actually run a service like Wikipedia on Scalaris for real - but it sounds like they tackled the hard problems first, and persisting to disk should be a walk in the park after you rolled your own version of Chord and made Paxos your bitch. Take a look at this presentation about Scalaris from the Erlang Exchange conference: Scalaris presentation video .

The reminaing projects, Dynomite , Ringo and Kai are all, more or less, trying to be Dynamo. Of the three, Ringo looks to be the most specialist - it makes a distinction between small (less than 4KB) and medium-size data items (<100MB). Medium sized items are stored in individual files, whereas small items are all stored in an append-log, the index of which is read into memory at startup. From what I can tell, Ringo can be used in conjunction with the Erlang map-reduce framework Nokia are working on called Disco .

I didn’t find out much about Kai other than it’s rather new, and some mentions in Japanese. You can chose either Erlang ets or dets as the storage system (memory or disk, respectively), and it uses the memcached protocol, so it will already have client libraries in many languages.

Dynomite doesn’t have great documentation, but it seems to be more capable than Kai, and is under active development. It has pluggable backends including the storage mechanism from CouchDB, so the 2GB file limit in dets won’t be an issue. Also I heard that Powerset are using it, so that’s encouraging.

Summary

Scalaris is fascinating, and I hope I can find the time to experiment more with it, but it needs to save stuff to disk before it’d be useful for the kind of things we might use it for at Last.fm.

I’m keeping an eye on Dynomite - hopefully more information will surface about what Powerset are doing with it, and how it performs at a large scale.

Based on my research so far, Project-Voldemort looks like the most suitable for our needs. I’d love to hear more about how it’s used at LinkedIn, and how many nodes they are running it on.

What else is there?

Here are some other related projects:

If you know of anything I’ve missed off the list, or have any feedback/suggestions, please post a comment. I’m especially interested in hearing about people who’ve tested or are using KV-stores in lieu of relational databases.

UPDATE 1: Corrected table: memcachedb does replication, as per BerkeleyDB.

分享到:
评论

相关推荐

    CHINER元数建模,一款丰富数据库生态,独立于具体数据库之外的,数据库关系模型设计平台.zip

    它介于关系数据库和非关系数据库之间,被认为是非关系数据库当中功能最丰富,最像关系数据库的产品。 2、mongoDB的基本概念 (1)数据库: 数据库和传统的关系型数据库差不多的概念,每个数据库含有多个集合,每...

    一款丰富数据库生态,独立于具体数据库之外的,数据库关系模型设计平台

    CHINER元数建模,一款丰富数据库生态,独立于具体数据库之外的,数据库关系模型设计平台,PDManer-v4已完全承接CHINER所有功能,并增加更多更多实用功能

    元数建模的Java功能部分,元数建模是一款丰富数据库生态,独立于具体数据库之外的,数据库关系模型设计平台

    元数建模的Java功能部分,元数建模是一款丰富数据库生态,独立于具体数据库之外的,数据库关系模型设计平台。

    CHINER元数建模,一款丰富数据库生态,独立于具体数据库之外的,数据库关系模型设计平台,.zip

    数据库课程设计 1.概述 学生管理是一个学校必不可少的部分,随着计算机和计算机知识的普及,学生管理系统得到了更大的发展空间,通过对学生管理系统的开发,可以提高校务人员的工作效率。 随着科学技术的不断提高...

    VB数据库详解.docx

    其中关系数据库的理论发展最为完备,因此到目前为止关系数据库的应用最为广泛。 Visual Basic默认的数据库为微软的Access数据库,可在Visual Basic中利用数据库管理器直接创建,数据库文件的扩展名为.MDB。除此之外...

    关于计算机数据库系统设计方案.doc

    关系表中的属性 值包含对象指针,对象数据的操作在关系数据库之外进行。把面向对象数据模型(ODM)和 关系数据模型(RDM)结合起来,对荚系数据库管理系统进行扩充,但对象查询功能受到一 定的限制。 2.2 把面向对象接口...

    各层次数据库加密综述

    数据库加密是利用现有的数据库和加密技术,来研究如何对数据库中的数据加、解密,从而提高数据库系统的安全。数据库加密可以在OS、DBMS内层、DBMS外层上实现。...本文就是基于关系DBMS,介绍一种数据库加密的实现方法。

    GIS数据库答案.doc

    X41614027 余云鹏 一、什么是空间数据库,具有什么特点? 答:1、空间数据库是某一区域内关于一定地理要素特征的数据集合,是地理信息系统... 缺点:(1)需要同时启动图形文件系统和关系数据库系统,甚至两个系统来回

    pg数据库手册

    pg数据库文档手册,PostgreSQL 是一个自由的对象-关系数据库服务器(数据库管理系统),它在灵活的 BSD-风格许可证下发行。它提供了相对其他开放源代码数据库系统(比如 MySQL 和 Firebird),和专有系统(比如 Oracle、...

    数据库物理设计(1).docx

    除此之外,还需要知道每个事务在各关系上运行的频率,某些事务可能具有严格的性能要求。例如,某个事务必须在20秒内结束。这种时间约束对于存取方法的选择有重大的影响。需要了解每个事务的时间约束。 值得注意的是...

    VFP数据库系统Visual-FoxPro数据库和表的高级应用.pdf

    也可能您想打开多个数据库,从而能使用应 用程序数据库之外的另一数据库中的存储信息。 【方法】: 在"项目管理器"中,选定一个数据库,然后选择"修改"按 钮或"打开"按钮。 使用 OPEN DATABASE命令。 打开新的数据库...

    非关系型缓存数据库redis

    Redis,英文全称是Remote Dictionary Server(远程字典服务),是一个开源的使用ANSI C语言编写、支持网络、可基于内存亦可持久化的日志型...除此之外,Redis支持事务、持久化、LUA 脚本、LRU 驱动事件、多种集群方案。

    高级数据库Advanced Database System

    课程内容包括:分布式数据库系统体系结构,分布式数据库设计,分布查询处理与优化,面向对象数据模型、继承,多态,对象关系数据库,事务类型,事务处理Monitor、2PC提交协议,可串性理论,并发控制方法分类,分布式...

    微软亚洲研究院开源的图数据库GraphView.zip

    关系数据库的用户对此语言会非常熟悉。 · 索引。GraphView 的用户可以建立索引来提升查询和操作效率。所有 SQL Server 和 Azure SQL Database 所支持的索引都可以用在图数据中。 · 事务处理。GraphView 提供了...

    计算机数据库原理附带考试答案

    请简述关系数据库管理系统RDBMS的分层结构。 第一层是应用层,位于RDBMS之外; 第二层是SQL语言翻译处理层。它处理的对象是数据库语言; 第三层是数据存取层。该层处理的对象是数据表的单行; 第四层是数据存储层。...

    数据库大作业--超市管理系统.docx

    对于超市来说,有很多信息是具有价值的,比如客源、服务人员以及管理层人员,除此之外,还应该保存货物的相关信息,因此,这个系统也是从三方面来展开的。超市的顾客可以通过系统得知商品的价格等信息,从而方便进行...

    CpDB:关系数据库模式和细菌基因组注释工具-开源

    该软件使我们可以在Postgres中创建一个关系数据库来托管完整的细菌基因组。 除了数据库之外,还有一些软件工具,例如解析器,可以将EMBL或GBK文件转换为CpDB关系模式。 一旦进入CpDB,就可以使用SQL从细菌基因组中...

    数据库系统概念复习总结.pdf

    ⽬录 ⽬录 第⼀章、引⾔ 1.1 ⽂件管理系统坏处 1.2 数据视图 1.3 数据模型 1.4 数据库语⾔ 第⼆章、关系模型介绍 2.1 关系数据库的结构 2.2 数据库模式 2.3 码 第三章、SQL 3.1 SQL 查询语⾔概览 3.2 SQL数据定义 ...

Global site tag (gtag.js) - Google Analytics