A tale about key value databases.

Introduction

Key value databases have been growing in popularity lately, and for good reason. Although these databases offer a limited set of features, they come with one important advantage: high performance. Not all key value databases are created equal, though; you can choose from at least a dozen of them. Since I started developing lessfs I have tried more than a few. Let me share my experiences with the most important key value databases.

A list of the key value databases evaluated for lessfs

A more complete list of key value databases can be found at NOSQL.

Berkeley DB

Lessfs development started with Berkeley DB. This database excels in reliability: no matter how the server crashes or goes down, BDB always recovers, as long as it has the log files to do so. I have used Berkeley DB extensively with OpenLDAP and it never let us down. Berkeley DB does have a dark side, though, one that a typical LDAP server will not reveal. LDAP servers are mostly used for, and optimized for, read operations, which is why Berkeley DB and LDAP work fine together. Everything changes when the application needs write performance. In that case Berkeley DB is very slow. Way too slow for the kind of performance that lessfs needs.

gdbm/ndbm

These are the old school databases. They are both slow and not reliable enough for modern applications. They do not support transactions / ACID operations and are easily damaged by, for example, an unexpected power-down. See the benchmark below for the performance numbers.

Tokyocabinet

Tokyocabinet is currently used with lessfs. Tokyocabinet really excels in speed: as far as I can tell it holds the speed record for a key value database that supports both read and write operations. Dan Bernstein’s constant database (cdb) is just a bit faster, but cdb is a constant database and therefore not what we are looking for. Tokyocabinet comes with a very nice API and is very well documented. So this appears to be the perfect key value database, right? Well, no. Tokyocabinet has a few problems of its own. In the beginning TC did not support transactions; the author relies on replication to keep his data safe. Later on, transaction support was added to TC, making it somewhat more resilient to crashes and unexpected power-downs. I am deliberately using the word ‘somewhat’ here. In practice it is still very easy to damage a TC database beyond repair: just pressing the power switch while the database is under load usually does the trick. I am not complaining that I might lose some data in that case, but the fact that the database can be damaged beyond repair is not acceptable in my opinion. It also looks like development on TC has stopped, or at least slowed down in favor of Kyotocabinet. Last but not least, my attempts to discuss problems with the author mostly result in one-way traffic. Especially when it comes to discussing data corruption problems, the silence is deafening.

Hamsterdb

Recently I came across hamsterdb, a key value database that targets the embedded market. Hamsterdb comes with very decent performance and is highly crash resilient: at least, I have not been able to corrupt hamsterdb with, for example, deliberate power-downs while the database is under high load. Hamsterdb is well documented and its source code is easy to read, which makes this database very attractive. It is also possible to buy a commercial license for a reasonable price, which makes it even better.

Performance numbers

Tokyocabinet comes with a set of performance tests in the /bros directory. I have created a few additional tests so that TC, BDB, and hamsterdb can be tested with and without transaction support.

NAME       COMMENT                    RECORDS   WRITE (SEC)  READ (SEC)
gdbm                                  1000000   7.385        2.505
TC         Without transactions       1000000   0.339        0.387
BDB        Without transactions       1000000   3.666        2.258
HAMSTERDB  Without transactions       1000000   1.376        1.360

TC         With transactions          10000     0.376        0.024
TC         HDBOTSYNC + transactions   10000     1440.078     0.029
BDB        With transactions          10000     116.897      2.312
HAMSTERDB  With transactions          10000     0.484        0.068

I tested TC with transactions twice: once without and once with HDBOTSYNC. With HDBOTSYNC, TC synchronizes the disk after every transaction. This should ensure that the database survives unexpected termination without loss of data. But since it then takes 1440 seconds to write 10000 records, this can hardly be seen as a valid solution to the database corruption problem.

Conclusion

There is a large number of very good open source key value databases available today. Each database has its own benefits and dark sides. Choosing one that is optimal for your application depends on the performance you require, as well as the reliability and support you can expect from the community. As always, the perfect solution does not exist. I would have liked to see a key value database that works with a log based approach. PBXT is an example of a database design that works this way, but regrettably PBXT is not available as a stand-alone key value database.


12 Responses to A tale about key value databases.

  1. Sean Leyne says:

    You should look at Firebird: it provides full ACID with extreme resiliency, without the need for log files. It supports an embedded deployment for easy integration.

  2. maru says:

    As far as I can tell, Firebird is what used to be Interbase?
    From a quick scan of the documentation I see that it is a SQL database and not a key value database, and therefore most likely much slower than its key value cousins. So the next question is: what about performance?

    It is a rather interesting product though.

    • Sean Leyne says:

      It seems that there is an intrinsic trade-off: performance vs. resiliency.

      Given the nature of an FS, it would seem that resiliency is what matters most.

      As for performance, assuming that your database statements will be prepared, the performance should be very good. I must admit that I have never been too concerned about the performance of an individual/simple insert/update/select statement for my application needs.

  3. Szycha says:

    Will you be switching from TC to anything else (e.g. Hamsterdb)?

    Thanks for your work, as usual.

  4. John says:

    I’m concerned that your testing does not take disk i/o into consideration, especially in the case of TC without HDBOTSYNC. In 0.376 sec, I doubt TC has yet written anything to disk. In real-world use of lessfs, once TC begins to commit to disk, performance drops horribly.

    I think that how many records can be inserted over a sustained period of time would be far more interesting than timing a short burst of records. Sustained throughput to multiple databases, potentially all using the same block device, would be closer to the use case of lessfs.

  5. Martin K. says:

    These guys accomplished 750,000 qps on a commodity server by writing a MySQL plugin that speaks NoSQL:
    http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html

    For lessfs, Embedded InnoDB might even be enough:
    http://www.innodb.com/products/embedded-innodb/

  6. I disagree with the overall premise of this page but I still think its pretty useful. I really like your writing style. Keep up the great work.

  7. Viet says:

    Hi. Thank you for the analysis. I’m considering HamsterDB as a replacement for SQLite, as I just need speed and reliability, without a schema. Would you please share the test code? Thank you.

  8. john says:

    Those numbers are surprising. Would you post code? GDBM should be more than twice as fast as Tokyo/Kyoto for reads, yet really slow for writes. GDBM is famous for this trade-off.

  9. Pingback: Architecture for a Deduplicated Archival Store: Part 2 | The Pseudo Random Bit Bucket
