Open Source data deduplication for less.

March 25th, 2009

lessfs – A high performance inline data deduplicating filesystem for Linux.

lessfs is released under the GNU GPLv3 license and can be downloaded from:

http://sourceforge.net/project/showfiles.php?group_id=257120

Buy Me a Beer

Read the rest of this entry »

A tale about key value databases.

August 28th, 2010

Introduction

Key value databases are becoming more popular lately. And for good reason. Although these databases offer a limited set of features they do come with one important feature : high performance. Not all key value databases are created equal though. You can choose from at least a dozen or so. Since I started to develop lessfs I have tried more then a view of them. Let me share my experiences with the most important key value databases.

A list of the key value databases evaluated for lessfs

A more complete list of key value databases can be found on : NOSQL

Berkeley DB

Lessfs development started with using Berkeley DB. This database excels in reliability. No matter how the server crashes or goes down, bdb always recovers as long as it has the log files to do so. I have used Berkeley DB extensively with OpenLdap and it never let us down. Berkeley DB does have a dark side though. But a typical Ldap server will not reveal this dark side. Ldap servers are mostly used and optimized for read operations. This is why Berkeley DB and Ldap work fine. Everything changes when the application needs write performance. In this case Berkeley DB is very slow. Way to slow for the kind of performance that Lessfs needs.

gdbm/ndbm

These are the old school databases. These databases a both slow and not reliable enough for modern applications. They do not support transactions / ACID operations and are easily damaged upon for example an unexpected power down. See the benchmark below for the performance numbers.

Tokyocabinet

Tokyocabinet is currently used with Lessfs. Tokyocabinet really excels in speed. As far as I can tell it holds the record on speed for a key value database that supports both read and write operations. Dan Berstein’s constant database (cdb) is just a bit faster. But cdb is a constant database and therefore not what we are looking for. Tokyocabinet comes with a very nice API and is very well documented. So this appears to be the perfect key value database, right? Well, no. Tokyocabinet comes with a few problems of it’s own. In the beginning TC did not come with transactions. The author relies on replication to keep his data safe. Later on support for transactions was added to TC making it somewhat more resilient to crashes and unexpected power-downs. I am deliberately using the word ’somewhat’ here. In practice it is still very easy to damage a TC database beyond repair. Just pressing the power switch when the database is under load usually does the trick. I am not complaining that I might loose a some data in this case. But the fact that the database might be damaged beyond repair is not acceptable in my opinion. It also looks like development on TC has stopped or at least slowed down in favor of Kyotocabinet. Last but no least my attempts to discuss problems with the author mostly results in one way traffic. Especially when it comes to discussing data corruption problems the silence is deafening.

Hamsterdb

Recently I came across hamsterdb. A key value database that targets the embedded market. Hamsterdb comes with very decent performance and is highly crash resilient. At least I have not been able to corrupt hamsterdb with for example deliberate power-downs when the database is under high load. Hamsterdb is well documented and the hamsterdb source code is easy to read, which makes this database very attractive. It is also possible to buy a commercial license for a reasonable price which makes it even better.

Performance numbers.

Tokyocabinet comes with a set of performance tests in the /bros directory. I have created a few additional tests to be able to test TC/BDB and hamsterdb with and without transaction support.

NAME COMMENT RECORDS WRITE(SEC) READ(SEC)
gdbm 1000000 7.385 2.505
TC Without transactions 1000000 0.339 0.387
BDB Without transactions 1000000 3.666 2.258
HAMSTERDB Without transactions 1000000 1.376 1.360





TC With transactions 10000 0.376 0.024
TC HDBOTSYNC+transactions 10000 1440.078 0.029
BDB With transactions 10000 116.897 2.312
HAMSTERDB With transactions 10000 0.484 0.068

I tested TC twice, once with and without HDBOTSYNC. With HDBOTSYNC TC synchronizes the disk after every transaction. This should ensure that the database survives unexpected termination without loss of data. But since it then takes 1440 seconds to write 10000 records this can hardly be seen as a valid solution to the database corruption problem.

Conclusion

There is a large number of very good open source key value databases available today. Each database has its own benefits and dark sides. Choosing one that is optimal for your application depends on the performance that is required as well as reliability and support that you can expect from the community. As always the perfect solution does not exist. I would have liked to see a key value database that works with a log based approach. PBXT is an example of a database design that works this way but regrettably PBXT is not available as merely a key value database.

Lessfs-1.1.6 is available for download.

August 22nd, 2010

This versions adds two new features to lessfs.
A. Synchronous replication.
Lessfs now supports a master / slave configuration. In this case the master will synchronously replicate the data to the slave. The slave is read-only.
The master will freeze for ever when the slave becomes unreachable. Many options will be added in the future, including maintaining a replication backlog and async replication. The replication feature should be considered alpha code.
B. Background delete.
When enabled lessfs will spawn a background thread that will take care of the truncation/delete task. Lessfs will therefore return immediately after a delete or truncate operation. This is a new feature and is therefore not enabled by default.

Enjoy,

Mark Ruijter

Lessfs-1.1.6-testing is available for download.

July 22nd, 2010

Lessfs with background deletion and truncation.

Today I decided to upload a pre-release of lessfs-1.1.6. It has a new feature that a number of users have been lobbying fore. The feature is background deletion / truncation of files.

Let me explain this a bit further. Up till now deleting large files has been an operation that required patience. Lessfs will have to go through the deleted chunks to see if they are reference by any other file. If not it can be put on the freelist (file_io) or deleted from the database with the tc data store.

This test release now marks the file as being deleted and spawns a thread that will do the actual work in the background. While the thread is running it sets a lock on the inode and a lock per hash / per block. The only disadvantage is that when you unmount lessfs, the actual lessfs process may stay around for some time until it has finished it’s background tasks. In the future these delete/truncate operations may become restartable but for now they are not.

You can download the new code from:
lessfs_background_delete

Enjoy,

Mark Ruijter

Lessfs-1.1.5 is now available for download

July 21st, 2010

This version improves crash recovery and solves a potential deadlock when lessfs is used with samba.
Lessfs now inserts test transactions in the databases when lessfs is mounted. When a transactions fails the database it automatically optimized. Instead of halting when finding ‘non fatal inconsistency’ lessfs will now repair the problem at runtime and log the inode and blocknr of the file that has been corrupted.

Lessfs-1.1.3 is available.

July 15th, 2010

This release fixes a problem where lessfs would leave orphaned data chunks in the system under high load.
A number of issues with lessfschk has also been solved.

Lessfs-1.1.2 has been released.

July 7th, 2010

Everyone who is using a 1.1.x release is encouraged to upgrade to this version. It solves a race condition that will crash lessfs when it is put under heavy load.
This release also improves performance.

Lessfs-1.1.0 has been released.

June 30th, 2010

Lessfs-1.1.0 is now available for download.
As with any .0 release there may still be small issues that can pop up.

The good news is that lessfs-1.1.0 is very fast, both for reading and writing. Compression has improved in many ways when compared to lessfs-1.0.8. Lessfs now does not compress a whole (BLKSIZE) block of data when only a small part of the block is actually filled with data. Lessfs-1.0.x would fill a block with zero’s, copy the data in the block and then compress the whole block. This resulted in bad compression for small files.

Lessfs-1.1.x now supports wide a range of compression algorithms, like deflate,gzip,lzo,qlz and zlib.

For people who would like to reduce the possibility of a hash collision to almost nil lessfs now supports up to 64 byte long hashes like whirlpool.
The popular sha256 is now also supported.

A start has been made to provide better compression statistics in .lessfs/lessfs_stats. Lessfs can now deliver very decent performance with ietd, especially when ietd has been patched to write with the same blocksize that lessfs uses.

Enjoy,

Mark Ruijter

PLEASE NOTE THAT LESSFS-1.1.x is not compatible with LESSFS-1.0.x.

Lessfs-1.1.0-beta6 has been released.

June 21st, 2010

Lessfs-1.1.0-beta6 is now available for download. This release brings much improved read performance. Lessfs-1.1.0-beta6 should be the last beta release. When no bugs are reported the coming week this code will be released as 1.1.0. People who would like to download a full comparison between zfs,sdfs and lessfs can download the iozone graphs from this location:

iozone – ZFS/SDFS/LessFS

Enjoy,

Mark Ruijter

Lessfs-1.1.0-beta4 is available.

June 19th, 2010

Lessfs-1.1.0-beta4 is now available for download. This release improves write performance when writes are done with small block sizes. The graph below displays the write performance of XFS, the fastest filesystem that is available for Linux and all other de-duplicating filesystems that are available for Linux. Both Lessfs and ZFS-fuse are deduplicating and compressing the data. SDFS does deduplicate but does not compress the data.

Lessfs-1.1.0-beta3 is out and looks promising.

June 15th, 2010

This version of lessfs now comes with a ‘per hash’ lock which protects a hash / data chunk while it is still being written to disk. The way this is done does not effect performance because it is highly granular. This said, I should still check if using a spinlock instead of mutex locking improves performance. Other then that this code looks pretty stable.

For those who are interested in performance. The graph below compares XFS with ZFS-fuse and Lessfs-1.1.0-betaX On read performance Lessfs is outperformed by ZFS. But when it comes to writing, Lessfs is a clear winner. Lessfs will be tuned for metadata operations in the upcoming 2.x release. LFS now uses the high level fuse API but will switch to the lowlevel API for better metadata performance.