This version of tier comes with some major changes. The caching layer has been removed from the code; EPRD can be used in cases where caching is needed. The block size has also been changed: TIER now uses a 1MB block size, which greatly reduces the amount of metadata that has to be stored. TIER will now automatically migrate data between the different tiers. The policy that determines when a block should be migrated is still hard coded in this release, but will be adjustable per tier in future releases. TIER will detect unclean shutdowns and unfinished migrations after an unclean shutdown; however, this release does not yet handle recovery.
Tier is a Linux kernel module that can be used to create a block device that provides automatically tiered storage. Tier can aggregate up to 16 devices into one virtual device. Tier examines access patterns to decide on which device the data should be stored. It keeps track of how frequently data has been accessed as well as when it was last used. Tier uses this information to decide whether the data should be written to, for example, SSD, SAS or SATA.
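To make the idea concrete, a frequency/recency based placement decision could look roughly like the sketch below. This is a toy illustration, not TIER's actual hard coded policy; the function name, thresholds and tier numbering are all invented here.

```python
# Hypothetical sketch of a frequency/recency based placement decision.
# NOT TIER's real policy: name, thresholds and tier numbers are invented.
import time

def pick_tier(access_count, last_access, now=None,
              hot_hits=10, recent_secs=3600):
    """Return 0 (fast tier, e.g. SSD) or 1 (slow tier, e.g. SATA)."""
    now = time.time() if now is None else now
    recently_used = (now - last_access) < recent_secs
    if access_count >= hot_hits and recently_used:
        return 0   # hot and recently used: keep on the fast tier
    return 1       # cold or stale: place on the slow tier
```

A block hit 50 times a minute ago stays on the fast tier, while a block that has not been touched for a day is a candidate for demotion.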
One advantage of tier compared to SSD caching is that the total capacity of the tiered device is the sum of all attached devices. Kernel modules like flashcache use the SSD as a cache only, so the capacity of the SSD is not available as part of the total size of the device.
Since TIER incorporates the RAM caching techniques of EPRD it is very fast. Even faster than what can be achieved with SSD alone.
To get an impression of TIER performance I tested tier in the following configuration.
An Intel SSD with a size of 160GB is used as the first tier; the second tier is made up of 6 * 300GB SAS drives in software RAID10.
The iometer test that is used comes from: http://vmktree.org/iometer/
Tier was configured with these parameters:
./tier_setup -f /dev/sdb:/dev/md1 -p 1000M -m 5 -b -c
Test                     | TIER  | SSD  | MD1(R10)
Max-throughput-100%read  | 32540 | 3796 | 2746
Reallife-60%rand-65%read | 1927  | 3185 | 226
Max-throughput-50%read   | 6890  | 1753 | 470
Random-8k-70%read        | 937   | 2870 | 401
As the results table above shows, TIER outperforms the MD raid10 on all tests. The SSD is faster in most cases, but not all. TIER can outperform the SSD because it was configured to use 1GB of RAM for caching, and TIER benefits from the speed advantage that raid10 gives on sequential reads and writes.
To get an idea of the efficiency of EPRD caching I repeated the lessfs benchmark test with EPRD caching the Intel 320 SSD.
The Intel 320SSD was registered as /dev/sdc.
EPRD was set up like this: ./eprd_setup -f /dev/sdc -m 3 -b -p 2048M
The databases eventually reach a size of 8.5 GB during this test.
As the graphs show, a user space application like Lessfs speeds up with EPRD even when it is used to cache a relatively fast medium like an Intel 320 SSD. I intend to test EPRD with a number of other applications as well. Candidates that come to mind are, for example, OpenLDAP and MySQL.
People frequently ask what performance they may expect from Lessfs. This article gives an indication of what to expect.
About the hardware
All the tests are done using an Intel 5520HC system board with a single E5520 processor @ 2.27GHz. The metadata is written to an Intel 320 SSD while the data is written to 5 Hitachi HUA722010CLA330 SATA drives attached to an LSI Megaraid controller in RAID 5. The maximum transfer speed to the volume on the LSI controller is approximately 400 MB/sec. When I tested the same drives with Linux software raid5 I found it hard to get more than 250MB/sec out of them. Even worse is the number of IOPS that you can get from the drives with software raid. So for now I will stick to using hardware raid.
In this test we will set up lessfs with file_io and hamsterdb 2.0.1. After downloading and installing hamsterdb-2.0.1 we start by downloading lessfs.
tar xvzf lessfs-1.5.12.tar.gz
./configure --with-hamsterdb --with-snappy
In this example the RAID5 volume on the LSI raid controller is mounted on /data.
The SSD is mounted on /data/mta
The configuration file used in this example can be downloaded here: lessfs.cfg
After downloading lessfs.cfg you will need to copy it to /etc
Please make sure that the directories /data/dta and /data/mta exist.
Now we can format lessfs and mount the filesystem:
./mklessfs -c /etc/lessfs.cfg
./lessfs /etc/lessfs.cfg /mnt
When everything has gone right you should now have lessfs mounted on /mnt.
I now use a little tool to write 3000 files of 1GB each to lessfs. The files cannot be compressed and all have unique content. The second pass writes files that are 100% identical to those of the first pass and will therefore be written at a much higher speed. After this the first files are read back from lessfs. This is the result:
Lessfs-1.5.11 now allows users to specify the cache size that hamsterdb will use internally.
This version also fixes a bug in configure.ac that caused configure with --disable-debug to actually enable debugging. This bug caused users to report very low performance in a number of cases.
Lessfs-1.5.10 adds support for hamsterdb-2.0.1. A small change in the Hamsterdb API makes the transition from the 1.x series not completely transparent. Do not use Lessfs with hamsterdb-2.0 since it comes with a nasty bug. Please use the latest hamsterdb-2.0.1.
When compared with Berkeley DB, hamsterdb does not suffer from the performance degradation that Berkeley DB exhibits when the databases become large. Hamsterdb 2.X performance is considerably better than that of Berkeley DB or even Tokyocabinet. The code is, however, not as well tested or as widely used as the others.
Today I uploaded a kernel project that I call eprd. This kernel module allows you to create a persistent ram disk. It can also use DRAM to cache disk IO. Of course this comes with all the dangers that volatile ram introduces. EPRD does however support barriers. When barriers are enabled any sync() on the file system will result in EPRD flushing all dirty buffers to disk. It also allows you to set a commit interval for flushing dirty buffers to disk.
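The flush behaviour can be modelled in a few lines. The sketch below is a toy user-space model, not EPRD's kernel code; the class and method names are invented and a dict stands in for the backing device.

```python
# Toy model of EPRD's write-back behaviour: writes land in RAM and are
# flushed to the backing store either on sync() (barriers) or when the
# commit interval timer fires. Illustration only; all names invented.
class PersistentRamDisk:
    def __init__(self, backing, commit_interval=5):
        self.backing = backing            # dict standing in for the disk
        self.commit_interval = commit_interval
        self.dirty = {}                   # sector -> data, RAM only

    def write(self, sector, data):
        self.dirty[sector] = data         # fast path: touches RAM only

    def sync(self):
        # With barriers enabled, a sync() flushes ALL dirty buffers.
        self.backing.update(self.dirty)
        self.dirty.clear()

    def on_timer(self):
        # Invoked every commit_interval seconds by a background thread.
        self.sync()

disk = {}
ramdisk = PersistentRamDisk(disk)
ramdisk.write(0, b"hello")    # still only in RAM at this point
ramdisk.sync()                # now persisted to the backing store
```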
This project can be useful whenever one needs more IOPS in a non-critical environment. There is more to this project though. I am working on a kernel based high performance deduplicating block device. This project will share code and ideas from EPRD as well as Lessfs.
This version of Lessfs contains some minor bug fixes as well as something new. Although my previous post states that no new features would be added to the 1.x series, this release actually does add one: Lessfs now supports the LZ4 compression algorithm. Adding support for new compression methods to lessfs is not much work at all, and there have been a number of votes for adding LZ4 so that it can be compared with Google's snappy compression.
I have not yet tested LZ4 on high end hardware. However, even on my laptop it is clear that LZ4 outperforms snappy. With the hardware being the bottleneck, LZ4 still manages to speed things up by 2~5%. Most likely the difference will be larger on fast hardware. The system that I use for performance testing has Berkeley DB stored on SSD and the data on a fast raid5 array containing 8 SATA drives.
I will post the exact performance numbers on low and high end hardware after testing has finished.
This version of Lessfs comes with a significant number of changes.
The multifile_io backend is now fully functional with replication.
By default Lessfs will now compile with berkeleydb instead of tokyocabinet. Lessfs requires berkeleydb >= 4.8.
Batch replication has been extensively tested and improved. Some nasty problems that could occur when either the master or the slave suffered an unclean shutdown have been solved.
In the SPEC files snappy compression is now enabled by default.
Lessfs-1.6.0 will be the last of the 1.x series releases that introduces new features. From now on the Lessfs-1.x series will remain frozen and new releases will only contain bug fixes.
How lessfs plays tetris and wins
Many data de-duplicating file systems or backup solutions struggle with the same thing. Deleting data from such a file system is complicated since a single chunk of data can be used by many files. Things become even more complicated when the file system also compresses the data. In that case the chunks no longer have a nice equal size but instead they can have any size between zero and the maximum allowed block size. This makes reusing the available space similar to playing tetris.
Solving the puzzle
The current versions of Lessfs have two ways to handle garbage collection. The file_io backend simply keeps a freelist with offsets that can be reused by the file system. It does not free up space to the underlying file system; it just no longer grows when free space can be found in the file. This strategy comes with a number of drawbacks. One disadvantage is that finding and filling holes in the file takes time and causes the IO to become very random, which of course is bad for throughput. And sadly your disk will still be filled with a large blockdata file, even when you have removed most of the data from the filesystem. The chunk_io backend does not have this disadvantage. When you delete the data from the file system the individual chunks are simply removed and all is well. Or is it? The disadvantage in this case is that millions of chunks will result in millions of files that have to be stored on the underlying file system. Btrfs handles this fairly efficiently and is therefore usable. However, all file systems suffer when many millions of files have to be stored or deleted.
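The file_io freelist strategy can be sketched in a few lines. This is an illustration of the idea, not lessfs's actual code; the class name and first-fit policy are assumptions made for the example.

```python
# Minimal sketch of the file_io strategy described above: a freelist of
# (offset, size) holes is consulted before growing the data file.
# Illustration only, not lessfs's actual allocator.
class FreelistAllocator:
    def __init__(self):
        self.end = 0          # current end of the data file
        self.freelist = []    # (offset, size) holes left by deletions

    def alloc(self, size):
        # First fit: reuse any hole that is large enough.
        for i, (off, sz) in enumerate(self.freelist):
            if sz >= size:
                if sz > size:
                    self.freelist[i] = (off + size, sz - size)
                else:
                    del self.freelist[i]
                return off
        off = self.end        # no hole found: the file grows
        self.end += size
        return off

    def free(self, off, size):
        self.freelist.append((off, size))
```

Note how the file never shrinks: freeing only records a hole, so the blockdata file stays large even after most data has been deleted, exactly the drawback described above.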
Problem solved : Lessfs multifile_io
Lessfs 1.6 introduces a new backend, multifile_io, that addresses all the problems that were mentioned before.
Data is now stored in chunks that are rounded up to a multiple of 512 bytes. So a compressed chunk with a size of 4000 bytes will allocate 4096 bytes on disk. Lessfs simply opens 256 files: one file for chunks that are 512 bytes in size, one file for chunks that are 1024 bytes in size, and so on. This simplifies our game of tetris quite a bit, since you can now easily move a block from the top of the file to a hole somewhere at the bottom of the file. However, doing so on a live file system would be rather complicated and not safe at all. Therefore Lessfs opens two sets of 256 files. The first file set is active for writing data while the second is being optimized. When Lessfs is done optimizing the second file set the roles switch: writes are then done to the second file set while the first is optimized. Since Lessfs uses transactions, it switches the file set used for writing at a moment when Lessfs is stable, the transactions are committed and no writes are in flight. Lessfs also waits before actually truncating the optimized files, so that it is certain this can be done safely because the databases have already been committed to disk.
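The size-class scheme is easy to express in code. A small sketch, with helper names invented here:

```python
# Sketch of the 512-byte size classes described above: a chunk is
# rounded up to a multiple of 512 bytes and each multiple has its own
# data file, 256 files in total (512 .. 131072 bytes). Helper names are
# invented for this illustration.
def size_class(chunk_len, granularity=512):
    """Bytes actually allocated on disk for a compressed chunk."""
    return -(-chunk_len // granularity) * granularity  # round up

def file_index(chunk_len, granularity=512):
    """Which of the 256 data files this chunk lands in (0-based)."""
    return size_class(chunk_len, granularity) // granularity - 1
```

A 4000 byte compressed chunk thus allocates 4096 bytes and is stored in the file that holds only 4096-byte records, which is what makes relocating records within a file trivial.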
To be able to relocate a chunk of data that is stored at the end of a file we need to be able to determine the hash of the data. In theory it is possible to uncompress the chunk and recalculate the hash. Instead, Lessfs simply stores the hash before the data chunk. This also makes it possible to easily verify or even relocate the data with a separate program.
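A sketch of such a hash-prefixed record, as an illustration only: lessfs's real on-disk layout and hash algorithm may differ, and the length field here is an assumption made to keep the example self-contained.

```python
# Illustrative record layout: hash first, then the (compressed) chunk,
# so an external tool can verify or relocate chunks without the
# databases. NOT lessfs's actual format; layout and hash are assumed.
import hashlib
import struct

def pack_chunk(data):
    digest = hashlib.sha256(data).digest()          # 32-byte hash first
    return digest + struct.pack("<I", len(data)) + data

def unpack_and_verify(record):
    digest = record[:32]
    (length,) = struct.unpack("<I", record[32:36])
    data = record[36:36 + length]
    assert hashlib.sha256(data).digest() == digest, "corrupt chunk"
    return data
```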
In fact this makes a tiered solution, with data automatically migrating between SSD/SAS/SATA, very simple to implement, although data migration is even easier with the chunk_io backend.
So there you have it. Lessfs now supports online space reclamation that is safe and efficient, even though lessfs uses data compression.
Things to come
On the top of my list is now switching to the lowlevel fuse interface. This will make Lessfs much faster in combination with SAMBA or NFS. Improving replication and support for data tiering are also high on the list. When Lessfs switches to the lowlevel API, support for tokyocabinet as a database will most likely be dropped. Support for using TC as a data store will, however, disappear for sure. This removes a lot of obsolete code from the project, which is always a good thing.
Lessfs-1.6.0-alpha0 is the first release that contains multifile_io. This is still alpha quality code and replication does not yet work with multifile_io.