TIER-0.2.3 is available for download

This release of tier makes it possible to disable or enable data migration via sysfs.

To disable migration:
echo 1 > /sys/block/tiera/tier/disable_migration
To enable migration:
echo 0 > /sys/block/tiera/tier/disable_migration

When migration is re-enabled, the migration process immediately wakes up and starts. This feature makes it possible to schedule block migration at a convenient time. In future releases the sysfs interface will be expanded so that all migration-related parameters can be managed via sysfs.
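Since migration can now be toggled at runtime, pausing it during busy hours is a simple matter of two cron entries. The fragment below is only a sketch and assumes the tier device is named tiera, as in the examples above:

```shell
# Root's crontab: pause block migration at 08:00, resume it at 22:00.
# Assumes the device shows up as /sys/block/tiera; adjust to your setup.
0 8  * * * echo 1 > /sys/block/tiera/tier/disable_migration
0 22 * * * echo 0 > /sys/block/tiera/tier/disable_migration
```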


Mark Ruijter




How TIER works

Tier is a Linux kernel block device that aggregates multiple devices of different nature into one virtual block device. The idea is to combine (expensive) fast and (affordable) slow devices to build a high-performance virtual device. TIER differs from Flashcache and Bcache because it does not use the fast medium for caching only. In some ways TIER and bcache use comparable techniques: both, for example, try to handle random writes sequentially. TIER, however, goes one step further. It keeps track of data access patterns and will over time migrate aged data to a lower tier. It will also detect that some blocks are used more often than others and migrate these up to a higher tier.

The effects of data migration on performance

In a previous post I published performance numbers that compare TIER to bcache and flashcache. This time the fio test was repeated on TIER for several hours, which allowed optimization to take place.


read-seq : io=16635MB, bw=56778KB/s, iops=14194 , runt=300017msec
read-rnd : io=872528KB, bw=2908.4KB/s, iops=727 , runt=300007msec
write-seq: io=8237.5MB, bw=28117KB/s, iops=7029 , runt=300001msec
write-rnd: io=6038.4MB, bw=20611KB/s, iops=5152 , runt=300001msec


read-seq : io=20480MB, bw=103370KB/s, iops=25842 , runt=202878msec
read-rnd : io=936760KB, bw=3122.4KB/s, iops=780 , runt=300014msec
write-seq: io=15604MB, bw=53263KB/s, iops=13315 , runt=300001msec
write-rnd: io=6453.1MB, bw=22025KB/s, iops=5506 , runt=300016msec


read-seq : io=11911MB, bw=203277KB/s, iops=50819 , runt= 60001msec
read-rnd : io=116236KB, bw=1936.1KB/s, iops=484 , runt= 60009msec
write-seq: io=10507MB, bw=179324KB/s, iops=44831 , runt= 60001msec
write-rnd: io=1653.5MB, bw=24989KB/s, iops=6247 , runt= 67756msec


read-seq : io=13506MB, bw=230496KB/s, iops=57623 , runt= 60001msec
read-rnd : io=273316KB, bw=4554.6KB/s, iops=1138 , runt= 60010msec
write-seq: io=12675MB, bw=216311KB/s, iops=54077 , runt= 60001msec
write-rnd: io=2588.7MB, bw=44117KB/s, iops=11029 , runt= 60085msec

The price of optimization

As hardly anything in life comes for free, optimization comes with a price as well. When a volume is not in continuous use, optimization can take place in periods of relatively low traffic; in that case it works very well. When a volume is under continuous high load, choices have to be made: optimization will impact performance for as long as it takes place, although after it completes performance will most likely increase. The trick is therefore to perform optimization in such a way that the performance impact is acceptable while still keeping the optimization interval from becoming too low. This part of TIER is still a work in progress and may require different policies for different workloads. The graph below clearly shows the advantages and disadvantages of the optimization process. During this 24-hour test the optimization took place once per hour. There are, however, still a number of things that can be done to further reduce this negative impact, and future releases will focus on diminishing this effect as much as possible.



TIER-0.2.0 has been released

Tier-0.2.0 adds crash recovery and some bug fixes.

A brief benchmark of tier, flashcache and bcache with fio shows these results:
read : io=16635MB, bw=56778KB/s, iops=14194 , runt=300017msec
read : io=872528KB, bw=2908.4KB/s, iops=727 , runt=300007msec
write: io=8237.5MB, bw=28117KB/s, iops=7029 , runt=300001msec
write: io=6038.4MB, bw=20611KB/s, iops=5152 , runt=300001msec

read : io=20480MB, bw=103370KB/s, iops=25842 , runt=202878msec
read : io=936760KB, bw=3122.4KB/s, iops=780 , runt=300014msec
write: io=15604MB, bw=53263KB/s, iops=13315 , runt=300001msec
write: io=6453.1MB, bw=22025KB/s, iops=5506 , runt=300016msec

read : io=20480MB, bw=167819KB/s, iops=41954 , runt=124965msec
read : io=528236KB, bw=1760.8KB/s, iops=440 , runt=300012msec
write: io=20480MB, bw=172857KB/s, iops=43214 , runt=121323msec
write: io=5091.7MB, bw=17371KB/s, iops=4342 , runt=300141msec

The SSD used in this test had a size of 10GB while the SAS drive had a size of 100GB.

The fio configuration file that was used is:
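The file itself did not survive in this post. Purely as an illustration, a job file that exercises the same four patterns reported above (sequential and random reads and writes, 4k blocks, 300-second runs) could look like the reconstruction below; the target directory and size are assumptions:

```shell
# Write a fio job file similar in shape to the benchmarks above.
# This is a reconstruction, not the original configuration.
cat > tier-bench.fio <<'EOF'
[global]
directory=/mnt/tier
size=20g
bs=4k
direct=1
runtime=300

[read-seq]
rw=read

[read-rnd]
rw=randread

[write-seq]
rw=write

[write-rnd]
rw=randwrite
EOF
# Run it with: fio tier-bench.fio
```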

TIER-0.1.7 has been released

This version of tier comes with some major changes. The caching layer has been removed from the code; EPRD can be used in cases where caching is needed. The block size has also been changed: TIER now uses a 1MB block size, which greatly reduces the amount of metadata that has to be stored. TIER will now automatically migrate data between the different tiers. The policy that determines when a block should be migrated is still hard-coded in this release but will be adjustable per tier in future releases. TIER will detect unclean shutdowns and unfinished migrations after an unclean shutdown; however, this release does not yet handle recovery.


Introducing TIER

Tier is a Linux kernel module that can be used to create a block device with automatically tiered storage. Tier can aggregate up to 16 devices into one virtual device. Tier investigates access patterns to decide on which device the data should be written: it keeps track of how frequently data has been accessed as well as when it was last used, and uses this information to decide whether the data should be stored on, for example, SSD, SAS or SATA.

One advantage of tier when compared to SSD caching only is that the total capacity of the tiered device is the sum of all attached devices. Kernel modules like flashcache use the SSD as cache only and therefore the capacity of the SSD is not available as part of the total size of the device.

Since TIER combines the RAM caching techniques of EPRD it is very fast, even faster than what can be achieved with an SSD alone.

To get an impression of TIER performance I tested tier in the following configuration:
An Intel SSD of 160GB is used as the first tier, and the second tier is made up of 6 * 300GB SAS drives in software RAID10.
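With this configuration, the capacity advantage described earlier is easy to put a number on: a RAID10 set of 6 * 300GB drives yields 900GB of usable space, so the tiered device exposes roughly 1060GB, whereas an SSD-cache solution would expose only the 900GB:

```shell
# Usable capacity of the tiered device in GB:
# tier 1 = 160GB SSD, tier 2 = RAID10 over 6 x 300GB (half usable).
echo $((160 + 6 * 300 / 2))   # prints 1060
```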

The iometer test that is used comes from: http://vmktree.org/iometer/
Tier was configured with these parameters:

./tier_setup -f /dev/sdb:/dev/md1 -p 1000M -m 5 -b -c
                              TIER - SSD  - MD1(R10)
Max-throughput-100%read    : 32540 - 3796 - 2746
Reallife-60%rand-65%read   : 1927  - 3185 - 226
Max-Throughput-50%read     : 6890  - 1753 - 470
Random-8k-70%read          : 937   - 2870 - 401

As shown in the results table above, TIER outperforms the MD raid10 set in all tests. The SSD is faster in most cases, but not all. TIER can outperform the SSD because it was configured to use 1GB of RAM for caching, and because it benefits from the speed advantage that raid10 gives on sequential reads and writes.
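For readers who prefer ratios over raw numbers, the TIER and MD1 columns of the table can be divided directly; the awk one-liner below simply repeats the figures from the table:

```shell
# Speedup of TIER over the MD raid10 set (TIER result / MD1 result).
awk '{ printf "%-26s %.1fx\n", $1, $2 / $3 }' <<'EOF'
Max-throughput-100%read    32540 2746
Reallife-60%rand-65%read   1927  226
Max-Throughput-50%read     6890  470
Random-8k-70%read          937   401
EOF
```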



EPRD & lessfs

To get an idea of the efficiency of EPRD caching I repeated the lessfs benchmark test with EPRD caching the Intel 320 SSD.

The Intel 320SSD was registered as /dev/sdc.
EPRD was setup like this : ./eprd_setup -f /dev/sdc -m 3 -b -p 2048M
The databases eventually reach a size of 8.5 GB during this test.

Lessfs with and without EPRD

Lessfs with and without EPRD 2nd write

As the graphs show, a user-space application like Lessfs speeds up with EPRD even when it is used to cache a relatively fast medium like an Intel 320 SSD. I intend to test EPRD with a number of other applications as well; candidates that come to mind are, for example, OpenLDAP and MySQL.


Lessfs-1.5.12 performance


People frequently ask what performance they may expect from Lessfs. This article gives an indication of what to expect.

About the hardware

All the tests are done using an Intel 5520HC system board with a single E5520 processor @ 2.27GHz. The metadata is written to an Intel 320 SSD, while the data is written to five Hitachi HUA722010CLA330 SATA drives attached to an LSI MegaRAID controller in RAID5. The maximum transfer speed to the volume on the LSI controller is approximately 400 MB/sec. When I tested the same drives with Linux software raid5 I found it hard to get more than 250 MB/sec out of them. Even worse is the number of IOPS that you can get from the drives with software raid. So for now I will stick to hardware raid.

Installing lessfs

In this test we will set up lessfs with file_io and hamsterdb 2.0.1. After downloading and installing hamsterdb-2.0.1 we start by downloading lessfs.

wget http://sourceforge.net/projects/lessfs
tar xvzf lessfs-1.5.12.tar.gz
cd lessfs-1.5.12
./configure --with-hamsterdb --with-snappy
make -j4

In this example the RAID5 volume on the LSI raid controller is mounted on /data.
The SSD is mounted on /data/mta.

The configuration file used in this example can be downloaded here : lessfs.cfg

After downloading lessfs.cfg you will need to copy it to /etc

Please make sure that the directories /data/dta and /data/mta exist.
Now we can format lessfs and mount the filesystem:

./mklessfs -c /etc/lessfs.cfg
./lessfs /etc/lessfs.cfg /mnt

If everything went right, you should now have lessfs mounted on /mnt.

I now use a little tool to write 3000 files of 1GB each to lessfs. The files cannot be compressed and all have unique content. The second pass writes files that are 100% identical to those of the first pass and will therefore be written at a much higher speed. After this, the first files are read back from lessfs. This is the result:
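The tool itself was never published; a plain shell equivalent that produces unique, incompressible files with dd from /dev/urandom might look like the sketch below. The sizes are scaled down so it can be run anywhere; the actual test used COUNT=3000 and SIZE_MB=1024 with DEST pointing at the lessfs mount:

```shell
#!/bin/sh
# Sketch of the write test: create COUNT unique, incompressible files.
COUNT=${COUNT:-3}
SIZE_MB=${SIZE_MB:-1}
DEST=${DEST:-/tmp/lessfs-test}

mkdir -p "$DEST"
i=1
while [ "$i" -le "$COUNT" ]; do
    # /dev/urandom yields data that can be neither compressed nor deduplicated.
    dd if=/dev/urandom of="$DEST/file-$i" bs=1M count="$SIZE_MB" 2>/dev/null
    i=$((i + 1))
done
```

For the second pass of the original test the same files were written again; with this sketch that would amount to copying the generated files back over themselves, which lessfs can then deduplicate.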




Lessfs-1.5.11 has been released

Lessfs-1.5.11 now allows users to specify the cache size that hamsterdb will use internally.
This version also fixes a bug in configure.ac that caused configure with --disable-debug to actually enable debugging. This bug caused users to report very low performance in a number of cases.


Lessfs-1.5.10 has been released

Lessfs-1.5.10 adds support for hamsterdb-2.0.1. A small change in the Hamsterdb API makes the transition from the 1.x series not completely transparent. Do not use Lessfs with hamsterdb-2.0, since it comes with a nasty bug; please use the latest hamsterdb-2.0.1.

When compared with Berkeley DB, hamsterdb does not suffer from the performance degradation that Berkeley DB shows when the databases become large. Hamsterdb 2.X performance is considerably better than Berkeley DB or even Tokyocabinet. The code is, however, not as well tested or as widely used as the others.

Choose wisely ;-)



EPRD – An eventually persistent ramdisk / disk cache

Today I uploaded a kernel project that I call eprd. This kernel module allows you to create a persistent RAM disk. It can also be used to cache disk IO in DRAM. Of course this comes with all the dangers that volatile RAM introduces. EPRD does, however, support barriers: when barriers are enabled, any sync() on the file system will result in EPRD flushing all dirty buffers to disk. It also allows you to set a commit interval for flushing dirty buffers to disk.

This project can be useful whenever one needs more IOPS in a non critical environment. There is more to this project though. I am working on a kernel based high performance deduplicating block device. This project will share code and ideas from EPRD as well as Lessfs.


Mark Ruijter
