BTIER-1.0.0 stable has been released

For some time people have been using btier in production. Some even use complex configurations that stack drbd and Oracle on top of btier. The good news is that even heavy users have not reported problems with btier.

Therefore the time has come to announce the first stable release.



17 Responses to BTIER-1.0.0 stable has been released

  1. Quintin says:

    Great news! Thanks for such an awesome tool. Do you have any plans to get BTIER merged into the mainline kernel?

  2. Pingback: First stable release of BTIER, a block device for aggregating storage drives in Linux | All-Russian portal about UNIX systems

  3. Stephan Budach says:

    Hi Mark,

    congrats on releasing 1.0! btier has so much potential and it has been rock solid for me for the past months.

    Thanks for such a wonderful and well-engineered tool.

  4. John says:

    Great project. I have tested it and found it to be a good option when lots of data needs to be written and accessed immediately. It would be great if you added a new feature.

    Something like /sys/block/sdtiera/tier/emergency_migrate

    There should be an option to define two percentage values:

    a) The safe usage level of the Tier 0 device (SSD).
    b) The percentage of the oldest data that should be moved immediately to Tier 1 (HDD).

    More specifically, Tier 0 usage would be monitored constantly against the values set in /sys/block/sdtiera/tier/emergency_migrate. For example, when Tier 0 usage reaches 85%, the emergency migration should begin and push the oldest 30% of the data blocks to Tier 1, which reduces Tier 0 usage and leaves more room for new write requests.

    • Mark Ruijter says:

      You can already implement this in userspace with the help of the API.
      Here are two examples that should give you an idea of what you can do with the API:


      To show the metadata of block number 22:
      echo 22 >/sys/block/sdtiera/tier/show_blockinfo

      And now simply:
      cat /sys/block/sdtiera/tier/show_blockinfo

      Migrate block 22 to tier 1:
      echo 22/1 >/sys/block/sdtiera/tier/migrate_block
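
Building on the `migrate_block` call in the reply above, a userspace "emergency migration" could be approximated by writing one `blocknr/tier` pair per block. The following is a minimal sketch under stated assumptions, not part of btier: `TIER_DIR`, `migrate_range` and `DRY_RUN` are illustrative names, and selecting which blocks are "oldest" (for instance by reading `show_blockinfo`) is left out because its output format is not described here.

```shell
#!/bin/sh
# Sketch: push a contiguous range of blocks to another tier through the
# sysfs file shown above, one "blocknr/tier" write at a time.
# TIER_DIR, migrate_range and DRY_RUN are hypothetical helper names.

TIER_DIR="${TIER_DIR:-/sys/block/sdtiera/tier}"

# migrate_range FIRST LAST TARGET_TIER
migrate_range() {
    first=$1
    last=$2
    target=$3
    blk=$first
    while [ "$blk" -le "$last" ]; do
        if [ -n "${DRY_RUN:-}" ]; then
            # Dry run: only print the command that would be issued.
            echo "echo $blk/$target >$TIER_DIR/migrate_block"
        else
            # Same write as the manual example: "22/1" moves block 22 to tier 1.
            echo "$blk/$target" >"$TIER_DIR/migrate_block" || return 1
        fi
        blk=$((blk + 1))
    done
}
```

Running `DRY_RUN=1 migrate_range 22 24 1` first prints the three writes without touching the device, which seems a sensible check before moving real data.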

  5. John says:

    Looks like this is not as stable as expected. After a 24-hour run all performance is gone and the btier status shows some weird results:

    [root@localhost tier]# cat /sys/block/sdtiera/tier/device_usage
    0 sda3 49945 0 4034061684 0 18446744073709551606 318
    1 sdb2 1837337 226617 1 4 3045229 8898157

    [root@localhost tier]#

    I tried deactivating LVM and rebooting the server, but no luck: the status is the same, and in atop I can see that all requests are pushed to the SATA back end (Tier 1).

    Do you have any solution for this?

    Also, how do I use /sys/block/sdtiera/tier/clear_statistics?

  6. Mark Ruijter says:

    The numbers that you are using are _way_ too low / aggressive.
    This will result in data being moved around all the time and therefore lower overall performance.

    As a minimum interval I would use 30~60 minutes for the SSD and a few hours for the SATA drive.
    To be effective this also requires adjusting the migration_interval. Default is once per 4 hours.

    For optimal performance the SSD should be able to contain your hot data.
    Usually this is approx. 10~25% of the total. So a 2TB SATA drive would be a good match for a 450GB SSD.
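
The sizing rule above can be checked with a line of arithmetic. This is only a sketch: `ssd_share_percent` is a hypothetical helper, and it assumes "the total" means the combined capacity of both tiers, which the comment does not state explicitly.

```shell
#!/bin/sh
# ssd_share_percent SSD_GB SATA_GB
# Prints the SSD's share of the combined capacity as a whole percentage
# (integer division, so the result is rounded down).
ssd_share_percent() {
    echo $(( $1 * 100 / ($1 + $2) ))
}

# The pairing mentioned above: 450GB SSD with a 2TB (2000GB) SATA drive.
ssd_share_percent 450 2000   # prints 18
```

Under that assumption the suggested pairing lands at 18%, inside the quoted 10~25% range.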

    You can also consider setting: echo 1 >/sys/block/sdtiera/tier/sequential_landing
    This will direct all sequential IO to the SATA drive, which can handle it pretty well.
    The SSD will then only be used for random IO, which is what it does best.


    > 4) Yes I changed /sys/block/sdtiera/tier/migration_policy
    > [root@localhost tmp]# cat /sys/block/sdtiera/tier/migration_policy
    > tier device max_age hit_collecttime
    > 0 sda3 10 10
    > 1 sdb2 60 60
    > 5) I am doing 80% sequential I/O and 20% random I/O. This is a test VM node with Raid10 SSD and Raid10 SATA. Intention is to offer SSD cached VMs targeted to heavy upload/download users.
    > Regards,
    > John
    > On Wed, Jul 3, 2013 at 3:27 PM, Mark Ruijter wrote:
    > Hi John,
    > It looks like the statistics counters have been corrupted or have overflowed.
    > Can you run btier_inspect for me before resetting the statistics?
    > It works similarly to btier_setup and dumps a backup of your metadata in /tmp.
    > ./btier_inspect -f /data/ssd.img:/data/sas.img -b
    > This will create these files in /tmp
    > bitlist0 and bitlist1 (since I had two devices)
    > magic_dev0 and magic_dev1
    > And the file blocklist0
    > Can you email me those files?
    > Your data is not part of them. They contain only btier metadata.
    > echo 1 >/sys/block/sdtiera/tier/clear_statistics will reset the statistics.
    > You can put that in the nightly cron when needed as well.
    > About pushing IO to SATA.
    > What is the output of :
    > cat /sys/block/sdtiera/tier/sequential_landing
    > Did you change : /sys/block/sdtiera/tier/migration_policy?
    > Are you doing random or sequential IO?
    > Let me inspect your metadata before coming to conclusions.
    > Mark
    > P.S. Can you also share your kernel version, and btier messages from dmesg or /var/log/messages should you have those?
    > On 7/3/13 3:04 AM, John wrote:
    > Looks like this is not as stable as expected. After a 24-hour run all performance is gone and the btier status shows some weird results:
    > [root@localhost tier]# cat /sys/block/sdtiera/tier/device_usage
    > 0 sda3 49945 0 4034061684 0 18446744073709551606 318
    > 1 sdb2 1837337 226617 1 4 3045229 8898157
    > [root@localhost tier]#
    > I tried deactivating LVM and rebooting the server, but no luck: the status is the same, and in atop I can see that all requests are pushed to the SATA back end (Tier 1).
    > Do you have any solution for this ?
    > Also, how to use /sys/block/sdtiera/tier/clear_statistics
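
The value 18446744073709551606 in the `device_usage` output above equals 2^64 − 10, which is consistent with an unsigned 64-bit counter that wrapped around. A rough way to spot such values before resetting the statistics is sketched below. `find_wrapped_counters` is a hypothetical helper, the 19-digit threshold is only a heuristic, and no meaning is assumed for the individual columns.

```shell
#!/bin/sh
# Sketch: flag device_usage fields that look like wrapped unsigned 64-bit
# counters. Heuristic (an assumption, not btier behaviour): a purely numeric
# field with 19 or more digits is larger than any plausible statistic.

# find_wrapped_counters FILE
find_wrapped_counters() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+$/ && length($i) >= 19)
                printf "line %d field %d: %s\n", NR, i, $i
    }' "$1"
}

# Possible nightly reset, as suggested in the quoted reply. The 03:00
# schedule is an arbitrary choice; the line would go in root's crontab:
# 0 3 * * * echo 1 >/sys/block/sdtiera/tier/clear_statistics
```

Calling `find_wrapped_counters /sys/block/sdtiera/tier/device_usage` on the output quoted above would report field 7 of the first line.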

  7. Yo Mark,

    Is LessFS Pining for the Fjords? In other words, is the project developmentally stalled AKA dead AKA kaput?

  8. Yuri Tcherepanov says:

    Mark – thank you for this great tool!

    Please review my patch:
    Configuration is moved to /etc/btier/* and is not overwritten on “make install”.
    /etc/btier/btmtab – automount configuration applied on boot/shutdown, for
    Debian/Ubuntu support in the init script.
    Building for a custom kernel version is also supported with: KVER= make

    • maru says:

      Hi Yuri,

      Sorry for the extreme delay. I failed to notice your comment when I last reviewed incoming messages. I’ll take a look at your patch and merge what makes sense.



  9. Yuri Tcherepanov says:

    KVER=some_kernel_version make

  10. Aleksi says:

    How about redundancy, what if the Tier0 SSDs fail?

    Assume a setup of 5xSSD in RAID5 and all of a sudden 2 of the SSDs fail at the same time?
    SSDs are so damn fragile :(

    Is there anything else to do besides adding a couple of drives and running RAID6 + 1x hot spare to minimize the risk, or … ?

    It would be a nice option to always keep a “backup copy” on the SATA drives for those running loads where data loss is more critical.
    Yeah, I know that would reduce btier to “just a cache”, but since btier has better (actually sane) algorithms for deciding what is hot and what is not, it would excel as “cache only” as well :)

    • maru says:

      For redundancy Linux has md that allows you to create stacked raids.
      Modern SSDs like the Intel S3700 are extremely reliable. Using RAID5 on them does seem silly though, since you buy these for IOPS.

      To minimize the risk you could think of some SSDs in RAID1, some SATA in RAID10 and some SATA in RAID6(0). And as always, since RAID is not backup, you should make sure that a proper backup regime is part of the solution.
      You can also replicate your data should you choose to do so.

      Adding another md layer to btier does not make sense to me.


      P.S. Just adding a backup copy on SATA is an oversimplified idea, since this copy would need to be able to handle the IOPS of the whole stack, which most likely includes SSDs.

      • Aleksi says:

        Still, in our experience SSDs can be fragile – the most worrying part is that they fail without any warning.
        Since wear is one issue with SSDs, what I’m afraid of is that too many SSDs fail at the same time.

        We are building bigger arrays – each of which may host 100 customer’s data – therefore redundancy is quite an important factor for us – but also the cost of implementation due to our niche (low end dedis meant for data distribution).

        Raid5 on SSD -> In our instance the performance hit does seem somewhat negligible :)
        Since we host system images, we only need to verify an average of 100+ IOPS per system – basically the same as a single magnetic disk in each system – but we also need to do this more cost-effectively than just having local disks in each system.
        We also don’t need to support ultra-high IOPS loads such as databases. For those we have higher-end models with local SSD drives for customers.

        This means we need to have combined higher total IOPS than the underlying magnetic drives, and simply use larger disks than otherwise would be used.

        All the caching software has been worse than just disappointing so far due to various design flaws, usually, killing the SSD performance, and then acting as a brake for the whole array :(

        We calculated the cost of each system having its own disks to be 7.6€ per month per disk – since we charge only 25€ a month for the cheapest one, this is too high a cost. A SAN also saves us a lot of other management headaches (while adding new ones, though).

        Since it’s pretty much the same cost no matter the disk size (operational cost is what makes bulk of the cost), it makes sense to utilize largest drives available, and since we can put several systems worth of data per each, and need to counter the additional cost of the base storage node cost, SSD caching/tiering is a must – if we can have reasonable redundancy at a reasonable cost.

        So it would be a nice feature for our use case to have a backup copy of the SSD tier data on magnetic drives, or even on a separate drive, just in case too many SSDs fail at once, since I don’t trust even the latest model SSDs to be sufficiently reliable.

        The next system I’m building will have 4xOCZ SSD on RAID5 + 20x3TB Cudas on RAID50.

        • matt says:

          > Next system i’m building will have 4xOCZ SSD

          Just stop right there. OCZ drives are crap and have been for a long time now. Buy Intel 3700 if you’re serious or, if you want to be cheap, Samsung 840 Pro. Otherwise buy the enterprise eMLC drives from the likes of Toshiba/Hitachi and now recently Samsung too. The key to drive longevity and consistency is 30+% spare area and buying GOOD drives, not consumer junk.

          I expect there is a way to intercept the migration workload such that you can copy off move candidates to other media so if the SSD goes ‘poof’ you can try to recover.

          That said, I think it would be really GREAT if BTier actually left the original blocks alone when doing the promotion and then periodically ran a ‘flush’ or ‘sync’ that would sweep through a tier and checkpoint any modified blocks back down one level. There wouldn’t be any guarantees of data consistency (i.e. if you try to fsck the device it may still not be valid).

          The most correct solution is to journal your filesystems to a reliable NON-btier device. If you’re exporting block storage, those clients too will need to journal elsewhere or take the risk of data loss/corruption if your home-made “SAN” loses a tier.

          Reliable, consistent storage is SLOW storage. It’s just the nature of the beast.

  11. Pulsed Media says:

    A warning for everyone: I managed to get FS corrupted by a reboot.
    It looks like one should always sync + detach btier before a reboot.

    e2fsck warns that the filesystem was not cleanly unmounted, and we did try some manual block migrations as well – maybe that is part of the reason?

    Not sure what causes this, but what we did was just a shutdown -r now, while all applications were still running (including iSCSI daemon) expecting things to be cleanly shutdown, unmounted etc.

    It might have nothing in particular to do with btier – just the way we have things set up – so my point is: make sure you have all the necessary shutdown procedures in place.

    Tho it still worries me a bit that FS got corrupted since servers are known to crash occasionally, and since this is 40+TiB array it takes a while to fsck.

    • maru says:

      Which btier version were you using when you had this problem?
      What filesystem were you using? That e2fsck takes a long time on a 40+TiB volume is hardly a surprise.
      Please reconsider whether you even want a single volume to be that large, or maybe switch to a journaled filesystem like xfs?

      A problem that could potentially cause data corruption was solved in btier-1.1.0.
