EPRD – An eventually persistent ramdisk / disk cache

Today I uploaded a new kernel project that I call EPRD. This kernel module allows you to create a persistent RAM disk, and it can also use DRAM to cache disk I/O. Of course, this comes with all the dangers that volatile RAM introduces. EPRD does, however, support barriers. When barriers are enabled, any sync() on the file system causes EPRD to flush all dirty buffers to disk. You can also set a commit interval for flushing dirty buffers to disk.
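
A minimal sketch of how this looks in practice (the paths are just examples, and I am assuming the block device appears as /dev/eprda; check /dev after loading the module, and note that the eprd_setup flags may differ per version):

insmod ./eprd.ko
./eprd_setup -f /data/persramdisk.img -s 1G -c   # create a 1 GB file-backed EPRD device
mkfs.ext4 /dev/eprda                             # assumed device node, verify in /dev
mount /dev/eprda /mnt/fast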

This project can be useful whenever one needs more IOPS in a non-critical environment. There is more to this project, though. I am working on a kernel-based, high-performance deduplicating block device. That project will share code and ideas with EPRD as well as Lessfs.

Enjoy,

Mark Ruijter


33 Responses to EPRD – An eventually persistent ramdisk / disk cache

  1. Pingback: Computer World News – EPRD – a RAM implementation

  2. Pingback: EPRD – an implementation of a RAM disk with persistent data storage : Notes of a Novice Linux User

  3. Pingback: EPRD – a RAM implementation | AllUNIX.ru – an all-Russian portal about UNIX systems

  4. dimiz says:

    Hi Maru
    thanks for your great new project!!
    I had tried to find many solutions for a VM filesystem implementation until I found your project!!
    What do you think about a new feature for EPRD, like multi-tiering and caching the most-used data blocks?
    My idea is to use EPRD as a raw device for LVM and KVM virtualization; what do you think about that? (I know it's not safe.)
    Thanks so much

    • Mark Ruijter says:

      Hi Dimiz,

      The latest version of EPRD allows you to create a storage solution for VMware or KVM, for example. EPRD can now transparently export any disk; there is no need to reformat the drive. EPRD now supports sector sizes ranging from 512 to 4096 bytes.

      Testing SCST + EPRD to export an EPRD drive to VMware was very promising. A Windows 2008 guest machine running on VMware was capable of doing 20k IOPS on a SATA drive that by itself can handle 150 IOPS.

      Multi-tiering is a possibility, even with de-duplication.
      I’ll give it some thought.
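
      As a rough sketch, exporting an existing disk could look something like this (assuming -f also accepts a raw block device and that the EPRD device then shows up under /dev; exact flags and device names may differ per version):

      insmod ./eprd.ko
      ./eprd_setup -f /dev/sdb      # reuse the existing disk as backing store, no reformat
      # then export the resulting EPRD block device through SCST as an iSCSI LUN for VMware/KVM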

  5. dimiz says:

    Hi Mark
    thanks for the reply, a very promising solution!!
    My idea for a multi-tier solution is, for example, RAM+SSD+HDD, destaging data blocks based on some sort of per-device load and on the most-wanted block data (please take a look at this discussion: https://bbs.archlinux.org/viewtopic.php?id=113529).
    At the moment I have tried using EPRD as a Physical Volume in LVM and it works! So I can use it as a raw volume for my VMs.
    The best solution would be a multi-tier and dedupe filesystem!! But that is another story.
    Thanks so much for your work

  6. Thomas Mueller says:

    In this area, bcache is an interesting project too:

    http://bcache.evilpiepirate.org/
    http://lwn.net/Articles/394672/

    • maru says:

      Hi Thomas,

      bcache is surely interesting, as is flashcache.
      The problem with bcache is that it comes with its own kernel code instead of a clean patch that works with most kernels.
      Flashcache works well, but it requires you to reformat your drive, as I recall.

  7. Maza says:

    Hi,

    There doesn’t seem to be much documentation. Where can I find an explanation of the /etc/lessfs.cfg file?

    thanks.

  8. Mark says:

    Hi,

    I am trying to use hamsterdb with 1.5.9.
    I have used the default lessfs.etc-hamster, and created the directories.
    mklessfs -f -c /etc/lessfs.cfg works fine and creates files.
    lessfs /etc/lessfs.cfg /usr/lessfs does nothing and in /var/log/messages it says:
    Apr 19 12:24:01 localhost lessfs[10648]: ham_env_open failed to open the databases() returned error -8

    What am I doing wrong?

    Thanks.

  9. Maza says:

    How do I report a bug… in /var/log/messages:

    Apr 19 14:04:29 localhost lessfs[14514]: The filesystem is clean.
    Apr 19 14:04:29 localhost lessfs[14514]: Last used at : Thu Apr 19 14:04:22 2012
    Apr 19 14:04:48 localhost lessfs[14514]: Please report this bug and include hash and offset : 98BD2F5D0D748B83DE6A8A63271139DCF00D4AA6C93E6156
    Apr 19 14:04:48 localhost lessfs[14514]: file_tgr_read_data : read block expected offset 5263362560, size 145, got size 0

  10. Jean says:

    This sounds unbelievably useful, Mark, my congratulations!

    This is somewhat like the many projects aimed at caching HDDs with SSDs; however, none of the others clearly separates the two tasks of caching/optimizing the writes and caching/optimizing the reads, and they are actually mainly concerned with optimizing the reads. Yours is the only one concerned only with writes, which is more interesting for me, and it seems you did a great job! (I have not tried it yet.)

    – I would suggest you consider the existence of SSDs and combine your EPRD with another layer, let's call it DW (dual writer): you would use two backend devices, one SSD and one HDD. You write stuff simultaneously to the SSD and HDD, you return completion to the application when the SSD returns completion, but you unmap the data from the SSD as soon as the HDD returns completion, so that the SSD space can be re-used later. In this way you will be able to sustain very long bursts of I/O at the speed of an SSD, with the data eventually ending up on the HDD. You will need to slow down the I/O only if the HDD lags behind so much that the space on the SSD is exhausted.

    – I don't know if you already do this, but I suggest writing data to the HDD in LBA order (at each flush) so as to minimize seeks (i.e. reimplement the block-device queue scheduler). If you eventually create the Dual Writer, there is an additional optimization you can do: X upstream flushes should result in X flushes to the SSD downstream, which is needed for barrier correctness, but for the HDD you can do that in fewer than X flushes, and at every flush to the HDD you can sort blocks in LBA order. So you can actually sort blocks across upstream flush requests (i.e. across barriers), eventually obtaining a much higher HDD throughput.

    – Also, I suggest publicizing your project on the dm-devel mailing list ASAP, because Mike Snitzer has said he is working on bringing some caching/HSM mechanism to DM, see:
    https://www.redhat.com/archives/dm-devel/2012-March/msg00069.html
    Try to push your EPRD to him! Especially if you can implement the Dual Writer so that it can be sold as a RAM+SSD hybrid caching mechanism for writes.

    Now, speaking of lessfs: I would really like to see snapshotting in there, or maybe BTRFS_IOC_CLONE (aka cp --reflink), which can be used to implement a poor man's snapshotting mechanism (i.e. not instantaneous for a whole directory hierarchy, but close enough), but can also be used to clone single files, which is extremely useful per se. Please don't stop the 1.x project just yet if it is at all possible to implement BTRFS_IOC_CLONE in there (maybe with a small kernel patch for fuse to be pushed upstream).
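
    For reference, a reflink clone of a single file with standard coreutils on btrfs looks like this (not lessfs-specific, and the filenames are just examples, but it illustrates the mechanism I mean):

    cp --reflink=always bigfile.img bigfile-clone.img    # instant copy-on-write clone via the clone ioctl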

    Thanks for your great work

  11. Roman says:

    Hi,
    I'm trying to set up eprd (0.1.7), but:
    [root@srv-moni-01 eprd-0.1.7]# insmod ./eprd.ko
    insmod: error inserting './eprd.ko': -1 Device or resource busy

    and in /var/log/messages:
    Apr 26 13:38:39 srv-moni-01 kernel: EPRD : Version : 0.1.7
    Apr 26 13:38:39 srv-moni-01 kernel: EPRD : misc_register failed for control device

    [root@srv-moni-01 eprd-0.1.7]# uname -a
    Linux srv-moni-01.local 2.6.39-100.5.1.el6uek.x86_64 #1 SMP Tue Mar 6 20:26:00 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

    What’s wrong?
    Thx.

    • maru says:

      Good question.

      Can you send me the output of ls -l /dev/ ?

      • Roman says:

        The problem was resolved.
        I changed the definition of EPRD_CTRL_MINOR (in eprd.h) from 235 to 2211:
        crw-rw----. 1 root root 10, 2211 Apr 26 16:04 eprdcontrol

        The minor number 235 was already in use on the system by autofs:
        crw-rw----. 1 root root 10, 235 Apr 26 13:50 autofs
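
        In case it helps anyone else, a quick way to see which misc minor numbers are already taken before picking a new one (standard kernel interfaces, nothing EPRD-specific):

        cat /proc/misc                      # registered misc devices and their minor numbers
        ls -l /dev | grep '^c.* 10,'        # character devices on the misc major (10)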

  12. Roman says:

    Mark,
    I sent the output by email.

  13. Renich says:

    Hello,

    Congratulations on this project. Are you aware of: http://code.google.com/p/compcache/wiki/zramperf

    At first sight the project looks abandoned, but it is not. It is already in the staging modules area and, recently, there has been a lot of talk about compcache and zram on the kernel mailing lists.

    I think you guys have things in common.

  14. Riccardo says:

    Maru,
    Do you think that EPRD combined with DRBD could be used as a mirrored cache in an HA iSCSI SAN?

  15. Kirill says:

    On Fedora 17
    [root@ws04 eprd-0.4.2]# uname -a
    Linux ws04.local 3.5.0-2.fc17.i686.PAE #1 SMP Mon Jul 30 15:18:54 UTC 2012 i686 i686 i386 GNU/Linux

    [root@ws04 eprd-0.4.2]# ./eprd_setup -f /home/persramdisk.img
    Using an existing blockdevice
    datafile=/home/persramdisk.img
    commit_interval = 0
    Blockdevice size = 1073741824
    ioctl EPRD_SET_DTAFD failed on /dev/eprdcontrol
    ioctl EPRD_SET_DTAFD failed

    • maru says:

      Did you initially create the device with -c ?

      • Kirill says:

        [root@ws04 eprd-0.4.2]# ./eprd_setup -f /home/persramdisk.img -s 1G -c

        creating a new blockdevice
        datafile=/home/persramdisk.img
        commit_interval = 0
        Blockdevice size = 1073741824
        ioctl EPRD_SET_DTAFD failed on /dev/eprdcontrol
        ioctl EPRD_SET_DTAFD failed

        • maru says:

          Some more questions:
          Has the file been created?
          Which kernel / distro?

          Anything special when you look at dmesg?

          • Kirill says:

            [root@ws04 home]# ls -al
            total 16
            drwxr-xr-x. 4 root root 4096 Aug 14 19:46 .
            drwxr-xr-x. 21 root root 4096 Aug 15 09:57 ..
            drwx------. 51 kirill kirill1 4096 Sep 29 2011 kirill
            -rw------- 1 root root 1073741824 Aug 14 19:46 persramdisk.img
            drwx------ 24 user user 4096 May 28 14:22 user

            from dmesg
            [ 0.000000] Kernel command line: initrd=FedoraXX_32/initramfs.ws04 ro root=/dev/sde1 netroot=iscsi:@192.168.184.220::3260::iqn.2010-11.bitel.office.vs01:ws04-fc16 iscsi_initiator=iqn.2010-11.bitel.office.vs01 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=ru_RU.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us plymouth.enable=0 selinux=0 BOOT_IMAGE=FedoraXX_32/vmlinuz.ws04

  16. eric says:

    I just found a project you can take a look at: http://sourceforge.net/projects/pramfs/
    Is EPRD still not usable for production…?
    Thanks.

  17. Sahil says:

    Hi Maru

    I’m working on a similar project and I was wondering if you ever had to face issues reading the image file within the kernel? I’m trying to use vfs_read to read my image file but the call to read freezes! My read routine looks like yours and I don’t see you doing anything different. But I’m not sure what’s causing my reads to hang.

    Sahil

  18. Roemer2201 says:

    I think I have to report a bug:

    lessfs[1123]: Please report this bug and include hash and offset : E855F9952B32424D751A236B84F43F139B3A7E7FCF23BE61
    lessfs[1123]: file_tgr_read_data : read block expected offset 10784662016, size 227, got size 0

    Do you need more information about my config?
