Lessfs2 development has started

Whenever we create something new for the first time, there is always the feeling afterwards that we could have done better. Lessfs is no exception to the rule, and for some time I have been wanting to rewrite the code. One of the weaknesses of the current lessfs code is that it uses the high-level FUSE API. That API works with NFS to some extent, but it is far from optimal, and metadata operations are not as fast as they could be if Lessfs used the low-level API. On top of that, there are a number of features that Lessfs1 still lacks.

How will lessfs2 be different?

Obsoleted:

  • Support for tokyocabinet as datastore has been removed.

New features:

  • Support for snapshots
  • Uses the FUSE low level API
  • Fast inode cloning
  • Support for hamsterdb, and possibly BerkeleyDB, alongside tokyocabinet for storing the metadata
  • Self healing raid*
  • Multimaster as well as master-slave replication
  • Automatic storage tiering based upon usage, per chunk

Self healing raid

Let me explain what I mean by ‘self healing’ raid. A traditional file system will store any chunk of data it is asked to store. Since lessfs only stores unique data chunks, losing or corrupting a single chunk can be catastrophic. Lessfs uses strong hashes to identify each chunk of data, and with these hashes it can also detect data corruption. To actually repair the corruption, however, it needs to add some redundancy to the data. Traditional raid systems are no longer safe when large amounts of data are stored on today’s high-capacity SATA drives: although we still rely on them, data corruption will occur and remain unnoticed. Lessfs will therefore support redundancy mechanisms that allow the user to define how many copies of a data chunk have to be stored. Research indicates that keeping 3 copies of your data is needed to be safe. In any case, lessfs will allow you to decide if and how much parity is added to the data, and when possible Lessfs2 will repair corrupted chunks.
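
To make the idea concrete, here is a rough sketch of what such a read path could look like. This is not lessfs code: the chunk_read_copy/chunk_write_copy helpers are made up for illustration, and SHA-256 merely stands in for whatever strong hash identifies the chunks.

    /* Sketch of hash-based self healing. Every chunk is keyed by a strong
     * hash of its contents, so corruption is detected by rehashing on read
     * and comparing against the key. All chunk_* helpers are hypothetical. */
    #include <openssl/sha.h>
    #include <string.h>
    #include <sys/types.h>

    #define NCOPIES 3   /* user-configured number of stored copies */

    /* Hypothetical store accessors for copy 'n' of the chunk keyed by 'hash'. */
    extern ssize_t chunk_read_copy(const unsigned char *hash, int n,
                                   unsigned char *buf, size_t len);
    extern int chunk_write_copy(const unsigned char *hash, int n,
                                const unsigned char *buf, size_t len);

    /* Return 0 with a verified chunk in 'buf', repairing any copies that
     * were observed to be corrupt on the way; -1 if no copy is intact. */
    int read_and_heal(const unsigned char *hash, unsigned char *buf, size_t len)
    {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        int bad[NCOPIES] = { 0 };
        int good = -1;

        for (int n = 0; n < NCOPIES && good < 0; n++) {
            if (chunk_read_copy(hash, n, buf, len) < 0) {
                bad[n] = 1;                  /* copy missing or unreadable */
                continue;
            }
            SHA256(buf, len, digest);        /* rehash what was actually read */
            if (memcmp(digest, hash, SHA256_DIGEST_LENGTH) == 0)
                good = n;                    /* intact copy found */
            else
                bad[n] = 1;                  /* silent corruption detected */
        }
        if (good < 0)
            return -1;                       /* every copy corrupt: unrecoverable */

        for (int n = 0; n < good; n++)       /* the self healing part */
            if (bad[n])
                chunk_write_copy(hash, n, buf, len);
        return 0;
    }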

Lessfs2 will also support additional backends next to tokyocabinet. At least support for hamsterdb will be added; BerkeleyDB support will be added when there is enough demand for it.

Current development status

The lessfs2 project has been created on SourceForge. I have uploaded a lessfs2 pre-alpha release, which is in essence a port of lessfs to the low-level FUSE API. People who are interested in metadata performance can test it. This code does not yet pass all the POSIX tests, but it already does everything you would expect a filesystem to do. If you decide to test it, please select the file_io backend; otherwise it will not work. The metadata structures will change considerably in the near future, which is needed to implement snapshot support and other new features.
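
For reference, in the 1.x series the backend is selected in /etc/lessfs.cfg, and I expect the pre-alpha to follow the same convention; the option names below are taken from the 1.x config, so treat them as an assumption that may differ in the 2.x tree.

    # /etc/lessfs.cfg (excerpt) - option names are the 1.x ones and may
    # differ in the 2.x pre-alpha; the point is to select file_io.
    BLOCKDATA_IO_TYPE = file_io
    BLOCKDATA_PATH = /data/dta/blockdata.dta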


15 Responses to Lessfs2 development has started

  1. Alex says:

    Hi Mark, nice to read news from you!

    I will of course follow this new version and try it as soon as the first “stable” release is out! It’s nice to see that lessfs is still growing, as it’s a super scalable and easy-to-set-up system.

    Happy new year btw :D

  2. Hubert Kario says:

    >Research indicates that keeping 3 copies of your data is needed to be safe.

    Could I ask for source on this?

  3. Unak says:

    Fast inode cloning – what is it?

    • maru says:

      You will be able to make an instant copy of a file. This is relatively easy, since it only requires a copy of some metadata and the inode structure. The actual copy of the metadata can be done while the file is already in use: updates are done on the new (cloned) inode, while reads are done from the original inode. So cloning will appear to be instant from a user perspective.
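
      Roughly, the idea looks like this. It is a sketch only: the types and helpers below are made up and not the actual lessfs structures.

        #include <stdbool.h>
        #include <stdint.h>

        struct inode {
            uint64_t ino;
            struct inode *clone_src;  /* original we may still read from;
                                       * cleared once the async copy is done */
            /* ... size, mode, per-block hash list, etc. ... */
        };

        extern struct inode *inode_alloc(void);
        extern void meta_copy_async(struct inode *dst, struct inode *src);
        extern bool block_present(struct inode *ino, uint64_t blk);
        extern int  block_read(struct inode *ino, uint64_t blk, void *buf);

        /* "Instant" clone: allocate a fresh inode and copy the per-block
         * metadata in the background; no chunk data is copied at all. */
        struct inode *inode_clone(struct inode *src)
        {
            struct inode *dst = inode_alloc();
            dst->clone_src = src;
            meta_copy_async(dst, src);   /* copies hashes, not data */
            return dst;                  /* usable immediately */
        }

        /* A block that was already copied or updated lives on the clone;
         * everything else is still served from the original inode. */
        int clone_read(struct inode *ino, uint64_t blk, void *buf)
        {
            if (ino->clone_src && !block_present(ino, blk))
                return block_read(ino->clone_src, blk, buf);
            return block_read(ino, blk, buf);
        }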

  4. cw says:

    I set 2 up using the same process as I’ve done for lessfs1 on 2 other systems, and from the init script I’m getting:
    $Starting lessfs: User defined signal 1

    and in syslog:
    lessfs[522]: fopen (null) failed.

    All the files seem to be in the right places and the mountpoint exists. Do I need to do something different? How can I make it give a better error?

  5. Morten says:

    Hi Mark

    As always, nice work.

    Any chance of supporting ACLs on lessfs?

  6. SR says:

    I know that the collision probability is very low (I’ve seen the COLLISION.probability file with the math). But would it be possible to implement something similar to the ZFS option dedup=verify in the new version? With this option ZFS verifies blocks byte by byte after a hash match to see if they are really identical.

    • maru says:

      Yes, this is on the list because more than a few people asked for it. Did you ever try the dedup=verify option with ZFS? What I have learned so far is that the performance penalty is severe: the fact that you can use a lighter hash does not compensate for the extra IO.

      • SR says:

        No, I haven’t tried it myself, but I think the performance penalty won’t be that bad. Extra reads will occur only when a block with the same hash is found. Performance is also not that important for a file system that is mostly used for backups, and the feature would be optional.
        To gain more performance, all checks can be performed when there are no other IO operations. For example: you write the data, check the hashes, see the match with other hashes, and write it to the journal. When the write is finished you start comparing the data against the blocks with the same hashes. If other IO operations come in, you pause the check, service the writes, and resume the check afterwards. That way the user won’t see much performance loss.
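
        A sketch of how that deferred check could look; everything here is made up for illustration and is not lessfs code:

          #include <pthread.h>
          #include <stdlib.h>
          #include <string.h>
          #include <sys/types.h>
          #include <unistd.h>

          #define CHUNK_MAX (128 * 1024)  /* assumed maximum block size */

          struct verify_job {             /* one deferred comparison */
              unsigned char hash[32];
              unsigned char *data;        /* copy of the incoming block */
              size_t len;
              struct verify_job *next;
          };

          static struct verify_job *queue;
          static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

          extern int store_is_idle(void); /* no foreground IO running */
          extern ssize_t chunk_read(const unsigned char *hash,
                                    unsigned char *buf, size_t len);
          extern void handle_collision(struct verify_job *job);

          /* Write path: the incoming block's hash already exists, so park
           * a copy of the block for a later byte-by-byte comparison. */
          void defer_verify(const unsigned char *hash,
                            const unsigned char *data, size_t len)
          {
              struct verify_job *job = malloc(sizeof *job);
              memcpy(job->hash, hash, sizeof job->hash);
              job->data = malloc(len);
              memcpy(job->data, data, len);
              job->len = len;
              pthread_mutex_lock(&qlock);
              job->next = queue;
              queue = job;
              pthread_cond_signal(&qcond);
              pthread_mutex_unlock(&qlock);
          }

          /* Background thread: compare queued blocks against what is on
           * disk, but only while the store is otherwise idle. */
          void *verify_worker(void *arg)
          {
              unsigned char *buf = malloc(CHUNK_MAX);
              for (;;) {
                  pthread_mutex_lock(&qlock);
                  while (queue == NULL)
                      pthread_cond_wait(&qcond, &qlock);
                  struct verify_job *job = queue;
                  queue = job->next;
                  pthread_mutex_unlock(&qlock);

                  while (!store_is_idle())  /* pause while real IO runs */
                      usleep(10000);

                  if (chunk_read(job->hash, buf, job->len) == (ssize_t)job->len
                      && memcmp(buf, job->data, job->len) != 0)
                      handle_collision(job); /* a genuine hash collision */
                  free(job->data);
                  free(job);
              }
              return arg;
          }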

  7. wvoice says:

    Mark,

    I thought TC support was pulled from 2.x. The pre-alpha still seems to require it. Will there be a 2.x release based on hamsterdb soon? Or is all the focus on 1.3.1?

    Anything I can do to help keep 2.x moving along?

    Thanks,

    Mike

    • maru says:

      TC support for storing the actual data has been pulled from 2.x;
      both hamsterdb and tc will be used for storing the metadata.

      The focus is not on 1.3.x, but I will support the 1.x series for some
      time to come, at least until 2.x is ready and stable.

      I will put a new 2.x development release on sourceforge in a short while.
      This new code has a big chunk of the snapshot support code implemented.

  8. Filbli says:

    I’d like to know about your experience moving from the FUSE high-level to the low-level API. Is it just a switch from pathnames to inodes, or is there more involved? I have a read-only FUSE filesystem and am considering switching it to the low-level API, but so far I only have the bare API docs to go by.

  9. naguz says:

    You might already have answered this in a previous post, but why FUSE? It is frowned upon by many (Linus included, unless he’s misquoted), and thus I tend to think “probably not without reason”. Since you have probably had people raise objections to FUSE before, what is your take on it?
