Announcing libradosfs

At CERN, disk storage is managed by a system built in-house called EOS (developed by CERN's IT Data & Storage Services group). EOS manages data from the LHC experiments and from “regular” users, and powers applications such as CERNBox, a file storage/sharing service for CERN users.

EOS uses an in-memory namespace. This allows it to achieve very high metadata performance, but has the downside that, when it is started, the namespace data needs to be read entirely from disk into memory. With the amount of data we have in EOS at CERN (~40 PB and ~400 million files), this design naturally limits the scalability of the system, which currently takes around 30 minutes to boot.

Introducing libradosfs

A good part of my job here has been to create a new storage backend that should improve this. So, for a while now, I have been working on a new filesystem library that scales horizontally and keeps the namespace itself stored on disk.
This library is built on the popular Ceph storage system and uses its RADOS (a distributed object store) client API. For lack of a more creative name, we have called it libradosfs.
The long-term goal for libradosfs, as hinted in the paragraphs above, is to replace EOS's backend and namespace implementation, also leveraging all the nice features of Ceph (replication, erasure coding, etc.).

If you are familiar with Ceph, you are at least aware of CephFS. So why aren’t we using CephFS instead? It is a fair question.
We started by implementing a small backend using CephFS and ran some tests, but CephFS relies on metadata servers (MDS) and, since the code for running multiple MDS instances is not yet stable, using a single MDS represented a scalability issue for the amount of data and users we have (along with other limitations, such as having only one metadata pool per filesystem). Besides that, and equally important, by using the RADOS API directly we have more freedom to tackle issues that affect our use cases, and more flexibility than a POSIX-compliant system offers.

Thus, libradosfs should not be mistaken for CephFS. If you want a more generic, fast and POSIX-compliant filesystem implementation on top of Ceph, you should use CephFS. The good folks at Inktank (now Red Hat) are actively making it better and, who knows, maybe in the future CERN might use it too.
If your use cases are closer to CERN's in terms of scaling horizontally, and especially if you need a simpler code base that you can quickly tweak, you may want to look into libradosfs.

So, to summarize its definition: libradosfs is a client-side implementation of a filesystem using RADOS, designed to provide a scale-out namespace and optimized directory/file access without strict POSIX semantics.

The implementation has been refined quite a few times, always with the aim of a more scalable and flexible design.
Writing about the implementation of the whole system would result in a very long post, so I will only cover some parts of the core design below.

Implementation overview

libradosfs can associate any pools in Ceph with path prefixes, distinguishing between metadata and data pools. Metadata pools store information about directories, so faster, replicated pools should be configured as metadata pools. Data pools store mostly file-related objects, so ideally erasure-coded pools should be used as data pools (for space efficiency).
Pools are associated with a prefix to offer flexibility in terms of cluster configuration: for example, we can have testing pools associated with the prefix /test/ so that we can easily control which data goes into them.
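
As a rough usage sketch, associating pools with a prefix could look like the following. This is a hypothetical example: the header name, the radosfs namespace and the init/addMetadataPool/addDataPool calls are assumptions based on the description above, not verified excerpts of the libradosfs API.

```cpp
// Hypothetical sketch only: header name, namespace and method signatures are
// assumed from the description above, not taken verbatim from libradosfs.
#include <libradosfs.hh>  // assumed header name

int main() {
  radosfs::Filesystem fs;                   // assumed class/namespace
  fs.init("", "/etc/ceph/ceph.conf");       // connect using the cluster's config

  // Fast, replicated pool for the directory metadata under the /test/ prefix.
  fs.addMetadataPool("test-meta-pool", "/test/");

  // Erasure-coded pool for the file data stored under the same prefix.
  fs.addDataPool("test-data-pool", "/test/");

  return 0;
}
```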

The namespace and hierarchy are stored directly in the object store: a directory is an object whose name is its full path (and which points to an inode). For example, /a/b/c/ means there will be four objects named after the prefix paths: /, /a/, /a/b/ and /a/b/c/. This way, instead of having to traverse the directory tree / -> a/ -> b/ -> c/, we can stat the object directly using its full path.
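
To illustrate the gain with plain librados (not the libradosfs API): statting the object named by the full path is a single lookup, no matter how deep the directory is. The pool name below is made up.

```cpp
// Plain librados illustration: a directory object named by its full path can
// be stat'ed directly, with no need to walk /, /a/ and /a/b/ first.
#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init2("client.admin", "ceph", 0);      // assumes a local ceph.conf/keyring
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0)
    return 1;

  librados::IoCtx ioctx;
  cluster.ioctx_create("metadata-pool", ioctx);  // hypothetical pool name

  uint64_t size = 0;
  time_t mtime = 0;
  int ret = ioctx.stat("/a/b/c/", &size, &mtime);  // one lookup by full path
  std::cout << "stat /a/b/c/ -> " << ret << std::endl;

  ioctx.close();
  cluster.shutdown();
  return 0;
}
```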
Files are implemented in a different way. They are represented by an entry in their parent directory's omap (the omap is a key-value map associated with objects in RADOS), where the key encodes the file name, e.g. file.notes.txt.
Both directories and files have an associated object that acts like an inode and holds their contents. We find a directory's inode through a specific entry in its path object's omap (rfs.inode); for files, the value associated with their key holds information such as the pool and the inode name. Inode objects in libradosfs use a UUID as their name rather than an index, since this is more suitable for a distributed environment.
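
Again illustrating with plain librados rather than libradosfs code: reading the rfs.inode key and a file entry from a directory object's omap could look like this. The value formats are internal to libradosfs and are not decoded here.

```cpp
// Plain librados illustration: fetch the directory's inode pointer (rfs.inode)
// and a file entry (file.notes.txt) from the path object's omap in one call.
#include <rados/librados.hpp>
#include <iostream>
#include <map>
#include <set>
#include <string>

// Assumes `ioctx` is an open librados::IoCtx on the metadata pool.
void lookup_dir_and_file(librados::IoCtx &ioctx, const std::string &dirPath) {
  std::set<std::string> keys = {"rfs.inode", "file.notes.txt"};
  std::map<std::string, librados::bufferlist> vals;
  int prval = 0;

  librados::ObjectReadOperation op;
  op.omap_get_vals_by_keys(keys, &vals, &prval);
  ioctx.operate(dirPath, &op, nullptr);  // dirPath, e.g. "/a/b/", is the object name

  for (const auto &kv : vals)
    std::cout << kv.first << " -> " << kv.second.length() << " bytes" << std::endl;
}
```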

Directory entries are stored as an incremental log in the directory's inode object. For example, creating the files /doc.odt and /notes.txt and then deleting /doc.odt would create the following entries in the root directory's log:
+name=’doc.odt’
+name=’notes.txt’
-name=’doc.odt’

The reason we have a log like this is so that we can keep track of a directory's contents incrementally: every time we list the directory, we read its contents from the last byte we had read previously, rather than having to reread all the entries. (Before you worry: there is also a directory log compaction mechanism in place.)
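
The incremental listing boils down to remembering the offset of the last byte read from the log and reading only the tail next time. A simplified sketch with plain librados (not the actual libradosfs code):

```cpp
// Plain librados sketch of incremental directory listing: read only the bytes
// appended to the directory's entry log since the previous listing.
#include <rados/librados.hpp>
#include <string>

// Assumes `ioctx` is an open IoCtx on the pool holding the directory's inode.
std::string read_new_log_entries(librados::IoCtx &ioctx,
                                 const std::string &dirInode,
                                 uint64_t &lastOffset) {
  uint64_t size = 0;
  time_t mtime = 0;
  if (ioctx.stat(dirInode, &size, &mtime) < 0 || size <= lastOffset)
    return "";  // nothing new since the last read

  librados::bufferlist bl;
  ioctx.read(dirInode, bl, size - lastOffset, lastOffset);
  lastOffset = size;  // remember where we stopped for the next listing
  return std::string(bl.c_str(), bl.length());
}
```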

Files are written in chunks of a user-defined size for better IO, and we also support inline files (for small files for which creating an inode might not be worth it).
The file API offers synchronous and asynchronous write methods which use timed locks: instead of locking and unlocking the file every time it is written, the file is locked once for a period and the lock is renewed when needed. Writing becomes faster this way, since we skip having to lock/unlock the file on every write call. To prevent a file from staying locked for too long, it is unlocked once it has been idle (not claimed for a while).
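
The timed-lock idea can be sketched with librados' advisory locks: take an exclusive lock with a duration once, write the chunks, and only renew when needed, instead of locking and unlocking around every single write. This is a sketch of the pattern, not libradosfs's actual locking code; the lock name and cookie are made up.

```cpp
// Sketch of the timed-lock write pattern using librados advisory locks.
#include <rados/librados.hpp>
#include <sys/time.h>
#include <string>
#include <vector>

void write_chunks(librados::IoCtx &ioctx, const std::string &fileInode,
                  const std::vector<librados::bufferlist> &chunks,
                  size_t chunkSize) {
  struct timeval duration = {30, 0};  // hold the lock for 30 seconds at a time
  ioctx.lock_exclusive(fileInode, "rfs.write", "client-cookie", "", &duration, 0);

  uint64_t offset = 0;
  for (const auto &chunk : chunks) {
    librados::bufferlist bl = chunk;
    ioctx.write(fileInode, bl, bl.length(), offset);  // write one chunk
    offset += chunkSize;
    // A real implementation would renew the lock here if it is about to
    // expire, and an idle timeout would release it when writes stop.
  }

  ioctx.unlock(fileInode, "rfs.write", "client-cookie");
}
```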

Filesystem objects have UNIX file permissions associated with them. There is also a way to change the user/group id at any point in the filesystem, so that actions are performed as if that user/group had done them. This means that permissions are verified when, for example, accessing a file, but the library does not have access control to restrict who really uses it: the system uses the user/group id set on it but does not restrict who sets it. That is beyond the scope of the library, as it can be achieved by having the library accessed via a gateway: for example, EOS or XRootD already check permissions on their own and can use libradosfs as the storage backend.

Among others, the API provides a parallel stat method which can be used to get information about a number of paths at once. Given how files are designed, statting files that belong to the same directory is much faster when done together than individually (because in the latter case we have to read the same omap multiple times). Thus, in a use case like listing a directory, calling Filesystem::stat with a vector of paths is much faster than looping through the entries and statting them one by one (the stat method groups the paths per parent directory by itself, so no more specialized API is needed).
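
A usage sketch of the difference: only the idea of calling Filesystem::stat with a vector of paths comes from the text above; the exact signatures and the result type shown here are assumptions.

```cpp
// Hypothetical sketch: the batched Filesystem::stat is described in the text,
// but the signatures and result type below are assumed, not verified.
#include <libradosfs.hh>  // assumed header name
#include <sys/stat.h>
#include <string>
#include <utility>
#include <vector>

void stat_directory_entries(radosfs::Filesystem &fs,
                            const std::vector<std::string> &entries) {
  // Slower: one call (and one omap read) per entry of the same parent directory.
  for (const std::string &path : entries) {
    struct stat buf;
    fs.stat(path, &buf);  // assumed single-path overload
  }

  // Faster: one batched call; paths are grouped per parent directory internally.
  std::vector<std::pair<int, struct stat>> results;  // assumed result type
  fs.stat(entries, results);
}
```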

There is a find method, similar to UNIX find, which looks for files/directories in a fast, parallel way based on criteria such as size, name (with regex support), attributes, etc.

Besides supporting metadata in the form of extended attributes on directories and files, directories also support metadata about their entries, for information that belongs to this relationship and should not be set on the files directly. This metadata is stored in the log, just like the entries themselves.

We also implemented a quota system, as this is a common need when managing a large storage system. The library does not enforce quotas, because what exceeding a quota means, and which actions should be taken, varies; instead, it includes an API to assign directories, users and groups to Quota objects and to control the values associated with them. This feature depends on Ceph's numops CLS, which simplifies atomic arithmetic operations, and we chose to include it in libradosfs so the quota system can be kept close to the filesystem.
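
The atomic arithmetic behind the quota counters goes through a RADOS object-class call. The sketch below only shows the generic exec() mechanism; the class and method names ("numops", "add") and the input encoding are assumptions, not the documented cls_numops interface.

```cpp
// Mechanism sketch only: ioctx.exec() is the real librados entry point for
// object-class calls, but the method name and input encoding are assumptions.
#include <rados/librados.hpp>
#include <string>

int add_to_quota_counter(librados::IoCtx &ioctx, const std::string &quotaObj,
                         const std::string &key, int64_t delta) {
  librados::bufferlist in, out;
  in.append(key);                          // placeholder encoding
  in.append("=" + std::to_string(delta));  // the real format is defined by the CLS
  return ioctx.exec(quotaObj, "numops", "add", in, out);
}
```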

Besides the features mentioned above, which are part of the library, a FUSE client is also being developed at the moment.

Tools

libradosfs ships with a filesystem checker (libradosfs-fsck) and a benchmark tool (libradosfs-bench) for measuring performance on a given cluster.

Performance tests

Although we cannot yet provide very detailed performance test results, I gathered some figures using libradosfs's benchmark tool. The benchmark consists in counting how many files it can create and write synchronously in a given period. In this case, each benchmark ran for 5 minutes from my development machine (an i3 3.3 GHz dual-core machine with 8 GB RAM running Fedora 22), against a cluster of 8 machines with 128 OSDs in total running the Ceph Hammer release (v0.94.3). The network connection is 1 Gbps, which also affects the tests, as you can notice for the larger files in the table.

Update:

After the announcement, I spent some time benchmarking and checking how I could improve the benchmark results, essentially trying to reduce the round trips to the server.

In the process I found something odd: the faster cluster (using SSD disks, used for metadata and inline file operations) was actually giving me worse numbers than the slower cluster (HDD disks); specifically, the SSD cluster was twice as slow. After digging for a while, I realized that it was actually reading, not writing, that was slower and, more importantly, this was true not only for libradosfs but also for Ceph's own rados benchmark!

In the end, the cluster admin investigated it and found that the main reason was that TCMalloc was running with its default cache size of 16 MB. After changing this value to 256 MB, the SSD cluster is now faster than the HDD one (as it should be) and the libradosfs benchmark operations that mainly use this cluster are almost 4x faster than the old ones, as seen in the new table below:

| File size | Avg files/sec |
|---|---|
| 0 (touching a file) | 170.12 |
| 1 KB (inline file) | 80.43 |
| 1 MB (inline + inode) | 10.94 |
| 500 MB (inline + inode) | 0.22 |
| 1 GB (inline + inode) | 0.11 |

Old numbers for reference:

| File size | Avg files/sec |
|---|---|
| 0 (touching a file) | 47.66 |
| 1 KB (inline file) | 31.17 |
| 1 MB (inline + inode) | 8.18 |
| 500 MB (inline + inode) | 0.21 |
| 1 GB (inline + inode) | 0.11 |

For additional information, the files were configured with an inline buffer of 1 KB, so when writing more than 1 KB, both the inline buffer and the actual inode get written.
For reference, the number of synchronous write operations per second achievable with RADOS directly, using the above configuration, is around 130 (but of course, creating/writing each file involves several IO operations in libradosfs).

As I mentioned, these are early numbers and there is surely room for improvement, such as reducing the number of round trips to the server. But although speed is always important, the most relevant aspect is that the numbers should remain virtually the same with more than one client (when each client writes to different directories).

Other Features

Here is a list summarizing some of the features we currently support:
* directory/file hierarchy
* multiple pool/path association
* direct inode creation and lazy association with a path
* quotas
* filesystem checker
* find method based on path, size and metadata/extended attributes
* inline files
* parallel statting
* symbolic links
* movable directories/files
* vector (parallel) read

Although libradosfs was designed with CERN's use case in mind, we tried to make it generic enough that it may be useful for similar, non-High-Energy-Physics use cases too. The library has been under development for a while now, but it has not yet been tested in production, so you should not expect it to be a stable system at this point.

libradosfs is released under the LGPL license; if you want to try it yourself or contribute to it, you can get the source and documentation here.

What a year!

What a crazy year this was! In 2013 many important events happened in my life, making it a very busy year.
To start, I began the year looking for a new job after 4 years working for Igalia. This meant that I had to travel a lot and move (with Helena) away from the place that felt like home (the city of Coruña), saying goodbye to many good friends.

This search also took me to the U.S.A. for the first time, where I met a very interesting company and people. Since Helena and I didn't do our traditional travelling this year, going to San Francisco was definitely the most interesting trip of the year for me. I really want to visit it again some day together with Helena.
Then I ended up joining Red Hat, where I kept working with GNOME technologies — mainly on the Wacom related pieces — together with some of the best Open Source developers in the world. I also moved to Berlin, the city I am in love with, which meant fulfilling a dream we had for a few years. My dear friend Chris Kühl helped make this move smoother so I have to thank him here again.

After just a few months in Berlin, I received the positive result of an application to CERN that I had submitted before all this, and I had to make yet another decision. We decided to go for it, and we moved out of Berlin shortly after learning that we will become a family of 3 next year! Our little girl Olivia will be born next March and we cannot express how excited we are about it!

Life in this region is very different from Berlin’s (not bad, just different) but CERN is a very unique place and I am enjoying the experience.
Our arrival here was also easier because of Quim and his wife Ana Marta, a couple of friends from University who really couldn’t have helped us more. Together with our good friend Nacho, they are really “5 stars” as we say in Portuguese 🙂
I also need to mention my parents, who not only helped us move out of Spain but also drove all the way from Portugal to France to visit us and bring us our stuff.

Technically, I live in France, in a small town called St Genis Pouilly, close to CERN on the French side of the border, but it is really still the Geneva area. A curious thing about Geneva is that its largest foreign community is the Portuguese one. I hear more people speaking Portuguese at the supermarkets here than in the Algarve 🙂
One of the things I miss from Berlin is being able to easily ride a bike anywhere. Here it is dangerous (drivers are crazy and there are no bike lanes) and less convenient (Berlin is flat, here it isn't), but I found another physical activity to compensate a bit for my sedentary job: I started playing squash and I love it!

As a result of all these changes, my personal projects got a bit neglected. I released only one new version of Skeltrack and OCRFeeder (actually, I have a new version of OCRFeeder almost ready to ship) and did a couple of quick hacks with the Leap Motion Controller.
The number of books I read was also lower than ever this year. I read a couple of books by Cory Doctorow and a spy thriller called The Shanghai Factor.

Not all things in 2013 were as great as my words might indicate. My grandmother (to whom I was very close) passed away a month ago. It was a very sad event, but she lived a long life and had her family beside her in every moment.

For 2014, my biggest wish is that everything goes well with the baby and Helena. I will probably have to miss some of the Open Source events I usually attend, but I have a good excuse, right?
I hope it’ll be a quieter year than 2013 in terms of moving and that I can still dedicate time to my personal projects.

2013 was a year I will surely remember all my life. I am a lucky person to have had the opportunity of different experiences, to have friends in many places and to have my wife and family supporting me all the time.

I wish you all an excellent 2014!

Olivia in Helena's belly!

Interviewed by World of GNOME

World Of GNOME has interviewed me again, this time about Skeltrack, my role at Red Hat and Open Source at CERN.
If you would like to know more about those (there is even an animal shelter in the mix), check it out here.

Big Changes Again

My life has seen some big changes this year: a new job, moving to Berlin, etc. Well, the changes haven't stopped yet.

Before I applied to Red Hat, at the beginning of the year, I had applied for a position at CERN (if you don't know what CERN is, look it up; it is part of the reason you're addicted to the internet). CERN has two periods throughout the year when it accepts applications; I knew it wasn't easy to be accepted, and I had to move on with my life, so I was happily living in Berlin and working for Red Hat.
Turns out I was accepted and I had a decision to make.

On one hand, I was very happy to finally be living in Berlin, Helena was enrolled in an intensive German course (and doing great), we were living in the nice neighborhood of Prenzlauer Berg and, of course, I don't need to tell you how great Red Hat is if you consider yourself a Free Software developer. On the other hand, I knew an opportunity like this one at CERN would be hard to get again. So in the end I made the tough decision to leave Red Hat, and I have been working at CERN since last week. I am working on a project outside my comfort zone (and yes, it's Free Software), but that's part of the challenge.

I traded a unique company for a unique research center, and one of the cheapest, coolest cities in Europe for one of the most expensive areas in the world.

Regarding work and GNOME in particular, I will stay involved, even though my projects have been neglected with all the moving: I hope to finish the port of OCRFeeder to GI and to give some love to Skeltrack once I have the time (and the conditions: no internet at home yet…).

Even Bigger Changes

Oh, yeah, there’s something else I would like to share. I am in the first steps of what will surely be the biggest project I will ever develop: Helena and I are expecting a baby!
We found out just one week before moving out of Berlin. It certainly doesn't make all the moving easier, but we can't describe how happy we are with this news! It's curious that we had to live in the “babyboom neighborhood” of Berlin to have a baby ourselves… there must be something in the water! 🙂

And that is all for now, let’s see what the next months bring!

Wacom’s fresh button assignment and GUADEC

When it comes to assigning button functions for Wacom tablets in GNOME, the approach in the GNOME Control Center was the traditional tree view: one button label per row, letting the user choose the functionality but requiring them to mentally map the tablet's button layout to the names in the tree view.

Since we already have a help window, provided by GNOME Settings Daemon, which presents a tablet's button layout in a realistic, visual way, we decided to make it more powerful and let users assign the buttons directly from there! This way, setting up the buttons is faster and more intuitive. Here is a video showing these nice new changes:

Another change is that keyboard shortcuts are now captured by a new widget which also supports modifier-only shortcuts, meaning that Ctrl, Ctrl+Alt, Shift, etc. can now be easily assigned to buttons, allowing for more flexibility when mapping the tablet's buttons to applications' commands. As shown in the video, the old GtkTreeView was also replaced by a nicer GtkListBox (which also makes use of this shortcut-capture widget).

Going to GUADEC

That's right, for the fifth year now, I am going to GUADEC! Besides attending the conference, it will also be a good chance to have a beer with old friends and teammates from Red Hat whom I otherwise only interact with on IRC.

See you in Brno!