Announcing libradosfs

At CERN the disk storage is managed by a system built in-house called EOS (developed by CERN’s IT Data & Storage Services group). EOS manages data from the LHC experiments and also from “regular” users and power applications such as CERNBox, a file storage/sharing service for CERN users.

EOS uses an in-memory namespace. This allows it to achieve very high metadata performance but has the downside that when it is started, the namespace data needs to be read entirely from disk into memory. With the amount of data we have in EOS at CERN (~40PB and ~400 million files), this feature naturally affects the scalability of the system, which is taking around 30 minutes to boot at the moment.

Introducing libradosfs

A good part of my job here has been to create a new storage backend that should improve this. So, for a while now, I have been working on a new filesystem library that scales horizontally and keeps the namespace itself stored on disk.
This library is based on the popular Ceph storage system and uses its RADOS (a distributed object store) client API. For lack of a more creative name, we have called it libradosfs.
The long term goal for libradosfs, as hinted in the paragraphs above, is to replace EOS’s backend and namespace implementation, leveraging also all the nice features of Ceph (replication, erasure coding, etc.).

If you are familiar with Ceph, you are at least aware of CephFS. So why aren’t we using CephFS instead? It is a fair question.
We started by implementing a small backend using CephFS and performed some tests. However, CephFS relies on metadata servers (MDS) and, since the code for using multiple MDS instances is not yet stable, we would have had to use a single MDS, which represented a scalability issue for the amount of data and users we have (along with other limitations like having only one metadata pool per filesystem). Besides that, and equally important, by using the RADOS API directly we have more freedom to tackle issues that affect our use-cases and more flexibility than what a POSIX-compliant system offers.

Thus, libradosfs should not be mistaken for CephFS. If you want a more generic, fast and POSIX-compliant filesystem implementation on top of Ceph, you should use CephFS. The good folks at Inktank (now Red Hat) are actively making it better and, who knows, maybe in the future CERN might use it too.
If you somehow have use-cases that are closer to CERN’s in terms of scaling horizontally, and especially if you need a simpler code base that you can quickly tweak, you may want to look into libradosfs.

So, to summarize its definition: libradosfs is a client-side implementation of a filesystem using RADOS, designed to provide a scale-out namespace and optimized directory/file access without strict POSIX semantics.

The implementation has been refined quite a few times always with the purpose of offering a more scalable and flexible design.
Writing about the implementation of the whole system would result in a very large post so I will just write about some parts of the core design below.

Implementation overview

libradosfs can associate any pools in Ceph with path prefixes, distinguishing between metadata and data pools. Metadata pools are used to store information about the directories so faster and replicated pools should be configured as metadata pools. Data pools store mostly file-related objects, so ideally erasure-coded pools should be used as data pools (for space efficiency).
Pools are associated with a prefix to offer flexibility in terms of cluster configuration: e.g. we can have testing pools associated with the prefix /test/ so that we can easily set which specific data should go into them.
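
To make the prefix association more concrete, here is a minimal Python sketch (not libradosfs code). The pool names and the `pool_for_path` helper are made up for illustration, and picking the longest matching prefix is an assumption about how overlapping prefixes would be resolved.

```python
# Hypothetical mapping of path prefixes to pool names; none of these names
# come from libradosfs itself.
POOLS_BY_PREFIX = {
    "/": "default-data",
    "/test/": "testing-data",
}


def pool_for_path(path, pools_by_prefix=POOLS_BY_PREFIX):
    """Return the pool whose prefix is the longest match for the given path."""
    matches = [prefix for prefix in pools_by_prefix if path.startswith(prefix)]
    if not matches:
        raise LookupError("no pool is associated with " + path)
    # Assumption: the most specific (longest) prefix wins.
    return pools_by_prefix[max(matches, key=len)]
```

With this mapping, anything under /test/ goes to the testing pool while every other path falls back to the pool associated with /.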

The namespace and hierarchy are stored directly in the object store: a directory is an object whose name is the full path (which points to an inode). For example, /a/b/c/ means there will be four objects named with the prefix paths: /, /a/, /a/b/ and /a/b/c/. This way, instead of having to traverse a directory / -> a/ -> b/ -> c/, we can directly stat the object using the full path.
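
The naming scheme can be sketched in a few lines of Python. The `directory_object_names` helper is hypothetical and only shows which object names a directory path implies; it does not talk to RADOS.

```python
def directory_object_names(path):
    """List the object names implied by a directory path, root first."""
    names = ["/"]
    current = "/"
    for component in path.strip("/").split("/"):
        if component:  # skip the empty component produced by the root path
            current += component + "/"
            names.append(current)
    return names
```

For /a/b/c/ this gives the four names mentioned above, and looking up the deepest directory only needs the last one.
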
Files are implemented in a different way. They are represented by an entry in their parent directory’s omap (the omap is a key-value map associated with objects in RADOS) where the key determines the file name, e.g. file.notes.txt.
Both directories and files have an associated object that acts like an inode and holds their contents. We find out about a directory’s inode from a specific entry in its path object’s omap (rfs.inode); when it comes to files, the value associated with their key will hold information such as the pool and the inode name. Inode objects in libradosfs use a UUID as their name rather than an index since this is more suitable for a distributed environment.

Directories’ entries are stored as an incremental log in the directories’ inode objects. For example, creating the files /doc.odt and /notes.txt and then deleting /doc.odt would create the following entries in the root directory’s log:

The reason we have a log like this is so that we can quickly keep track of a directory’s contents incrementally, that is, every time we list the directory we read its contents from the last byte we had read previously rather than having to reread all the entries. (Before you worry, there’s a directory log compaction mechanism in place as well.)
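
The incremental reading can be illustrated with a small Python sketch. The log format used here (one `+name` or `-name` line per change) is invented for the example and is not the format libradosfs writes; the log is also passed in as a whole `bytes` value, whereas a real client would only fetch the bytes after the stored offset.

```python
class DirectoryListing:
    """Keeps a directory's entries up to date by parsing only new log bytes."""

    def __init__(self):
        self.entries = set()
        self.offset = 0  # number of log bytes already processed

    def update(self, log):
        """Apply the part of the log (bytes) appended since the last call."""
        new_data = log[self.offset:]
        # Only consume complete lines; a partially written entry waits
        # until the next call.
        end = new_data.rfind(b"\n") + 1
        for line in new_data[:end].decode("utf-8").splitlines():
            if line.startswith("+"):
                self.entries.add(line[1:])
            elif line.startswith("-"):
                self.entries.discard(line[1:])
        self.offset += end
        return sorted(self.entries)
```

Each call to `update` parses only what was appended since the previous one, which is the property that makes repeated listings of a large directory cheap.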

Files are written in chunks of a user-defined size for better IO and we also support inline files (for small files whose inode creation might not be worth it).
The files’ API offers synchronous and asynchronous write methods which use timed locks: instead of locking and unlocking the file every time it is written, it locks the file for a while once and renews the lock when needed. Writing becomes faster this way, since we skip having to lock/unlock the file on every write call. To prevent locking the file for too long, files are unlocked if they have been idle (not claimed for a while).
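
Here is a rough Python sketch of the timed-lock idea, under several simplifying assumptions: the `TimedLockWriter` class is hypothetical, the lock is only simulated with a counter, and letting the lock expire stands in for unlocking an idle file. It is meant to show why fewer lock operations are needed, not how libradosfs implements them.

```python
import time


class TimedLockWriter:
    """Counts how often a lock would be taken when writes share a timed lock."""

    def __init__(self, lock_duration=3.0, clock=time.monotonic):
        self.lock_duration = lock_duration
        self.clock = clock
        self.locked_until = None  # expiry time of the lock we hold, if any
        self.lock_operations = 0  # how many times the lock was (re)acquired
        self.data = []

    def _ensure_locked(self):
        now = self.clock()
        if self.locked_until is None or now >= self.locked_until:
            # Stand-in for a lock request with a timeout sent to the object store.
            self.lock_operations += 1
            self.locked_until = now + self.lock_duration

    def write(self, chunk):
        self._ensure_locked()
        self.data.append(chunk)
```

A burst of writes inside the lock duration costs a single lock operation; only a write that arrives after the lock has expired pays for a new one.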

Filesystem objects have UNIX file permissions associated with them. There is also a way to change the user/group id at any point in the filesystem so the actions performed on it are as if that user/group had done them. This means that permissions are verified when e.g. accessing a file, but the library does not have access control to restrict who really uses it. That is, the system uses the user/group id set on it but does not restrict who sets it. That is beyond the scope of the library as this can be achieved by having the library accessed via a gateway: for example EOS or XRootD check the permissions on their own already but can use libradosfs for the storage backend.

Among others, the API provides a parallel stat method which can be used to get information about a number of paths in the system. From the way files are designed, statting files that belong to the same directory is much faster when performed together than individually (because we have to read the omap multiple times in the latter). Thus in a use-case like listing a directory, using the Filesystem::stat with a vector of paths is much faster than looping through the entries, statting one by one (the stat method groups the paths per parent directory by itself, no need for a more specialized API).
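
The grouping step can be sketched as follows in Python. The `group_by_parent` helper is hypothetical and only illustrates the idea of batching paths per parent directory so that each directory's omap needs to be read once; it is not the library's implementation.

```python
import posixpath
from collections import defaultdict


def group_by_parent(paths):
    """Map each parent directory to the entry names requested inside it."""
    groups = defaultdict(list)
    for path in paths:
        parent, name = posixpath.split(path.rstrip("/"))
        groups[parent].append(name)
    return dict(groups)
```

Statting /a/x.txt, /a/y.txt and /b/z.txt then needs two directory lookups instead of three, and the saving grows with the number of files per directory.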

There is a find method similar to UNIX find which looks for files/directories in a fast, parallel way, based on information like the size, name (with regex), attributes, etc.

Besides being able to set metadata in the form of extended attributes in directories or files, directories also support metadata about their entries, for information that is more intimate to this relation and which should not be set in the files directly. This metadata is also stored in the log, just like the entries themselves.

We also implemented a quota system as this is a common use case when managing a large storage system. The library does not enforce the quotas because the effect of exceeding the quota varies in definition and in the actions that should be taken; instead it includes an API to assign directories, users and groups to Quota objects, and to control the values associated with those. This feature is dependent on Ceph’s numops CLS which simplifies atomic arithmetic operations and we chose to include it in libradosfs so the quota system can be kept close to the filesystem.

Besides the features mentioned above, which are integrated in the library, there is also a FUSE client being developed at the moment.


libradosfs ships with a filesystem checker tool (libradosfs-fsck) and a benchmark (libradosfs-bench) for calculating the performance on a given cluster.

Performance tests

Although we cannot yet provide very detailed performance test results, I gathered some figures using libradosfs’s benchmark tool. The benchmark consists of counting how many files it can create and write synchronously in a given period. In this case, each benchmark ran for 5 minutes from my development machine (an i3 3.3 GHz dual core machine, 8 GB RAM running Fedora 22), using a cluster of 8 machines with 128 OSDs in total running the Ceph hammer release v0.94.3. The network connection is 1 Gbps and this also impacts the tests, as you can notice for the larger files in the table.

File size               Avg files/sec
0 (touching a file)     47.66
1KB (inline file)       31.17
1MB (inline + inode)    8.18
500MB (inline + inode)  0.21
1GB (inline + inode)    0.11

For additional information, the files were configured with an inline buffer of 1K so when writing more than 1K, both the inline buffer and the actual inode get written.
For reference, the number of possible synchronous write operations using RADOS with the above configuration is around 130 (but of course, creating/writing each file involves several IO operations in libradosfs).
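
To see how much the 1 Gbps link weighs on the larger files, the figures can be converted into a fraction of the link capacity. This is a back-of-the-envelope sketch that assumes decimal units (1 MB = 10**6 bytes) and counts only the file payload, ignoring protocol overhead and metadata operations.

```python
# (file size in bytes, average files per second) taken from the table above,
# assuming 1 MB = 10**6 bytes and 1 GB = 10**9 bytes.
RESULTS = {
    "1KB": (10**3, 31.17),
    "1MB": (10**6, 8.18),
    "500MB": (500 * 10**6, 0.21),
    "1GB": (10**9, 0.11),
}

LINK_BYTES_PER_SEC = 10**9 / 8  # 1 Gbps expressed in bytes per second


def link_utilisation(size_bytes, files_per_sec):
    """Fraction of the 1 Gbps link used by the payload alone."""
    return size_bytes * files_per_sec / LINK_BYTES_PER_SEC


if __name__ == "__main__":
    for label, (size, rate) in RESULTS.items():
        print(label, round(link_utilisation(size, rate), 4))
```

Under these assumptions the 500MB and 1GB runs use roughly 84% and 88% of the link, while the 1MB run stays below 7%, so the small-file numbers are bound by the number of operations rather than by bandwidth.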

As I mentioned, these are early numbers and there is surely room for improvement, such as reducing the number of trips to the server. But although speed is always important, the most relevant aspect is that the numbers for more than 1 client should remain virtually the same (when each client writes in different directories).

Other Features

Here is a list summarizing some of the features we currently support:
* directory/file hierarchy
* multiple pool/path association
* direct inode creation and lazy association with a path
* quotas
* filesystem checker
* find method based on path, size and metadata/extended attributes
* inline files
* parallel statting
* symbolic links
* movable directories/files
* vector (parallel) read

Although libradosfs was designed with CERN’s use-case in mind, we tried to make it generic enough so that it may be useful for similar but non-High-Energy-Physics use-cases too. The library has been under development for a while now but it is not yet tested in production so you should not expect it to be a stable system at this point.

libradosfs is released under the LGPL license. In case you want to try it yourself or contribute to it, you can get the source and documentation here.

Two Weeks in Japan, Part 6 (The End): Tokyo vol. III

This article is part of the “Two Weeks in Japan” series and follows Two Weeks in Japan, Part 6: Tokyo vol. II

For our third day in Tokyo we visited Odaiba, an artificial island in Tokyo Bay. We got to the island early using the Yurikamome train, a fully automated train without drivers on board. I was surprised to see that the train does not run on rails but has wheels with tires instead (so maybe there is a better name for it than train)!

In Odaiba we went to the Sega Joypolis, a theme park by the well-known games company Sega. The park has many nice attractions like a half-pipe (with a series of snowboard-like devices/wagons where two people can stand at a time), many enhanced video games (for example a rollercoaster where we had to shoot zombies) and an experience inspired by a horror movie. This horror experience was based on the movie Sadako 3D (with a terrifying girl like the one in the Ring movie): we were taken to a dark room with a few computer screens that suddenly started malfunctioning! Our guide (speaking Japanese) tried to turn them off and had to unplug them to do so but they quickly turned on again (oooohhh)! Helena and I were very tired (it was evening already) so we were not really excited by the whole scary experience and this must have been so clear in our expressions that the guide switched to English and told us: Very dangerous! Very very dangerous! Run run!

Another thing that I enjoyed seeing was that the Joypolis had many interactive installations using methods similar to the ones I was working with at the time. Check out this live hair change to make queuing less boring:

Love Helena’s hair style!

The Joypolis is an indoor park and before we knew it, night had fallen so we just had dinner and went back to the hotel.

No fear, Gundam’s here!

The next day we headed to Yokohama, to visit the rāmen museum, and later to a town called Kamakura. I had seen the rāmen museum in some movies and I had three wrong assumptions about it: 1) it’s in Tokyo (it’s not, it is in Yokohama), 2) it’s an open air place (it’s in a basement floor) and 3) it’s a big place (it’s rather small). Still, I liked it: we learned a bit about rāmen, it is possible to try several types of it and learn which regions they come from, the atmosphere is good and there was a magic/juggling show going on.

Rāmen museum under a beautiful sky

We also took the chance to visit Yokohama’s Chinatown, with its complex colors and architecture contrasting with the Japanese ones. We entered a Chinese temple and I started taking pictures as I saw no sign forbidding it. Suddenly a lady started yelling at me, telling me to stop taking photos, and basically pushed me out of the temple… This also contrasted with the laid-back and respectful attitude of the guards in the Japanese temples who had asked me to stop taking photos before. In my defense, after almost two weeks in Japan, I was quite used to looking for signs forbidding photos but I hadn’t seen any when entering this Chinese temple because we entered through a side door and the sign was hung outside over the main door.

After Yokohama, we went to Kamakura, a small town whose main attraction is the giant Buddha statue (daibutsu), almost 14 meters tall. The statue is impressive and the place is quite peaceful so I definitely recommend it.

Giant buddha statue

Peaceful place

That night we went back to the nice rāmen place in Ikebukuro that I mentioned in the first volume of the Tokyo part (I wish I could remember the name of the place…), and got back to our hotel to get ready for our last day in Tokyo.

For our last day in Tokyo, we wanted to check out the fish market but we realized that it had strict rules for visitors and that we had to be there very early, so we decided to take it easy (we were tired from almost two weeks of moving around) and just went there a bit later in the morning. Naturally, the whole fish-selling frenzy was no longer taking place so we just bought a few things at some small stands there, like nori sheets (they were way cheaper there than back at home) and some beans which we thought were the nice tea we had tried in a few restaurants. When we got home and tried the tea, the beans were tinier than we remembered and they looked like coffee beans (the drink also tasted a bit like very weak coffee) so we just thought that we had bought some very bad coffee. The funny thing is that one year after the trip to Japan, we moved to the Geneva area (where I am currently working at CERN) and shortly after, a German PhD student named Christian started working in my office. He knows a lot about tea and Japan (even speaks the language) and he told me those beans were actually roasted barley tea (or mugi cha). Mystery solved!

After the fish market, we went to the nice district of Asakusa, to visit Sensō-ji, one of the most impressive temples in our trip. In there we did the traditional custom of giving a donation and getting a small note telling your fortune. About mine, I won’t disclose everything but among some nice things it said “Building a new horse and enlarging are both good”. No kidding!

A buddhist temple in Asakusa


Close to the temple, we went to a nice sushi restaurant, with a conveyor belt surrounding a chef making the sushi in real time. We had been to sushi restaurants with the conveyor belt in other places in Europe and they all followed a “flat-rate” model (pay one fixed price, eat all the sushi you want). In Japan, however, the common thing is that each dish has a color which indicates its price and in the end they sum it all up and you pay. Still, the sushi was great and not that expensive.
Once we finished visiting Asakusa, we headed to Tokyo University through a nice park with an impressive pond called Shinobazu, full of waterlilies, fish, ducks and turtles.

Shinobazu pond.

For the last night in Tokyo we changed again, to the last hotel of the trip (Narita U-City). This one was outside of Tokyo, close to the Keisei Narita Station, on the way to the airport: we figured that the next day it would be simpler and faster to get to the airport from there, and being away from Tokyo meant it was also cheaper.

The next day, we started our trip early and we saw a very long queue waiting for the KLM desks to open. So we waited, and waited, and waited and realized that our flight was delayed… We must have waited more than 3 hours in the queue, which never got smaller.
Eventually they started opening the desks but it was clear that they were very badly organized. Finally, after all that time waiting, when we got to the counter, the lady told us that she thought our backpacks were too big for hand luggage. I replied they were actually smaller than they looked and that we traveled all the time with them. She said that we had to try to fit them in the appropriate metal frame for the matter and we could go there, check and come back to tell her the verdict. I told her that the metal thingy was out of her sight and that maybe she wanted to come with us to check, otherwise she would have to just trust me and so far, she hadn’t wanted to do that. She looked at me with an expression of “Oh! So someone could actually lie to me!?” (Japanese people, always so innocent) and followed us. The bags fit and we got our tickets!
I was worried that the plane would be so delayed that we would miss the flight from Amsterdam to Barcelona, not only because of the hassle but also because, although the flight to Barcelona was part of our ticket, once we arrived in Barcelona we had a flight home to Coruña which had been bought independently. Luckily, once we landed in Amsterdam, some assistants were waiting for everyone who was going to Barcelona, as that plane had already boarded but was waiting for the people from Tokyo. We arrived in Barcelona in time but guess what, our flight to Coruña was also delayed (but at least we didn’t miss it).
All summed up, we had been traveling for more than 24 hours and needless to say, we were dead tired. I even remember waking up the next morning with the feeling of having slept really well like I had not in ages and thinking “it is good to be home”.

Photo of Helena and I

Finally I am finishing this series of articles. Almost three years have passed (!) since the trip took place and I started writing about it, but I have a good excuse as a lot of things happened. I have moved to a different country (and job) twice, and I am now a father, so far an incredibly rewarding, exciting and ongoing journey.

Until the next time!

Going to FOSDEM 2015

Tomorrow I am flying to Brussels to attend FOSDEM for the 8th time!
It is amazing to see how much the event grew in these 8 years and I am looking forward to having another great weekend of interesting presentations, meeting old friends and sipping tasty beer.

I need to thank CERN for making this trip possible and if you want to find out about my current project there (soon to be announced), do let me know.

See you in Brussels!


Happy 2015!

Helena and I have just come back from the holidays with our family in Portugal and I would like to tell you how 2014 was a very good and special year in our lives. The big event that will make us never forget this past year was of course the birth of our daughter Olivia. Everyone who has kids will tell you how nice it is to have them and they’re right! All the tiredness, stress and lack of sleep are forgotten when we see her smile every morning.

Helena and I love to travel and have made at least one big trip every year for a few years. With the baby, those trips have to be shorter but since she was born we’ve already been to Portugal (twice), Spain (visiting our old colleagues in A Coruña) and the U.K. (more particularly London). That thing that people tell you about how having a child changes one’s perspective on many things is also very true for traveling. Olivia can be very easily awakened by noises so now we realize how noisy some cars and motorcycles are… London was awful in that regard. The underground was noisy as hell, including the very loud voice warnings. Also, as a big European capital, I was expecting its public sites to be accessible but even in the emblematic Victoria Station there was no elevator to access the underground. The sad thing is that while the stroller is a temporary annoyance for us, people in wheelchairs have to cope with that permanently.
We’re very curious about visiting Berlin with the baby to check those annoyances in there (because I seem to remember the underground being more silent and accessible) so that’s a trip we might do this year.

The book count kept low this year: I read 3 books and started another one which I haven’t finished yet (REAMDE by Neal Stephenson).

Even with the lack of time due to the baby, thanks to my wonderful wife I still keep playing squash and attending the CERN Micro Club once a week. Despite the awkward name, this is one of CERN’s many clubs and is concerned with technology, having several sections. I am part of the Robotics section in the club where we’ve been building a 3D printer.
This means less time for my side projects so this year, unfortunately, there were only a couple of versions of OCRFeeder and no new version of Skeltrack.
I think that the only solution for OCRFeeder is to eventually have new contributors if there is an interest in keeping the project alive.

About Skeltrack, although its development was stalled during most of 2014, my friend Iago keeps improving it for his Master’s Thesis, and I had a lot of emails from people who are using it. I even visited one of them at EPFL who is using the project with his own 3D cameras which means that having a device-agnostic library was a good decision.

As for my job at CERN, I am finishing what I have been working on so I hope to talk about it in more detail soon.

This year’s donations went mainly to the following places (apart from the EFF, to whom I donate every time I buy a Humble Bundle):
Wikipedia: don’t think I need to explain its relevance;
GNOME Builder: because this great guy was bold enough to quit his job in order to make an awesome and long needed IDE for GNOME (maybe you can still donate!);
Chão dos Meninos: an association from my hometown that helps children at risk. I always used to donate to big international projects such as Wikipedia and EFF but last year I realized that, since I don’t pay taxes in my country (because I live abroad), one way of contributing a small bit is to donate to an association such as this one.

I still do not know what 2015 will bring but I do hope that the tendency continues and it turns out to be a great year again!

Have a great 2015 everyone!

OCRFeeder 0.8.1

Taking advantage of the holidays, I have been dedicating some time to my side projects so today I am giving you OCRFeeder version 0.8.1!

The last OCRFeeder version had a very important change which was the port to GObject introspection and I was already expecting a few bugs to pop up here and there. That proved to be true and so this version is mainly about bug fixing.
Specifically there was an issue related to GDK’s threads which caused the application to abort. Besides that, exporting a document or saving/loading a project was not working correctly due to unicode issues (because Python is very nice but working with unicode is sometimes more annoying than it should be, at least in versions prior to Python 3).
Anyway, all that should be working correctly now!

Besides squashing bugs, I also made some long overdue changes: made the Preferences dialog smaller (by adding its contents to a scrolled window) and migrated the application and engines’ settings to the XDG user configuration folder as opposed to .ocrfeeder.
Yes, I know that I should be using GSettings for the application’s settings by now but there were more critical changes to be done.
Besides a small change in the widgets that set a box’s type (from a radio button style to a non-indicator, grouped pair of buttons), there are no other UI changes but I really like how much more polished OCRFeeder seems with the nice recent GTK+ styles.



I have a number of ideas to make the application better not only in terms of UI/UX but also in terms of features. The detection algorithm hasn’t been touched for years and I am sure it can be improved not only in terms of performance but also in terms of accuracy.
One cool feature I’d love to see implemented is to have a quick way of translating a document’s contents. This would be helpful e.g. to users living abroad who might need to translate letters to a language they speak.
Nonetheless, as mentioned in my previous post about OCRFeeder, it is indeed not easy to find the time and motivation to dedicate to the project these days with all the work, life and other side projects so I don’t know when I will have time for it again. In that regard, if you want to give me a hand, you’d make me very happy as there is a lot of work to be done.

Happy holidays everyone!

Source tarball