Endless challenges

It’s been a (very long) while since I posted anything here. This past year I even managed to skip my traditional New Year’s post, and a lot has happened, so a post is due!

The big news is that last November I finished my two-year term at CERN. I learned a lot there and it is truly a unique place in the world, but I finished what I had set out to do and it was a good time for a change. So I am proud to announce that I have joined Endless!

I started this new job more than a month ago but I couldn’t find the time to write about it before because we also moved to a new country… and found out how challenging it is to move with a baby! Last year we had finally started enjoying that Swiss/French region a lot more but, for a number of reasons, we decided to move back to Berlin! Still, looking back, Geneva is where our daughter was born and we left many great friends there, so it’ll always be a special place for us and we’ll always have a reason to visit it.


If you do not know about Endless, its mission is to provide computers to the other half of the world, the part that desperately needs access to technology and knowledge but doesn’t happen to be in the minds, hearts or plans of the big software corporations.

Endless Computer

I first came across Endless at the beginning of 2013 and ever since I’ve had a soft spot for what they’re doing, so I am thrilled to be part of it! So far it’s been really great to see things from the inside and witness the talent, passion and energy with which everyone carries out their work. I can tell you that the human values at Endless are not something that just sounds nice; they really are a core part of its mission.
My goal in technology has always been to use it to solve problems; to make things easier for users. The products Endless is developing, and the users it develops them for, fit that end perfectly.

Endless's users


Endless is accomplishing its mission using GNOME and many other Open Source technologies, and has gathered a great team on many fronts, from software to marketing to leadership. Changing the world is a lot of work, so if you want to help, Endless is hiring at this very moment! Take a look at these jobs if you want to apply. If you have any questions regarding the jobs or the company you can also drop me a line.

And that’s it for now and I hope this long period of neglecting my blog has finally come to an end 🙂

Announcing libradosfs

At CERN the disk storage is managed by a system built in-house called EOS (developed by CERN’s IT Data & Storage Services group). EOS manages data from the LHC experiments and also from “regular” users, and powers applications such as CERNBox, a file storage/sharing service for CERN users.

EOS uses an in-memory namespace. This allows it to achieve very high metadata performance but has the downside that when it is started, the namespace data needs to be read entirely from disk into memory. With the amount of data we have in EOS at CERN (~40PB and ~400 million files), this feature naturally affects the scalability of the system, which is taking around 30 minutes to boot at the moment.

Introducing libradosfs

A good part of my job here has been to create a new storage backend that should improve this. So, for a while now, I have been working on a new filesystem library that scales horizontally and keeps the namespace itself stored on disk.
This library is based on the popular Ceph storage system and uses its RADOS (a distributed object store) client API. For lack of a more creative name, we have called it libradosfs.
The long term goal for libradosfs, as hinted in the paragraphs above, is to replace EOS’s backend and namespace implementation, leveraging also all the nice features of Ceph (replication, erasure coding, etc.).

If you are familiar with Ceph, you are at least aware of CephFS. So why aren’t we using CephFS instead? It is a fair question.
We started by implementing a small backend using CephFS and performed some tests. However, CephFS relies on metadata servers (MDS) and, since the code for using multiple MDS instances is not yet stable, we would have had to use a single MDS, which represented a scalability issue for the amount of data and users we have (along with other limitations like having only one metadata pool per filesystem). Besides that, and equally important, by using the RADOS API directly we have more freedom to tackle issues that affect our use-cases and more flexibility than what a POSIX-compliant system offers.

Thus, libradosfs should not be mistaken for CephFS. If you want a more generic, fast and POSIX-compliant filesystem implementation on top of Ceph, you should use CephFS. The good folks at Inktank (now Red Hat) are actively making it better and, who knows, maybe in the future CERN might use it too.
If you somehow have use-cases that are closer to CERN’s in terms of scaling horizontally, and especially if you need a simpler code base that you can quickly tweak, you may want to look into libradosfs.

So, to summarize its definition: libradosfs is a client-side implementation of a filesystem using RADOS, designed to provide a scale-out namespace and optimized directory/file access without strict POSIX semantics.

The implementation has been refined quite a few times, always with the purpose of offering a more scalable and flexible design.
Writing about the implementation of the whole system would result in a very large post so I will just write about some parts of the core design below.

Implementation overview

libradosfs can associate any pools in Ceph with path prefixes, distinguishing between metadata and data pools. Metadata pools are used to store information about the directories so faster and replicated pools should be configured as metadata pools. Data pools store mostly file-related objects, so ideally erasure-coded pools should be used as data pools (for space efficiency).
Pools are associated with a prefix to offer flexibility in terms of cluster configuration: e.g. we can have testing pools associated with the prefix /test/ so that we can easily set which specific data should go into them.
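To make the prefix association more concrete, here is a minimal sketch in Python (used only for brevity). The `PoolMap` class, the pool names and the rule that the longest matching prefix wins are all assumptions made for illustration; they are not libradosfs's actual API.

```python
class PoolMap:
    """Hypothetical mapping from path prefixes to pool names."""

    def __init__(self):
        self._prefixes = {}

    def add(self, prefix, pool):
        # Normalize so that "/test" and "/test/" name the same prefix.
        if not prefix.endswith("/"):
            prefix += "/"
        self._prefixes[prefix] = pool

    def pool_for(self, path):
        # Assumption: when several prefixes match, the longest one wins.
        matches = [p for p in self._prefixes if path.startswith(p)]
        if not matches:
            raise LookupError(f"no pool configured for {path}")
        return self._prefixes[max(matches, key=len)]


# Illustrative configuration: an erasure-coded pool for everything,
# plus a dedicated pool for data under /test/.
data_pools = PoolMap()
data_pools.add("/", "ec-data")
data_pools.add("/test/", "test-data")
```

With this configuration, `data_pools.pool_for("/test/run1/file.root")` picks the testing pool, while any path outside /test/ falls back to the pool associated with /.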

The namespace and hierarchy are stored directly in the object store: a directory is an object whose name is the full path (which points to an inode). For example, /a/b/c/ means there will be four objects named after the prefix paths: /, /a/, /a/b/ and /a/b/c/. This way, instead of having to traverse the hierarchy / -> a/ -> b/ -> c/, we can directly stat the object using the full path.
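The naming scheme can be sketched in a few lines of Python. The function name is hypothetical and a plain dictionary stands in for the object store; the point is only that a lookup by full path is a single step rather than a walk down the tree.

```python
def directory_object_names(path):
    """Return the names of the directory objects implied by a directory path."""
    parts = [p for p in path.split("/") if p]
    names = ["/"]
    for part in parts:
        # Each level reuses the previous prefix and keeps the trailing slash.
        names.append(names[-1] + part + "/")
    return names


# A dictionary standing in for the object store (illustration only).
object_store = {name: {"kind": "directory"} for name in directory_object_names("/a/b/c/")}

# One direct lookup by full path, no traversal of /, /a/ and /a/b/.
entry = object_store["/a/b/c/"]
```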
Files are implemented in a different way. They are represented by an entry in their parent directory’s omap (the omap is a key-value map associated with objects in RADOS) where the key determines the file name, e.g. file.notes.txt.
Both directories and files have an associated object that acts like an inode, holding their contents. We find out about a directory’s inode from a specific entry in its path object’s omap (rfs.inode); when it comes to files, the value associated with their key will hold information such as the pool and the inode name. Inode objects in libradosfs use a UUID as their name rather than an index since this is more suitable for a distributed environment.

Directories’ entries are stored as an incremental log in the directories’ inode objects. For example, creating the files /doc.odt and /notes.txt and then deleting /doc.odt would create the following entries in the root directory’s log:

The reason we have a log like this is so that we can quickly keep track of a directory’s contents incrementally, that is, every time we list the directory we read its contents from the last byte we had read previously rather than having to reread all the entries. (Before you worry, there’s a directory log compaction mechanism in place as well.)
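As a rough illustration of the incremental listing idea, here is a small Python sketch. The log format used (one line per operation, `+name` for a creation and `-name` for a deletion) is invented for this example and is not the format libradosfs actually writes; `DirectoryView` is likewise a hypothetical name.

```python
class DirectoryView:
    """Client-side view of a directory, built by replaying its log incrementally."""

    def __init__(self):
        self.entries = set()
        self.offset = 0  # Number of log bytes already processed.

    def refresh(self, log):
        """Apply only the log bytes appended since the last refresh."""
        new_data = log[self.offset:]
        # Consume complete lines only; a partial trailing line waits for the next call.
        end = new_data.rfind(b"\n") + 1
        for line in new_data[:end].decode().splitlines():
            if not line:
                continue
            op, name = line[0], line[1:]
            if op == "+":
                self.entries.add(name)
            elif op == "-":
                self.entries.discard(name)
        self.offset += end
        return end  # Bytes read in this refresh.


view = DirectoryView()
log = b"+doc.odt\n+notes.txt\n"
first_read = view.refresh(log)    # Reads the whole log the first time.
log += b"-doc.odt\n"
second_read = view.refresh(log)   # Reads only the newly appended entry.
```

The second call only parses the bytes of the deletion entry, which is what keeps repeated listings of a large directory cheap.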

Files are written in chunks of a user-defined size for better IO and we also support inline files (for small files whose inode creation might not be worth it).
The files’ API offers synchronous and asynchronous write methods which use timed locks: instead of locking and unlocking the file every time it is written, it locks the file for a while once and renews the lock when needed. Writing becomes faster this way, since we skip having to lock/unlock the file on every write call. To prevent locking the file for too long, files are unlocked if they have been idle (not claimed for a while).
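The timed lock idea can be sketched as follows. This is a simplified, single-process model in Python: `TimedLock`, `FakeClock` and the counter standing in for the lock round trips are assumptions made for illustration, not the library's implementation.

```python
import time


class TimedLock:
    """Hypothetical lease-style lock: taken once, renewed only after it expires."""

    def __init__(self, duration, clock=time.monotonic):
        self.duration = duration
        self.clock = clock
        self.expires_at = None
        self.acquisitions = 0  # Counts the (expensive) lock round trips.

    def ensure_locked(self):
        """Call before each write; only talks to the cluster when needed."""
        now = self.clock()
        if self.expires_at is None or now >= self.expires_at:
            # Stand-in for the real lock or renew request.
            self.acquisitions += 1
            self.expires_at = now + self.duration

    def release_if_idle(self, last_write, idle_timeout):
        """Drop the lock if nothing has been written for idle_timeout seconds."""
        if self.expires_at is not None and self.clock() - last_write >= idle_timeout:
            self.expires_at = None
            return True
        return False


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
lock = TimedLock(duration=5.0, clock=clock)
for _ in range(10):
    lock.ensure_locked()  # Called before every write.
    clock.now += 1.0      # Pretend each write takes one second.
```

Ten writes end up costing two lock requests in this model instead of ten lock/unlock pairs.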

Filesystem objects have UNIX file permissions associated with them. There is also a way to change the user/group id at any point in the filesystem so the actions performed on it are as if that user/group had done them. This means that permissions are verified when e.g. accessing a file, but the library does not have access control to restrict who really uses it. That is, the system uses the user/group id set on it but does not restrict who sets it. That is beyond the scope of the library as this can be achieved by having the library accessed via a gateway: for example EOS or XRootD check the permissions on their own already but can use libradosfs for the storage backend.

Among others, the API provides a parallel stat method which can be used to get information about a number of paths in the system. From the way files are designed, statting files that belong to the same directory is much faster when performed together than individually (because in the latter case we have to read the omap multiple times). Thus, in a use-case like listing a directory, calling Filesystem::stat with a vector of paths is much faster than looping through the entries and statting them one by one (the stat method groups the paths per parent directory by itself, so there is no need for a more specialized API).
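The grouping step is easy to picture with a short Python sketch. `group_by_parent` is a hypothetical helper, not the library's code; it only shows why statting N files from the same directory needs one omap read instead of N.

```python
import posixpath
from collections import defaultdict


def group_by_parent(paths):
    """Group file paths by parent directory so each directory's omap is read once."""
    groups = defaultdict(list)
    for path in paths:
        # Keep the trailing slash used for directory object names.
        parent = posixpath.dirname(path).rstrip("/") + "/"
        groups[parent].append(posixpath.basename(path))
    return dict(groups)


paths = ["/a/x.txt", "/a/y.txt", "/b/z.txt", "/top.txt"]
groups = group_by_parent(paths)

omap_reads_grouped = len(groups)     # One read per parent directory.
omap_reads_one_by_one = len(paths)   # One read per file.
```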

There is a find method similar to UNIX find which looks for files/directories in a fast, parallel way, based on information like the size, name (with regex), attributes, etc.

Besides being able to set metadata in the form of extended attributes in directories or files, directories also support metadata about their entries, for information that is more intimate to this relation and which should not be set in the files directly. This metadata is also stored in the log, just like the entries themselves.

We also implemented a quota system as this is a common use case when managing a large storage system. The library does not enforce the quotas because the effect of exceeding the quota varies in definition and in the actions that should be taken; instead it includes an API to assign directories, users and groups to Quota objects, and to control the values associated with those. This feature is dependent on Ceph’s numops CLS which simplifies atomic arithmetic operations and we chose to include it in libradosfs so the quota system can be kept close to the filesystem.

Besides the features mentioned above, which are integrated in the library, there is also a FUSE client being developed at the moment.


libradosfs ships with a filesystem checker tool (libradosfs-fsck) and a benchmark (libradosfs-bench) for calculating the performance on a given cluster.

Performance tests

Although we cannot yet provide very detailed performance test results, I gathered some figures using libradosfs’s benchmark tool. The benchmark consists of counting how many files it can create and write synchronously in a given period. In this case, each benchmark ran for 5 minutes from my development machine (an i3 3.3 GHz dual core machine, 8 GB RAM running Fedora 22), using a cluster of 8 machines with 128 OSDs in total running the Ceph hammer release v0.94.3. The network connection is 1 Gbps and this also impacts the tests, as you can notice for the larger files in the table.


After the announcement, I spent some time benchmarking and checking how I could improve the benchmark results, basically trying to reduce the round trips to the server.

In the process I found out something weird: the faster cluster (using SSDs, for metadata and inline files’ operations) was actually giving me worse numbers than the slower cluster (HDDs). Specifically, the SSD cluster was twice as slow. After digging for a while, I realized that it was actually the reading, not the writing, that was slower and, more importantly, this was true not only for libradosfs but also for Ceph’s own rados benchmark!

In the end, the cluster admin investigated it and found out that the main reason was that TCMalloc was being run with the default cache size of 16 MB. After changing this value to 256 MB, the SSD cluster is now faster than the HDD one (as it should be) and the libradosfs benchmark operations that mainly use this cluster are up to roughly 3.5x faster than the old ones (for touching a file), as seen in the new table below:

| File size | Avg files/sec |
| --- | --- |
| 0 (touching a file) | 170.12 |
| 1KB (inline file) | 80.43 |
| 1MB (inline + inode) | 10.94 |
| 500MB (inline + inode) | 0.22 |
| 1GB (inline + inode) | 0.11 |

Old numbers for reference:

| File size | Avg files/sec |
| --- | --- |
| 0 (touching a file) | 47.66 |
| 1KB (inline file) | 31.17 |
| 1MB (inline + inode) | 8.18 |
| 500MB (inline + inode) | 0.21 |
| 1GB (inline + inode) | 0.11 |
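To put the two tables side by side, the following Python snippet computes the speedup per file size and a rough upper bound imposed by the 1 Gbps link. It assumes decimal units (1 GB = 10^9 bytes) and ignores protocol overhead and replication traffic, so the bound is only indicative.

```python
old = {"0": 47.66, "1KB": 31.17, "1MB": 8.18, "500MB": 0.21, "1GB": 0.11}
new = {"0": 170.12, "1KB": 80.43, "1MB": 10.94, "500MB": 0.22, "1GB": 0.11}

# Ratio of new to old files/sec for each file size.
speedup = {size: round(new[size] / old[size], 2) for size in old}

LINK_BITS_PER_SEC = 1e9  # 1 Gbps network connection.


def max_files_per_sec(size_bytes):
    """Upper bound on files/sec if the network were the only cost."""
    return LINK_BITS_PER_SEC / (size_bytes * 8)


network_bound = {
    "500MB": max_files_per_sec(500e6),
    "1GB": max_files_per_sec(1e9),
}

for size in old:
    print(f"{size:>6}: {speedup[size]}x")
```

The gains are concentrated in the small, metadata-heavy operations (about 3.6x for touching a file and 2.6x for 1KB inline files). For the 500MB and 1GB files the measured rates (0.22 and 0.11 files/sec) sit close to the link's ceiling of 0.25 and 0.125 files/sec, which is why they barely change.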

For additional information, the files were configured with an inline buffer of 1K so when writing more than 1K, both the inline buffer and the actual inode get written.
For reference, the number of possible synchronous write operations using RADOS with the above configuration is around 130 (but of course, creating/writing each file involves several IO operations in libradosfs).

As I mentioned, these are early numbers and there is surely room for improvement, such as reducing the number of round trips to the server. Although speed is always important, the most relevant aspect is that the numbers for more than one client should remain virtually the same (when each client writes in different directories).

Other Features

Here is a list summarizing some of the features we currently support:
* directory/file hierarchy
* multiple pool/path association
* direct inode creation and lazy association with a path
* quotas
* filesystem checker
* find method based on path, size and metadata/extended attributes
* inline files
* parallel statting
* symbolic links
* movable directories/files
* vector (parallel) read

Although libradosfs was designed with CERN’s use-case in mind, we tried to make it generic enough so that it may be useful for similar but non-High-Energy-Physics use-cases too. The library has been under development for a while now but it is not yet tested in production so you should not expect it to be a stable system at this point.

libradosfs is released under the LGPL license. If you want to try it yourself or contribute to it, you can get the source and documentation here.

Going to FOSDEM 2015

Tomorrow I am flying to Brussels to attend FOSDEM for the 8th time!
It is amazing to see how much the event has grown in these 8 years and I am looking forward to having another great weekend of interesting presentations, meeting old friends and sipping tasty beer.

I need to thank CERN for making this trip possible and if you want to find out about my current project there (soon to be announced), do let me know.

See you in Brussels!


Happy 2015!

Helena and I have just come back from the holidays with our family in Portugal and I would like to tell you how 2014 was a very good and special year in our lives. The big event that will make us never forget this past year was of course the birth of our daughter Olivia. Everyone who has kids will tell you how nice it is to have them and they’re right! All the tiredness, stress and lack of sleep are forgotten when we see her smile every morning.

Helena and I love to travel and have made at least one big trip every year for a few years. With the baby, those trips have to be shorter but since she was born we’ve already been to Portugal (twice), Spain (visiting our old colleagues in A Coruña) and the U.K. (more particularly London). That thing that people tell you about how having a child changes one’s perspective on many things is also very true for traveling. Olivia can be very easily awakened by noises so now we realize how noisy some cars and motorcycles are… London was awful in that regard. The underground was noisy as hell, including the very loud voice warnings. Also, since it is a big European capital, I was expecting its public sites to be accessible but even in the emblematic Victoria Station there was no elevator to access the underground. The sad thing is that while the stroller is a temporary annoyance for us, people in wheelchairs have to cope with that permanently.
We’re very curious about visiting Berlin with the baby to check for those annoyances there (because I seem to remember the underground being quieter and more accessible) so that’s a trip we might do this year.

The book count stayed low this year: I read 3 books and started another one which I haven’t finished yet (REAMDE by Neal Stephenson).

Even with the lack of time due to the baby, thanks to my wonderful wife I still keep playing squash and attending the CERN Micro Club once a week. Despite the awkward name, this is one of CERN’s many clubs and is concerned with technology, having several sections. I am part of the Robotics section in the club where we’ve been building a 3D printer.
This means less time for my side projects so this year, unfortunately, there were only a couple of versions of OCRFeeder and no new version of Skeltrack.
I think that the only solution for OCRFeeder is to eventually have new contributors if there is an interest in keeping the project alive.

About Skeltrack, although its development was stalled during most of 2014, my friend Iago keeps improving it for his Master’s Thesis, and I had a lot of emails from people who are using it. I even visited one of them at EPFL who is using the project with his own 3D cameras which means that having a device-agnostic library was a good decision.

As for my job at CERN, I am finishing what I have been working on so I hope to talk about it in more detail soon.

This year’s donations went mainly to the following places (apart from the EFF, to whom I donate every time I buy a Humble Bundle):
Wikipedia: don’t think I need to explain its relevance;
GNOME Builder: because this great guy was bold enough to quit his job in order to make an awesome and long needed IDE for GNOME (maybe you can still donate!);
Chão dos Meninos: an association from my hometown that helps children at risk. I always used to donate to big international projects such as Wikipedia and EFF but last year I realized that, since I don’t pay taxes in my country (because I live abroad), one way of contributing a small bit is to donate to an association such as this one.

I still do not know what 2015 will bring but I do hope that the tendency continues and it turns out to be a great year again!

Have a great 2015, everyone!


See you at FOSDEM 2014

This year I almost skipped FOSDEM. It is a delicate time for me to be away as I will be a dad soon, but the doctors say it should be okay if it’s for a couple of days, so I am going to FOSDEM for my 7th year in a row!

Due to that uncertainty, I haven’t proposed any presentation but if you want to talk about the projects I’m involved in or about work and life at CERN, let’s do it over a couple of excellent Belgian beers (or waffles if you prefer).

See you in Brussels!