ostree & Flatpak at CERN

A week and a half ago I spent a few days in Geneva and gave a presentation about ostree and Flatpak at the CERN Computing Seminar. I started by briefly introducing Endless to give some context about the problems we’re trying to solve and how we’re using ostree and Flatpak for that, then proceeded to talk in more detail about these technologies.

In the end there were several questions, and I was happy to learn afterwards that the audience included some of the people working on the CVMFS project: a software distribution service that helps deploy data-processing infrastructure and tools. I don’t know the full details of the project’s implementation, but from the problems they’re trying to solve it seems like ostree (or, more specifically, libostree) could perhaps be used to replace part of its core, which would bring all the niceties of building on a widely used Open Source project (more eyeballs looking into bugs, more testing, etc.). I also think more use-cases could be found in the organization, so I hope my talk was a small seed that helps introduce these projects at CERN in the medium/long term. The presentation was recorded, if you’re interested.

Getting authorization to access CERN was also different this time, as for the first time I got an entrance pass as a member of the CERN Alumni. So I would like to thank Antonella Del Rosso for the Alumni initiative and also for kindly lending me her EU-CH power adapter when I forgot mine at my friends’ home. In the end Antonella also interviewed me about my experience at CERN and after leaving, and produced this summary if you want to check it out.
I would also like to thank Miguel Ángel Marquina of the CERN Computing Seminar for organizing the presentation and all the details around it.

Sitting by the lake in Geneva with my daughter

Having spent more than 2 years in the region, it is the friends we have there that we miss the most. So it was great to meet them and old colleagues again.
My family traveled there with me and we stayed with friends from Spain, so it was funny to see our daughter (who used to play with those friends’ kids all the time when we lived there) blaming her shyness on not speaking Spanish. But after a day or two they were all successfully playing together; it’s amazing how children can get along no matter what differences or barriers they find, while adults often resort to stupid feelings and dangerous actions.
The mountain landscape is another thing we miss in Berlin, and the clear spring weather allowed us to fully gaze at the Jura and Mont Blanc, which should last us for another few months. After that, I guess I’ll try to find some graffiti of mountains around Berlin 🙂

Endless challenges

It’s been a (very long) while since I posted anything here. This past year I even managed to skip my traditional New Year’s post, and a lot has happened, so a post is due!

The big news is that last November I finished my 2-year term at CERN. I learned a lot there; it is truly a unique place in the world, but I finished what I set out to do and it was a good time for a change. So I am proud to announce that I have joined Endless!

I started this new job more than a month ago, but I couldn’t find the time to write about it before because we also moved to a new country… and found out how challenging it is to move with a baby! Last year we had finally started enjoying the Swiss/French region a lot more, but for a number of reasons we decided to move back to Berlin! Still, looking back, Geneva is where our daughter was born and we left many great friends there, so it’ll always be a special place for us and we’ll always have a reason to visit it.

Endless

If you do not know about Endless, its mission is to provide computers to the other half of the world, the part that desperately needs access to technology and knowledge but doesn’t happen to be in the minds, hearts or plans of the big software corporations.

Endless Computer

I first came across Endless in the beginning of 2013 and ever since I’ve had a special spot for what they’re doing, so I am very happy and thrilled to be part of it! So far it’s been really great to see things from the inside and witness the great talent, passion and energy with which everyone carries out their work. I can tell you that the human values at Endless are not just something that sounds nice; they really are a core part of its mission.
My goal in technology has always been to use it to solve problems; to make things easier for users. The products Endless is developing, and the users it develops them for, fit that end perfectly.

Endless's users

Hiring

Endless is accomplishing its mission using GNOME and many other Open Source technologies, and has gathered a great team on many fronts, from software to marketing to leadership. Changing the world is a lot of work, so if you want to help, Endless is hiring at this very moment! Take a look at these jobs if you want to apply. If you have any questions regarding the jobs or the company, you can also drop me a line.

And that’s it for now; I hope this long period of neglecting my blog has finally come to an end 🙂

Announcing libradosfs

At CERN, disk storage is managed by a system built in-house called EOS (developed by CERN’s IT Data & Storage Services group). EOS manages data from the LHC experiments and also from “regular” users, and powers applications such as CERNBox, a file storage/sharing service for CERN users.

EOS uses an in-memory namespace. This allows it to achieve very high metadata performance, but has the downside that, when it is started, the namespace data needs to be read entirely from disk into memory. With the amount of data we have in EOS at CERN (~40PB and ~400 million files), this naturally affects the scalability of the system, which currently takes around 30 minutes to boot.

Introducing libradosfs

A good part of my job here has been to create a new storage backend that should improve this. So, for a while now, I have been working on a new filesystem library that scales horizontally and keeps the namespace itself stored on disk.
This library is based on the popular Ceph storage system and uses its RADOS (a distributed object store) client API. For lack of a more creative name, we have called it libradosfs.
The long-term goal for libradosfs, as hinted in the paragraphs above, is to replace EOS’s backend and namespace implementation, also leveraging all the nice features of Ceph (replication, erasure coding, etc.).

If you are familiar with Ceph, you are at least aware of CephFS. So why aren’t we using CephFS instead? It is a fair question.
We started by implementing a small backend using CephFS and performed some tests, but CephFS relies on metadata servers (MDS) and, since the code for using multiple MDS instances is not yet stable, using a single MDS represented a scalability issue for the amount of data and users we have (along with other limitations, like having only one metadata pool per filesystem). Besides that, and equally important, by using the RADOS API directly we have more freedom to tackle issues that affect our use-cases, and more flexibility than what a POSIX-compliant system offers.

Thus, libradosfs should not be mistaken for CephFS. If you want a more generic, fast and POSIX-compliant filesystem implementation on top of Ceph, you should use CephFS. The good folks at Inktank (now Red Hat) are actively making it better and, who knows, maybe in the future CERN might use it too.
If, however, your use-cases are closer to CERN’s in terms of scaling horizontally, and especially if you need a simpler code base that you can quickly tweak, you may want to look into libradosfs.

So, to summarize its definition: libradosfs is a client-side implementation of a filesystem using RADOS, designed to provide a scale-out namespace and optimized directory/file access without strict POSIX semantics.

The implementation has been refined quite a few times, always with the purpose of offering a more scalable and flexible design.
Writing about the implementation of the whole system would result in a very long post, so I will just cover some parts of the core design below.

Implementation overview

libradosfs can associate any pools in Ceph with path prefixes, distinguishing between metadata and data pools. Metadata pools are used to store information about directories, so faster, replicated pools should be configured as metadata pools. Data pools store mostly file-related objects, so ideally erasure-coded pools should be used as data pools (for space efficiency).
Pools are associated with a prefix to offer flexibility in terms of cluster configuration: e.g. we can have testing pools associated with the prefix /test/ so that we can easily choose which data should go into them.
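As an illustration of the idea (not the actual libradosfs API), the prefix-to-pool association boils down to a longest-prefix lookup; the pool and prefix names below are made up.

```cpp
#include <iostream>
#include <map>
#include <string>

// Minimal sketch of mapping path prefixes to pool names (hypothetical
// names; this is not the real libradosfs API). The longest matching
// prefix wins, so /test/ can override the default / association.
static std::string
poolForPath(const std::map<std::string, std::string> &prefixToPool,
            const std::string &path)
{
  std::string bestPool;
  size_t bestLen = 0;

  for (const auto &entry : prefixToPool) {
    const std::string &prefix = entry.first;
    if (path.compare(0, prefix.size(), prefix) == 0 && prefix.size() > bestLen) {
      bestLen = prefix.size();
      bestPool = entry.second;
    }
  }
  return bestPool;
}

int main()
{
  // Hypothetical configuration: a default data pool plus a dedicated pool
  // for anything under /test/.
  std::map<std::string, std::string> dataPools = {
    {"/", "data-ec"},
    {"/test/", "data-test"}
  };

  std::cout << poolForPath(dataPools, "/home/user/file.txt") << std::endl; // data-ec
  std::cout << poolForPath(dataPools, "/test/scratch.bin") << std::endl;   // data-test
  return 0;
}
```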

The namespace and hierarchy are stored directly in the object store: a directory is an object whose name is the full path (and which points to an inode). For example, /a/b/c/ means there will be four objects named with the prefix paths: /, /a/, /a/b/ and /a/b/c/. This way, instead of having to traverse a directory / -> a/ -> b/ -> c/, we can directly stat the object using the full path.
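To make that concrete, here is a minimal sketch using plain librados (not libradosfs itself) that stats the object named after a full path in a single call; the pool name and object naming are assumptions for illustration.

```cpp
#include <rados/librados.hpp>
#include <iostream>

int main()
{
  librados::Rados cluster;

  // Connect with the default configuration and the client.admin id
  // (assumptions: a reachable Ceph cluster and a metadata pool called
  // "md-pool" holding the path objects).
  cluster.init("admin");
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0) {
    std::cerr << "could not connect to the cluster" << std::endl;
    return 1;
  }

  librados::IoCtx ioctx;
  cluster.ioctx_create("md-pool", ioctx);

  // Because the directory object is named after its full path, one stat
  // call is enough -- no need to walk / -> a/ -> b/ -> c/.
  uint64_t size = 0;
  time_t mtime = 0;
  int ret = ioctx.stat("/a/b/c/", &size, &mtime);

  if (ret == 0)
    std::cout << "/a/b/c/ exists (mtime " << mtime << ")" << std::endl;
  else
    std::cout << "/a/b/c/ not found: " << ret << std::endl;

  return 0;
}
```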
Files are implemented in a different way: they are represented by an entry in their parent directory’s omap (the omap is a key-value map associated with objects in RADOS), where the key encodes the file name, e.g. file.notes.txt.
Both directories and files have an associated object that acts like an inode, holding their contents. We find out about a directory’s inode from a specific entry in its path object’s omap (rfs.inode); for files, the value associated with their key holds information such as the pool and the inode name. Inode objects in libradosfs use a UUID as their name rather than an index, since this is more suitable for a distributed environment.
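Conceptually, resolving a directory’s inode or a file entry then comes down to omap reads on the path object. The sketch below uses raw librados omap calls; the rfs.inode key comes from the description above, but treating the values as plain strings is an assumption made for illustration.

```cpp
#include <rados/librados.hpp>
#include <iostream>
#include <map>
#include <set>

// Sketch: given a connected IoCtx for the metadata pool, read a
// directory's inode reference and the entry for one of its files from
// the path object's omap.
void lookupExample(librados::IoCtx &ioctx)
{
  std::set<std::string> keys = {"rfs.inode", "file.notes.txt"};
  std::map<std::string, librados::bufferlist> values;

  int ret = ioctx.omap_get_vals_by_keys("/a/b/c/", keys, &values);
  if (ret < 0) {
    std::cerr << "omap read failed: " << ret << std::endl;
    return;
  }

  for (const auto &kv : values) {
    // For the directory, the value points to its UUID-named inode object;
    // for a file entry, it carries things like the pool and inode name.
    std::cout << kv.first << " -> " << kv.second.to_str() << std::endl;
  }
}
```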

Directories’ entries are stored as an incremental log in the directories’ inode objects. For example, creating the files /doc.odt and /notes.txt and then deleting /doc.odt would create the following entries in the root directory’s log:
+name=’doc.odt’
+name=’notes.txt’
-name=’doc.odt’

The reason we have a log like this is so that we can keep track of a directory’s contents incrementally: every time we list the directory, we read its contents from the last byte we had read previously, rather than having to reread all the entries. (Before you worry, there is a directory log compaction mechanism in place as well.)
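At the RADOS level, the incremental listing is essentially an object read that starts at the last processed offset. A rough sketch (the chunk size is arbitrary and entry parsing is left out):

```cpp
#include <rados/librados.hpp>
#include <iostream>
#include <string>

// Sketch: read the directory's entry log from where we left off last
// time instead of rereading the whole log. Parsing the
// "+name=..."/"-name=..." lines is omitted; the caller keeps the returned
// offset for the next listing (compaction would reset it).
uint64_t readLogIncrementally(librados::IoCtx &ioctx,
                              const std::string &dirInode,
                              uint64_t lastOffset)
{
  const size_t chunkSize = 64 * 1024;
  librados::bufferlist bl;

  int ret = ioctx.read(dirInode, bl, chunkSize, lastOffset);
  if (ret <= 0)
    return lastOffset; // nothing new, or an error

  std::cout << "new log data:\n" << bl.to_str() << std::endl;
  return lastOffset + ret;
}
```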

Files are written in chunks of a user-defined size for better IO, and we also support inline files (for small files for which creating an inode might not be worth it).
The files’ API offers synchronous and asynchronous write methods which use timed locks: instead of locking and unlocking the file every time it is written, it locks the file once for a while and renews the lock when needed. Writing becomes faster this way, since we skip having to lock/unlock the file on every write call. To prevent locking the file for too long, files are unlocked if they have been idle (not claimed for a while).
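The timed-lock pattern maps naturally onto RADOS advisory locks with a duration. This is a hedged sketch of the idea, not the library’s internal code; the lock name and cookie are arbitrary illustrative values.

```cpp
#include <rados/librados.h>
#include <rados/librados.hpp>
#include <sys/time.h>
#include <string>

// Sketch of the timed-lock pattern: take an exclusive lock with a
// duration once, keep writing, and renew it before it expires rather
// than locking/unlocking around every single write call.
class TimedLock
{
public:
  TimedLock(librados::IoCtx &ioctx, const std::string &inode)
    : mIoctx(ioctx), mInode(inode) {}

  // First acquisition of the lock, valid for the given number of seconds.
  int acquire(int seconds)
  {
    struct timeval duration = {seconds, 0};
    return mIoctx.lock_exclusive(mInode, "rfs-write-lock", "client-cookie",
                                 "timed write lock", &duration, 0);
  }

  // Renewal: same lock name and cookie, with the renew flag, so the
  // duration is extended without releasing the lock in between.
  int renew(int seconds)
  {
    struct timeval duration = {seconds, 0};
    return mIoctx.lock_exclusive(mInode, "rfs-write-lock", "client-cookie",
                                 "timed write lock", &duration,
                                 LIBRADOS_LOCK_FLAG_RENEW);
  }

  // Called when the file has been idle for a while.
  void release()
  {
    mIoctx.unlock(mInode, "rfs-write-lock", "client-cookie");
  }

private:
  librados::IoCtx &mIoctx;
  std::string mInode;
};
```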

Filesystem objects have UNIX file permissions associated with them. There is also a way to change the user/group ID at any point, so that the actions performed on the filesystem are carried out as if that user/group had done them. This means that permissions are verified when, e.g., accessing a file, but the library does not have access control to restrict who really uses it: it uses the user/group ID set on it but does not restrict who sets it. That is beyond the scope of the library, as it can be achieved by accessing the library through a gateway: for example, EOS or XRootD already check permissions on their own but can use libradosfs as the storage backend.
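As a simplified illustration of the kind of check involved (not the library’s actual code), verifying UNIX permission bits against the currently set user/group could look like this:

```cpp
#include <sys/types.h>

// Minimal sketch of a UNIX-style permission check against the uid/gid
// currently set on the filesystem instance. "wanted" uses the classic
// bit values: 4 = read, 2 = write, 1 = execute. A real implementation
// would also handle supplementary groups, sticky bits, etc.
bool hasPermission(uid_t fileUid, gid_t fileGid, mode_t fileMode,
                   uid_t uid, gid_t gid, mode_t wanted)
{
  if (uid == 0)
    return true; // root bypasses the check

  mode_t bits;
  if (uid == fileUid)
    bits = (fileMode >> 6) & 07; // owner bits
  else if (gid == fileGid)
    bits = (fileMode >> 3) & 07; // group bits
  else
    bits = fileMode & 07;        // other bits

  return (bits & wanted) == wanted;
}
```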

Among others, the API provides a parallel stat method which can be used to get information about a number of paths in the system. Because of the way files are designed, statting files that belong to the same directory is much faster when performed together than individually (since the latter requires reading the omap multiple times). Thus, in a use-case like listing a directory, using Filesystem::stat with a vector of paths is much faster than looping through the entries and statting them one by one (the stat method groups the paths per parent directory by itself, so there is no need for a more specialized API).
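The key design choice is to batch the omap reads per parent directory. A small self-contained sketch of that grouping step (independent of the actual library code):

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch: group paths by their parent directory so that all entries
// living in the same directory can be resolved with a single omap read,
// which is what makes the vector version of stat faster than statting
// the paths one by one. Directory paths ending in '/' are not
// special-cased here to keep the sketch short.
std::map<std::string, std::vector<std::string>>
groupByParentDir(const std::vector<std::string> &paths)
{
  std::map<std::string, std::vector<std::string>> grouped;

  for (const std::string &path : paths) {
    size_t slash = path.find_last_of('/');
    std::string parent = (slash == std::string::npos)
                             ? "/"
                             : path.substr(0, slash + 1);
    grouped[parent].push_back(path);
  }
  return grouped;
}
```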

There is a find method, similar to UNIX find, which looks for files/directories in a fast, parallel way, based on information like size, name (with regex support), attributes, etc.

Besides being able to set metadata in the form of extended attributes on directories or files, directories also support metadata about their entries, for information that is more intimately tied to this relation and which should not be set on the files directly. This metadata is also stored in the log, just like the entries themselves.

We also implemented a quota system, as this is a common use-case when managing a large storage system. The library does not enforce quotas, because the effect of exceeding a quota varies in definition and in the actions that should be taken; instead, it includes an API to assign directories, users and groups to Quota objects, and to control the values associated with them. This feature depends on Ceph’s numops CLS, which simplifies atomic arithmetic operations, and we chose to include it in libradosfs so the quota system can be kept close to the filesystem.
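For the accounting itself, the numops object class provides atomic add/subtract on values stored in an object’s omap. Below is a hedged sketch, assuming the cls_numops client header shipped in the Ceph source tree (the exact header path and signature may differ between releases); the object and key names are made up.

```cpp
#include <rados/librados.hpp>
#include <cls/numops/cls_numops_client.h>
#include <string>

// Sketch: atomically add the size of newly written data to a quota
// object's "used bytes" counter using the numops object class. Checking
// the result against the configured maximum (and deciding what to do
// about it) is left to the caller, since the library itself does not
// enforce quotas.
int addToQuotaUsage(librados::IoCtx &ioctx, const std::string &quotaObj,
                    double bytesWritten)
{
  return rados::cls::numops::add(&ioctx, quotaObj, "rfs.quota.used",
                                 bytesWritten);
}
```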

Besides the features mentioned above, which are integrated in the library, there is also a FUSE client being developed at the moment.

Tools

libradosfs ships with a filesystem checker tool (libradosfs-fsck) and a benchmark (libradosfs-bench) for calculating the performance on a given cluster.

Performance tests

Although we cannot yet provide very detailed performance test results, I gathered some figures using libradosfs’s benchmark tool. The benchmark consists of counting how many files it can create and write synchronously for a given period. In this case, each benchmark ran for 5 minutes from my development machine (an i3 3.3 GHz dual-core machine with 8 GB RAM running Fedora 22), using a cluster of 8 machines with 128 OSDs in total running the Ceph hammer release v0.94.3. The network connection is 1 Gbps, and this also impacts the tests, as you can notice for the larger files in the table.

Update:

After the announcement, I spent some time benchmarking and checking how I could improve the benchmark results, basically trying to reduce the round trips to the server.

In the process I found out something weird: the faster cluster (with SSD disks, used for metadata and inline files’ operations) was actually giving me worse numbers than the slower cluster (with HDD disks); specifically, the SSD cluster was twice as slow. So after digging for a while, I realized that it was actually the reading, not the writing, that was slower and, more importantly, that this was true not only for libradosfs but also for Ceph’s own rados benchmark!

In the end, the cluster admin investigated it and found out that the main reason was that TCMalloc was being run with the default cache size of 16 MB. After changing this value to 256 MB, the SSD cluster is now faster than the HDD one (as it should be), and the libradosfs benchmark operations that mainly use this cluster are almost 4x faster than the old ones, as seen in the new table below:

File size | Avg files/sec
--------- | -------------
0 (touching a file) | 170.12
1KB (inline file) | 80.43
1MB (inline + inode) | 10.94
500MB (inline + inode) | 0.22
1GB (inline + inode) | 0.11

Old numbers for reference:

File size | Avg files/sec
--------- | -------------
0 (touching a file) | 47.66
1KB (inline file) | 31.17
1MB (inline + inode) | 8.18
500MB (inline + inode) | 0.21
1GB (inline + inode) | 0.11

For additional information, the files were configured with an inline buffer of 1K so when writing more than 1K, both the inline buffer and the actual inode get written.
For reference, the number of possible synchronous write operations using RADOS with the above configuration is around 130 (but of course, creating/writing each file involves several IO operations in libradosfs).
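For context, a raw RADOS baseline like that can be approximated with a simple loop of synchronous writes. The sketch below only illustrates how such a number might be gathered (the pool name, object size and duration are arbitrary assumptions); the actual measurement setup may have differed.

```cpp
#include <rados/librados.hpp>
#include <chrono>
#include <iostream>
#include <string>

// Rough sketch: write small objects one after another for a fixed period
// and count them, giving synchronous RADOS write operations per second.
int main()
{
  librados::Rados cluster;
  cluster.init("admin");
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0)
    return 1;

  librados::IoCtx ioctx;
  cluster.ioctx_create("bench-pool", ioctx); // assumed pool name

  librados::bufferlist bl;
  bl.append(std::string(4096, 'x')); // 4 KB payload

  const auto start = std::chrono::steady_clock::now();
  const auto period = std::chrono::seconds(60);
  unsigned long ops = 0;

  while (std::chrono::steady_clock::now() - start < period) {
    ioctx.write_full("bench-obj-" + std::to_string(ops), bl);
    ++ops;
  }

  std::cout << ops / 60.0 << " synchronous writes/sec" << std::endl;
  return 0;
}
```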

As I mentioned, these are early numbers and there is surely room for improvement, such as reducing the number of round trips to the server. But, although speed is always important, the most relevant aspect is that the numbers for more than one client should remain virtually the same (when each client writes in different directories).

Other Features

Here is a list summarizing some of the features we currently support:
* directory/file hierarchy
* multiple pool/path association
* direct inode creation and lazy association with a path
* quotas
* filesystem checker
* find method based on path, size and metadata/extended attributes
* inline files
* parallel statting
* symbolic links
* movable directories/files
* vector (parallel) read

Although libradosfs was designed with CERN’s use-case in mind, we tried to make it generic enough that it may be useful for similar, non-High-Energy-Physics use-cases too. The library has been under development for a while now, but it has not yet been tested in production, so you should not expect it to be a stable system at this point.

libradosfs is released under the LGPL license. In case you want to try it yourself or contribute to it, you can get the source and documentation here.