Endless OS 3.0 is out!

So our latest and greatest Endless OS is out with the new 3.0 version series!
The shiny new things include the use of Flatpak to manage applications; a new app center (GNOME Software); a new icon set; a new Windows installer that lets you install Endless OS in a dual-boot setup; and many bug fixes.

Apps, apps, apps!


Endless cake to celebrate the 3.0 release! A work of art and flavor made by Jonathan Blandford

One of the big changes is the replacement of our old (and in-house) App Store with GNOME Software — the GNOME app center. Most of my time over the past months has been spent adapting this project to our needs. GNOME Software is surely a complex beast, but I have had the invaluable help of its maintainer — Richard Hughes — to whom I now owe many Weißbiere.
Last week I gave a talk at the first edition of the Libre Application Summit in Portland about the work we’re doing on the applications story in Endless OS: the evolution of the applications in the OS, the motivation behind some decisions, the changes we made to GNOME Software, etc. A video and slides should be up on the internetz soon if you want to know about that in more detail.

Join the future

The changes in this new 3.0 version may not seem like such a big deal on the surface, but everybody had to work really hard to make them happen and they open up a lot of possibilities for our users and developers. We’re betting big on Flatpak and we want to see it succeed, since not only Endless but pretty much every Linux desktop user would benefit from it. So if you’re an app developer, check it out and talk to the community if you need some help. We’re also still hiring, in case you are looking for new challenges.

Be sure to try Endless OS and drop your thoughts or questions in our Community Forum.

Endless and LAS GNOME

I’ve been spending the week in San Francisco, where I’ve been going every day to the awesome Endless office in SoMa.
It’s been really great to talk in person to all the people I usually have to ping on the internetz and experience a bit of the office life in San Francisco.

Endless' Office in San Francisco

Next Monday I am speaking at the Libre Application Summit (LAS GNOME) in Portland about how we manage and deliver applications to Endless OS users. I am also very curious to check out the city of Portland, as everybody tells me good things about it.
If you’re attending the event, come say hi!

Going to GUADEC 2016

That’s right, tomorrow I will take the train down to Karlsruhe to attend GUADEC 2016 after a 3-year absence (time flies!).

I am looking forward to meeting old friends and attending some nice talks and BoF events. At Endless we have been very busy working on the next release of Endless OS, and some of my colleagues are giving very interesting talks related to our work, so be sure to check them out! I have been working on the applications story, which will be using Flatpak and GNOME Software, so if you are interested in knowing more about that we can have a chat too.

Going to GUADEC 2016 badge

Endless challenges

It’s been a (very long) while since I posted anything here. This past year I even managed to skip my traditional New Year’s post, and a lot has happened, so a post is due!

The big news is that last November I finished my 2-year term at CERN. I learned a lot there; it is truly a unique place in the world, but I finished what I set out to do and it was a good time for a change. So I am proud to announce that I have joined Endless!

I started this new job more than a month ago, but I couldn’t find the time to write about it before because we also moved to a new country… and found out how challenging it is to move with a baby! Last year we had finally started enjoying the Swiss/French region a lot more, but for a number of reasons we decided to move back to Berlin! Still, looking back, Geneva is where our daughter was born and we left many great friends there, so it will always be a special place for us and we’ll always have a reason to visit.

Endless

If you do not know about Endless, its mission is to provide computers to the other half of the world, the part that desperately needs access to technology and knowledge but doesn’t happen to be in the minds, hearts or plans of the big software corporations.

Endless Computer

I met Endless at the beginning of 2013 and ever since I’ve had a special spot for what they’re doing, so I am very happy and thrilled to be part of it! So far it’s been really great to see things from the inside and witness the talent, passion and energy with which everyone carries out their work. I can tell you that the human values at Endless are not just something that sounds nice; they really are a core part of its mission.
My goal in technology has always been to use it to solve problems; to make things easier for users. The products Endless is developing, and the users it develops them for, fit that end perfectly.

Endless's users

Hiring

Endless is accomplishing its mission using GNOME and many other Open Source technologies, and has gathered a great team on many fronts, from software to marketing to leadership. Changing the world is a lot of work, so if you want to help, Endless is hiring at this very moment! Take a look at these jobs if you want to apply. If you have any questions about the jobs or the company, you can also drop me a line.

And that’s it for now; I hope this long period of neglecting my blog has finally come to an end :)

Announcing libradosfs

At CERN, disk storage is managed by EOS, a system built in-house by CERN’s IT Data & Storage Services group. EOS manages data from the LHC experiments and also from “regular” users and power applications such as CERNBox, a file storage/sharing service for CERN users.

EOS uses an in-memory namespace. This gives it very high metadata performance, but it has the downside that when EOS starts, the namespace data needs to be read entirely from disk into memory. With the amount of data we have in EOS at CERN (~40 PB and ~400 million files), this naturally affects the scalability of the system, which currently takes around 30 minutes to boot.

Introducing libradosfs

A good part of my job here has been to create a new storage backend that should improve this. So, for a while now, I have been working on a new filesystem library that scales horizontally and keeps the namespace itself stored on disk.
This library is based on the popular Ceph storage system and uses the client API of RADOS (Ceph’s distributed object store). For lack of a more creative name, we have called it libradosfs.
The long-term goal for libradosfs, as hinted above, is to replace EOS’s backend and namespace implementation, also leveraging all the nice features of Ceph (replication, erasure coding, etc.).

If you are familiar with Ceph, you are at least aware of CephFS. So why aren’t we using CephFS instead? It is a fair question.
We started by implementing a small backend using CephFS and performed some tests, but CephFS relies on metadata servers (MDS) and, since the code for running multiple MDS instances is not yet stable, using a single MDS represented a scalability issue for the amount of data and users we have (along with other limitations, such as having only one metadata pool per filesystem). Besides that, and equally important, by using the RADOS API directly we have more freedom to tackle issues that affect our use cases, and more flexibility than what a POSIX-compliant system offers.

Thus, libradosfs should not be mistaken for CephFS. If you want a more generic, fast and POSIX-compliant filesystem implementation on top of Ceph, you should use CephFS. The good folks at Inktank (now Red Hat) are actively making it better and, who knows, maybe in the future CERN might use it too.
If your use cases are closer to CERN’s in terms of scaling horizontally, and especially if you need a simpler code base that you can quickly tweak, you may want to look into libradosfs.

So, to summarize its definition: libradosfs is a client-side implementation of a filesystem using RADOS, designed to provide a scale-out namespace and optimized directory/file access without strict POSIX semantics.

The implementation has been refined quite a few times, always with the purpose of offering a more scalable and flexible design.
Writing about the implementation of the whole system would result in a very long post, so I will just cover some parts of the core design below.

Implementation overview

libradosfs can associate any pools in Ceph with path prefixes, distinguishing between metadata and data pools. Metadata pools store information about directories, so fast, replicated pools should be configured as metadata pools. Data pools mostly store file-related objects, so ideally erasure-coded pools should be used as data pools (for space efficiency).
Pools are associated with a prefix to offer flexibility in cluster configuration: for example, we can associate testing pools with the prefix /test/ so that we can easily choose which data goes into them.
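As a rough usage sketch of this pool/prefix association: apart from the Filesystem class (mentioned later for stat), the namespace, header path, method names and signatures below are my assumptions and may not match the real libradosfs API.

```cpp
// Sketch only: the namespace, header path, method names and signatures here
// are assumptions and may differ from the real libradosfs API.
#include <libradosfs.hh>  // header name assumed

int main()
{
  radosfs::Filesystem fs;
  fs.init("admin", "/etc/ceph/ceph.conf");  // connect to the cluster (assumed call)

  // Fast, replicated pool for directory metadata under the /test/ prefix.
  fs.addMetadataPool("replicated-ssd-pool", "/test/");

  // Erasure-coded pool for file data (inodes) under the same prefix.
  fs.addDataPool("ec-hdd-pool", "/test/");

  // From here on, anything created under /test/ uses these pools.
  return 0;
}
```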

The namespace and hierarchy are stored directly in the object store: a directory is an object whose name is the full path (and which points to an inode). For example, /a/b/c/ means there will be four objects, named after the prefix paths: /, /a/, /a/b/ and /a/b/c/. This way, instead of having to traverse the hierarchy / -> a/ -> b/ -> c/, we can stat the object directly using the full path.
Files are implemented differently. They are represented by an entry in their parent directory’s omap (the omap is a key-value map associated with objects in RADOS), where the key encodes the file name, e.g. file.notes.txt.
Both directories and files have an associated object that acts like an inode and holds the contents. We find a directory’s inode through a specific entry in its path object’s omap (rfs.inode); for files, the value associated with their key holds information such as the pool and the inode name. Inode objects in libradosfs use a UUID as their name rather than an index, since this is more suitable for a distributed environment.
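To make the layout concrete, here is a small sketch that pokes at it with plain librados. This illustrates the scheme described above, not the libradosfs API itself; the pool name and paths are made up, while rfs.inode is the omap key mentioned in the text.

```cpp
// Illustration of the layout using plain librados, not the libradosfs API.
// The pool name and paths are made up; "rfs.inode" is the omap key described above.
#include <rados/librados.hpp>
#include <iostream>
#include <map>
#include <set>
#include <string>

int main()
{
  librados::Rados cluster;
  cluster.init("admin");            // connect as client.admin
  cluster.conf_read_file(nullptr);  // use the default ceph.conf
  cluster.connect();

  librados::IoCtx metadata;
  cluster.ioctx_create("metadata-pool", metadata);

  // A directory is an object named with its full path, so "/a/b/c/" can be
  // stat'ed directly instead of traversing / -> a/ -> b/ -> c/.
  uint64_t size = 0;
  time_t mtime = 0;
  if (metadata.stat("/a/b/c/", &size, &mtime) == 0)
    std::cout << "/a/b/c/ exists\n";

  // The name of the directory's inode object is kept in its omap under "rfs.inode".
  std::set<std::string> keys = {"rfs.inode"};
  std::map<std::string, librados::bufferlist> vals;
  metadata.omap_get_vals_by_keys("/a/b/c/", keys, &vals);
  if (vals.count("rfs.inode"))
    std::cout << "inode object: " << vals["rfs.inode"].to_str() << "\n";

  cluster.shutdown();
  return 0;
}
```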

Directories’ entries are stored as an incremental log in the directories’ inode objects. For example, creating the files /doc.odt and /notes.txt and then deleting /doc.odt would create the following entries in the root directory’s log:
+name=’doc.odt’
+name=’notes.txt’
-name=’doc.odt’

The reason we keep a log like this is so that we can track a directory’s contents incrementally: every time we list the directory, we read its contents from the last byte read previously rather than having to reread all the entries. (Before you worry: there is also a directory log compaction mechanism in place.)
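As a toy illustration of this incremental reading (plain C++, independent of the actual libradosfs code), a client can remember the offset of the last byte it processed and only apply the new records to its view of the directory:

```cpp
// Toy replay of a directory log like the one above (independent of the real
// libradosfs code): remember how much of the log was already read and apply
// only the new "+name='...'"/"-name='...'" records to the in-memory view.
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

struct DirView
{
  std::set<std::string> entries;  // current view of the directory's contents
  size_t bytesRead = 0;           // how much of the log has been processed
};

void applyNewLogEntries(DirView &view, const std::string &log)
{
  std::istringstream unread(log.substr(view.bytesRead));  // only the new tail
  std::string line;

  while (std::getline(unread, line))
  {
    if (line.size() < 8)  // shortest valid record is +name=''
      continue;

    // Strip the leading "+name='" / "-name='" and the trailing quote.
    const std::string name = line.substr(7, line.size() - 8);

    if (line[0] == '+')
      view.entries.insert(name);
    else if (line[0] == '-')
      view.entries.erase(name);
  }

  view.bytesRead = log.size();
}
```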

Files are written in chunks of a user-defined size for better I/O, and we also support inline files (for small files where creating an inode might not be worth it).
The file API offers synchronous and asynchronous write methods which use timed locks: instead of locking and unlocking the file every time it is written, the file is locked once for a while and the lock is renewed when needed. Writing becomes faster this way, since we skip having to lock/unlock the file on every write call. To prevent holding a lock for too long, files are unlocked once they have been idle (not claimed) for a while.
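The timed-lock idea can be sketched roughly like this (again just an illustration, not the actual libradosfs code): the writer acquires a lock with an expiry and only renews it when a write call finds it about to expire, so most writes skip the lock round trip entirely.

```cpp
// Rough sketch of the timed-lock idea (not the actual libradosfs code):
// take a lock that expires on its own and renew it only when a write call
// finds it close to expiring, so most writes avoid a lock round trip.
#include <chrono>

class TimedLock
{
public:
  explicit TimedLock(std::chrono::seconds duration) : duration_(duration) {}

  // Called at the start of every write; only talks to the cluster when needed.
  void ensureLocked()
  {
    const auto now = std::chrono::steady_clock::now();
    if (now + renewMargin_ >= expiry_)  // expired, or about to expire
    {
      acquireOrRenewOnCluster();        // placeholder for the actual lock call
      expiry_ = now + duration_;
    }
  }

private:
  // In the real system this would be an exclusive object lock with a timeout;
  // here it is just a placeholder.
  void acquireOrRenewOnCluster() {}

  std::chrono::seconds duration_;
  std::chrono::seconds renewMargin_{1};              // renew slightly before expiry
  std::chrono::steady_clock::time_point expiry_{};   // starts out already expired
};
```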

Filesystem objects have UNIX file permissions associated with them. There is also a way to change the user/group id at any point, so that actions performed on the filesystem are carried out as if that user/group had done them. This means that permissions are verified when, e.g., accessing a file, but the library does not have access control to restrict who really uses it: it honors the user/group id set on it but does not restrict who sets it. That is beyond the scope of the library, as it can be achieved by accessing the library through a gateway: for example, EOS or XRootD already check permissions on their own but can use libradosfs as the storage backend.
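The verification itself is the usual UNIX owner/group/other bit check; a self-contained version of that check (an illustration, not the libradosfs source) could look like this:

```cpp
// The usual UNIX owner/group/other permission check, as an illustration of
// what verifying permissions means here (this is not the libradosfs source).
#include <sys/types.h>

// 'wanted' uses the classic 4/2/1 read/write/execute bits.
bool hasPermission(mode_t mode, uid_t ownerUid, gid_t ownerGid,
                   uid_t uid, gid_t gid, mode_t wanted)
{
  if (uid == 0)                                       // root bypasses the check
    return true;
  if (uid == ownerUid)
    return (mode & (wanted << 6)) == (wanted << 6);   // owner bits
  if (gid == ownerGid)
    return (mode & (wanted << 3)) == (wanted << 3);   // group bits
  return (mode & wanted) == wanted;                   // "other" bits
}
```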

Among others, the API provides a parallel stat method which can be used to get information about a number of paths in one go. Because of the way files are designed, statting files that belong to the same directory is much faster when done together than individually (in the latter case we have to read the parent’s omap multiple times). Thus, in a use case like listing a directory, calling Filesystem::stat with a vector of paths is much faster than looping through the entries and statting them one by one (the stat method groups the paths by parent directory itself, so there is no need for a more specialized API).
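As a usage sketch: only the fact that Filesystem::stat takes a vector of paths comes from the description above; the namespace, header, return type and surrounding names below are assumptions.

```cpp
// Sketch of the parallel stat: only Filesystem::stat taking a vector of paths
// comes from the text; the namespace, header, return type and everything else
// here is an assumption and may not match the real API.
#include <libradosfs.hh>  // header name assumed
#include <string>
#include <vector>

void statEntries(radosfs::Filesystem &fs, const std::vector<std::string> &paths)
{
  // One call instead of a loop of individual stats: the paths are grouped by
  // parent directory internally, so each directory's omap is read only once.
  auto results = fs.stat(paths);  // assumed: one stat result per input path

  for (const auto &result : results)
  {
    (void)result;  // ... consume the returned stat information ...
  }
}
```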

There is a find method, similar to UNIX find, which looks for files/directories in a fast, parallel way based on criteria such as size, name (with regex support), attributes, etc.

Besides being able to set metadata in the form of extended attributes on directories or files, directories also support metadata about their entries, for information that belongs to the directory–entry relation and should not be set on the files directly. This metadata is also stored in the log, just like the entries themselves.

We also implemented a quota system, as this is a common need when managing a large storage system. The library does not enforce the quotas, because what exceeding a quota means, and which actions should be taken, varies between deployments; instead it includes an API to assign directories, users and groups to Quota objects, and to control the values associated with those. This feature depends on Ceph’s numops object class (CLS), which simplifies atomic arithmetic operations, and we chose to include it in libradosfs so the quota system can be kept close to the filesystem.
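Purely as a hypothetical usage sketch (only the Quota object concept comes from the paragraph above; every method name below is invented for illustration and is unlikely to match the real API exactly), a gateway sitting on top of libradosfs might use it roughly like this:

```cpp
// Hypothetical sketch: only the Quota object concept comes from the text.
// The header, method names and signatures below are invented for illustration
// and are unlikely to match the real libradosfs API exactly.
#include <libradosfs.hh>  // header name assumed
#include <iostream>

void checkBeforeWrite(radosfs::Quota &quota, uid_t uid, size_t bytesToWrite)
{
  // The library only stores and updates the quota values (atomically, via
  // Ceph's numops CLS); deciding what "over quota" means is up to the caller.
  const int64_t used = quota.currentSize(uid);   // invented accessor
  const int64_t max  = quota.maxSize(uid);       // invented accessor

  if (max > 0 && used + static_cast<int64_t>(bytesToWrite) > max)
    std::cerr << "user " << uid << " would exceed their quota\n";
  else
    quota.updateCurrentSize(uid, bytesToWrite);  // invented atomic update
}
```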

Besides the features mentioned above, which are integrated in the library, there is also a FUSE client under development.

Tools

libradosfs ships with a filesystem checker (libradosfs-fsck) and a benchmark tool (libradosfs-bench) for measuring performance on a given cluster.

Performance tests

Although we cannot yet provide very detailed performance test results, I gathered some figures using libradosfs’s benchmark tool. The benchmark consists of counting how many files it can create and write synchronously over a given period. In this case, each benchmark ran for 5 minutes from my development machine (an i3 3.3 GHz dual-core machine with 8 GB RAM running Fedora 22), against a cluster of 8 machines with 128 OSDs in total running the Ceph Hammer release v0.94.3. The network connection is 1 Gbps, which also affects the tests, as you can see for the larger files in the table.

Update:

After the announcement, I spent some time benchmarking and checking how I could improve the benchmark results, basically trying to reduce the round trips to the server.

In the process I found something weird: the faster cluster (using SSD disks, for metadata and inline-file operations) was actually giving me worse numbers than the slower cluster (HDD disks); specifically, the SSD cluster was twice as slow. After digging for a while, I realized that it was actually the reading, not the writing, that was slower and, more importantly, this was true not only for libradosfs but also for Ceph’s own rados benchmark!

In the end, the cluster admin investigated it and found out that the main reason was that TCMalloc was running with its default cache size of 16 MB. After changing this value to 256 MB, the SSD cluster is now faster than the HDD one (as it should be) and the libradosfs benchmark operations that mainly use this cluster are almost 4x faster than before, as seen in the new table below:

File size                Avg. files/sec
0 (touching a file)      170.12
1KB (inline file)        80.43
1MB (inline + inode)     10.94
500MB (inline + inode)   0.22
1GB (inline + inode)     0.11

Old numbers for reference:

File size                Avg. files/sec
0 (touching a file)      47.66
1KB (inline file)        31.17
1MB (inline + inode)     8.18
500MB (inline + inode)   0.21
1GB (inline + inode)     0.11

For additional information, the files were configured with an inline buffer of 1K, so when writing more than 1K, both the inline buffer and the actual inode get written.
For reference, the number of synchronous write operations per second that plain RADOS achieves with the above configuration is around 130 (but of course, creating/writing each file involves several IO operations in libradosfs).

As I mentioned, these are early numbers and there is surely room for improvement, such as reducing the number of round trips to the server. But although speed is always important, the most relevant aspect is that the numbers should remain virtually the same with more than one client (as long as each client writes to different directories).

Other Features

Here is a list summarizing some of the features we currently support:
* directory/file hierarchy
* multiple pool/path association
* direct inode creation and lazy association with a path
* quotas
* filesystem checker
* find method based on path, size and metadata/extended attributes
* inline files
* parallel statting
* symbolic links
* movable directories/files
* vector (parallel) read

Although libradosfs was designed with CERN’s use case in mind, we tried to make it generic enough that it may be useful for similar but non-High-Energy-Physics use cases too. The library has been under development for a while now but has not yet been tested in production, so you should not expect it to be a stable system at this point.

libradosfs is released under the LGPL license. If you want to try it yourself or contribute to it, you can get the source and documentation here.