At CERN the disk storage is managed by a system built in-house called EOS (developed by CERN’s IT Data & Storage Services group). EOS manages data from the LHC experiments and also from “regular” users and power applications such as CERNBox, a file storage/sharing service for CERN users.

EOS uses an in-memory namespace. This allows it to achieve very high metadata performance but has the downside that when it is started, the namespace data needs to be read entirely from disk into memory. With the amount of data we have in EOS at CERN (~40PB and ~400 million files), this feature naturally affects the scalability of the system, which is taking around 30 minutes to boot at the moment.

Introducing libradosfs

A good part of my job here has been to create a new storage backend that should improve this. So, for a while now, I have been working on a new filesystem library that scales horizontally and keeps the very namespace stored on disk. This library is based on the popular Ceph storage system and uses its RADOS (a distributed object store) client API. With lack of a more creative name, we have called it libradosfs. The long term goal for libradosfs, as hinted in the paragraphs above, is to replace EOS’s backend and namespace implementation, leveraging also all the nice features of Ceph (replication, erasure coding, etc.).

If you are familiar with Ceph, you are at least aware of CephFS. So why aren’t we using CephFS instead? It is a fair question. We started by implementing a small backend using CephFS and performed some tests but the fact that CephFS relies on metadata servers (MDS) and, since the code for using multiple MDS instances is not yet stable, using one single MDS represented a scalability issue for the amount of data and users we have (along with other limitations like having only one metadata pool per filesystem). Besides that, and equally important, by using directly the RADOS API, we have more freedom to tackle issues that affect our use-cases and more flexibility than what a POSIX-compliant system offers.

Thus, libradosfs should not be mistaken for CephFS. If you want a more generic, fast and POSIX-compliant filesystem implementation on top of Ceph, you should use CephFS. The good folks at Inktank (now Red Hat) are actively making it better and, who knows, maybe in the future CERN might use it too. If you somehow have use-cases that are more close to CERN’s in terms of scaling horizontally, and especially if you need a simpler code base that you can quickly tweak, you may want to look into libradosfs.

So, to summarize its definition: libradosfs is a client-side implementation of a filesystem using RADOS, designed to provide a scale-out namespace and optimized directory/file access without strict POSIX semantics.

The implementation has been refined quite a few times always with the purpose of offering a more scalable and flexible design. Writing about the implementation of the whole system would result in a very large post so I will just write about some parts of the core design below.

Implementation overview

libradosfs can associate any pools in Ceph with path prefixes, distinguishing between metadata and data pools. Metadata pools are used to store information about the directories so faster and replicated pools should be configured as metadata pools. Data pools store mostly file-related objects, so ideally erasure-coded pools should be used as data pools (for space efficiency). Pools are associated with a prefix to offer flexibility in terms of cluster configuration: e.g. we can have testing pools associated with the prefix /test/ so that we can easily set which specific data should go into them.

The namespace and hierarchy are stored directly in the object store, a directory is an object whose name is the full path (which points to an inode). For example, /a/b/c/ means there will be four objects named with the prefix paths: /, /a/, /a/b/ and /a/b/c/. This way, instead of having to traverse a directory / -> a/ -> b/ -> c/, we can directly stat the object using the full path. Files are implemented in a different way. They are represented by an entry in their parent directory’s omap (the omap is a key-value map associated with objects in RADOS) where the key determines the file name, e.g. file.notes.txt. Both directories and files have an associated object that acts like an inode, holding the file’s contents. We find out about a directory’s inode from a specific entry in its path object’s omap (rfs.inode); when it comes to files, the value associated with their key will hold information such as the pool and the inode name. Inode objects in libradosfs use a UUID as their name rather than an index since this is more suitable for a distributed environment.

Directories’ entries are stored as an incremental log in the directories’ inode objects. For example, creating file /doc.odt , /notes.txt and deleting /doc.odt would create the following entries in the root directory’s log: +name=’doc.odt’ +name=’notes.txt’ -name=’doc.odt’

The reason we have a log like this is so that we can quickly keep track of a directory’s contents incrementally, that is, every time we list the directory we read its contents from the last byte we had read previously rather than having to reread all the entries. (Before you worry, there’s a directory log compaction mechanism in place as well)

Files are written in chunks of a user-defined size for better IO and we also support inline files (for small files whose inode creation might not be worth it). The files’ API offers synchronous and asynchronous write methods which use timed locks: instead of locking and unlocking the file every time it is written, it locks the file for a while once and renews the lock when needed. Writing becomes faster this way, since we skip having to lock/unlock the file on every write call. To prevent locking the file for too long, files are unlocked if they have been idle (not claimed for a while).

Filesystem’s objects have UNIX file permissions associated with them. There is a also a way to change the user/group id at any point in the filesystem so the actions performed on it are as if that user/group had done them. This means that permissions are verified when e.g. accessing a file, but the library does not have access control to restrict who really uses it. That is, the system uses the user/group id set on it but does not restrict who sets it. That is beyond the scope of the library as this can be achieved by having the library accessed via a gateway: for example EOS or XRootD check the permissions on their own already but can use libradosfs for the storage backend.

Among others, the API provides a parallel stat method which can be used to get information about a number of paths in the system. From the way files are designed, statting files that belong to the same directory is much faster when performed together than individually (because we have to read the omap multiple times in the latter). Thus in a use-case like listing a directory, using the Filesystem::stat with a vector of paths is much faster than looping through the entries, statting one by one (the stat method groups the paths per parent directory by itself, no need for a more specialized API).

There is a find method similar to UNIX find which looks for files/directories in a fast, parallel way, based on information like the size, name (with regex), attributes, etc.

Besides being able to set metadata in the form of extended attributes in directories or files, directories also support metadata about their entries, for information that is more intimate to this relation and which should not be set in the files directly. This metadata is also stored in the log, just like the entries themselves.

We also implemented a quota system as this is a common use case when managing a large storage system. The library does not enforce the quotas because the effect of exceeding the quota varies in definition and in the actions that should be taken; instead it includes an API to assign directories, users and groups to Quota objects, and to control the values associated with those. This feature is dependent on Ceph’s numops CLS which simplifies atomic arithmetic operations and we chose to include it in libradosfs so the quota system can be kept close to the filesystem.

Besides the features mentioned above, which are integrated in the library, there is also a FUSE client being developed at the moment.

Tools

libradosfs ships with a filesystem checker tool (libradosfs-fsck) and a benchmark (libradosfs-bench) for calculating the performance on a given cluster.

Performance tests

Although we cannot yet provide very detailed performance tests results, I gathered some figures using libradosfs‘s benchmark tool. The benchmark consists in counting how many files it can create and write synchronously for a given period. In this case, each benchmark ran for 5 minutes from my development machine (an i3 3.3 GHz dual core machine, 8 GB RAM running Fedora 22), using a cluster of 8 machines with 128 OSDs in total running the Ceph hammer release v0.94.3. The network connection is 1 Gbps and this also impacts the tests as you can notice for the larger files in the table.

Update:

After the announcement, I was spent some time benchmarking and checking how I could improve the benchmark results, basically trying to reduce the round trips to the server.

In the process I found out something weird: the faster cluster (using SSD disks, for metadata and inline files’ operations) was actually giving me worse numbers than the slower cluster (HDD disks), specifically the SSD cluster was twice as slow. So after digging for a while, I realized that it was actually the reading, not the writing that was slower and, more importantly, this was true not only for libradosfs but also for Ceph’s own rados benchmark!

In the end, the cluster admin investigated it and found out that the main reason was that TCMalloc was being run with the default cache size of 16 MB. Changing this value to 256 MB, the SSD cluster is now faster then the HDD one (as it should be) and the libradosfs‘s benchmark operations that use mainly this cluster are almost 4x faster than the old ones as seen in the new table below:

<th>
  Avg files/sec
</th>

<td>
  170.12
</td>

<td>
  80.43
</td>

<td>
  10.94
</td>

<td>
  0.22
</td>

<td>
  0.11
</td>

File size
0 (touching a file)
1KB (inline file)
1MB (inline + inode)
500MB (inline + inode)
1GB (inline + inode)

Old numbers for reference:

<th>
  Avg files/sec
</th>

<td>
  47.66
</td>

<td>
  31.17
</td>

<td>
  8.18
</td>

<td>
  0.21
</td>

<td>
  0.11
</td>

File size
0 (touching a file)
1KB (inline file)
1MB (inline + inode)
500MB (inline + inode)
1GB (inline + inode)

For additional information, the files were configured with an inline buffer of 1K so when writing more than 1K, both the inline buffer and the actual inode get written. For reference, the number of possible synchronous write operations using RADOS with the above configuration is around 130 (but of course, creating/writing each file involves several IO operations in libradosfs).

As I mentioned, these are early numbers and there is surely room for improvements as reducing the number of trips to the server, but, although speed is always important, the most relevant aspect is that the numbers for more than 1 client should remain virtually the same (when each client writes in different directories).

Other Features

Here is a list summarizing some of the features we currently support:

directory/file hierarchy
multiple pool/path association
direct inode creation and lazy association with a path
quotas
filesystem checker
find method based on path, size and metadata/extended attributes
inline files
parallel statting
symbolic links
movable directories/files
vector (parallel) read

Although libradosfs was designed with CERN’s use-case in mind, we tried to make it generic enough so that it may be useful for similar but non-High-Energy-Physics use-cases too. The library has been under development for a while now but it is not yet tested in production so you should not expect it to be a stable system at this point.

libradosfs is released under the LGPL license, in case you want to try it yourself or contribute to it, you can get the source and documentation here.