Editor's Note: In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, talks about the architectural differences between HDFS and MapR-FS that boil down to three numbers.
Here's the transcript:
Hi! I'd like to talk a little bit about the key architectural features of the MapR File System, especially in contrast with the normal HDFS Hadoop distributed file system. Normal, in the sense that it's the default on Hadoop distributions. With HDFS the design is relatively simple, which is consistent with the single goal that HDFS had when it was first built. The design has a name node, which is a single process, notionally a single master process at least, in all cases, which contains an in-memory table which stores all file metadata, and in particular the locations of all of the blocks in the file system. There's a central resource that keeps track of the locations and characteristics of all data in the cluster. That's key.
That's a key aspect of its design, that's what make it quick to implement to begin with, and that's what limits it to some degree to a particular missions. MapR's file system, in contrast, although completely Hadoop compatible, has a different architecture, a different design, and therefore different capabilities. In MapR you have a container location database. It stores a little bit of information beyond container locations, but not much. The key things are the locations, as you might imagine, of things called containers. Containers are replicated data structures which live in the cluster in storage pools. Storage pools are just collections of disks. So each container contains things like files, pieces of files, and directories. Directories contain references to files, and in particular, contain locations of the so-called chunks in a file.
A file therefore consists of a concatenation of a number of chunks which can be found by going to the containers and accessing those chunks directly. This appears to be a considerably more complex design than the HDFS design, and it is a bit more complex in certain respects; but this extra step of indirection by having containers involved here is very, very useful because it introduces parallelism into the system. It also is useful because the container locations are very, very stable quantities. Containers are rarely allocated, and they are very rarely moved. They're only moved on unusual circumstances, say when the clusters space utilization becomes highly unbalanced. As such, the container location database is relatively stable. Containers are also large. That means there's relatively few of them compared to say, chunks or files.
That means that the container location database is not only stable, but it's small. This allows the container location database in the MapR file system to be stored in a special container. That container is of course replicated. That's how the MapR containers work. Those are the unit of transactions and the units of replication. By being replicated to multiple nodes, that inherently makes the MapR system highly available, highly reliable from the ground up, due to architectural considerations. This is exciting because by building that in from the beginning, you don't have to paste on features to try to add high availability to the system; it comes inherently.
All of these considerations can be seen as consequences of one architectural decision that was made very early in the design of MapR-FS. There are several design parameters in each system, but the key design parameter in HDFS is that there is a single block size. This single block size is configurable: For a single HDFS cluster you can change the block size when you configure the cluster, but it cannot be changed later. As such, all applications, all tenants of the cluster have to agree on some compromised value. Normally, that's set ... Oh, excuse me, normally set to 128 MB, the earlier default was 64 MB, and that size is tuned for large scale batch programs.
Makes sense when what you're doing is batch, but by having a single number, this constrains the scalability of an HDFS based file system. A total scale of such a system is roughly defined by the number of blocks, times the block size. If you could handle 100-200-million blocks, and they are each 128 MB in size, you can see the scale that a single name node can scale to. Using multiple name nodes is possible, but it's not supported by any commercial distribution at this time. MapR, on the other hand has three design numbers that are built in. First, is the block size, which is very small: 8 KB. This is much more typical of real time file systems.
Indeed, having very small blocks, which are very appropriate given the physics of how disks work, these small blocks allow the MapR file system to be real-time; to have very precise snapshots and mirroring capabilities. Having a small block allows a single disk rotation to involve many block updates. A small block in a centrally managed system would be a complete disaster because this total size of the file system would be limited to be so small. In MapR this is a fixed number, but that's not a big deal because it's so easy to concatenate them. MapR has a central number, called a chunk ... we couldn't say block, because we already said that. Chunks are the pieces that a file are broken up into to give file parallelism for read across the cluster.
The default in MapR is 256 MB; that's comparable to the default in HDFS. It looks forward a little bit, as opposed to backwards in terms of the growing size of disks. The chunk size in MapR is highly adjustable. The default can be set on a per directory level, and the value can vary from file to file. This is a big deal because this means MapR tends to be much more multi-tenant friendly. You could have certain applications setting it as low as say, 10 MB or even a MB; useful for some applications. Other applications might increase it, to say a GB in size. This is very nice. That allows you to have all the flexibility of parallelism at any scale you like, and it allows you to have many different values within the cluster, completely transparently to the administrator.
We have chunks and we have blocks; but MapR also has a thing called a container. That's what the container location database of course tracks, and the size of this is roughly 10-30 GB on average. This size is self-adjusting. There's no configuration set on that. Containers split when they need to, and they move around as they are required to, so there's no user adjustment there. Since we have a container location database which is very static and easily maintained, can be disk-based, it can have far, far more than 100 or 200-million entries in it. Contrast that with a name node which is limited by memory space. So we could have, say, a billion containers in a system, and each of them can be GBs in size. So a single MapR file system can easily scale well beyond the petabyte level into the exabyte range.
That's because, of course, the scale on the MapR side is roughly a number of containers ... not chunks, not blocks ... times the average container size. The number of containers allowable is much larger than blocks on a name node based system, and the size is much larger than the blocks; almost 1,000 times larger than the blocks in a name node system. So you get a very, very different sort of practical scale for this system. You also gain benefits in throughput. Every metadata operation in an HDFS system has to go via the name node. That's if you want to change the length of a file, add a file, remove a file; all of these have to be recorded centrally. If you have name node HA in operation, so-called HA, then you have to make those operations on multiple machines at the same time.
You have consistency issues, and you have a high throughput rate. With MapR, all of these file metadata operations happen at the directory level. There are many containers with directories in them, and so you have a very high throughput. We've measured throughput on the order of, on the close order of 14-million metadata operations per second, on a reasonably large MapR cluster. Contrast that with the maximum of around 500 operations per second on a name node based system. We get scalability, we get throughput, and we get HA; all because of these three numbers versus the single number. That's just an example of the architectural difference and the benefits that adhere at the architectural level to the MapR file system. Thanks very much.