After joining MapR back in 2009, I spent many months meeting with early Hadoop users and listening to their pain points. In many of these meetings, users described problems related the HDFS architecture and the NameNode in particular. In this blog post I wanted to share 10 NameNode-related issues that came up frequently in these meetings:
- We want HA, but the NameNode is a single point of failure. This results in downtime due to hardware failures and user errors. In addition, it is often non-trivial to recover from a NameNode failure, so our Hadoop administrators always need to be on call.
- We want to run Hadoop with 100% commodity hardware. To run HDFS in production and not lose all our data in the event of a power outage, HDFS requires us to deploy a commercial NAS to which the NameNode can write a copy of its edit log. In addition to the prohibitive cost of a commercial NAS, the entire cluster goes down any time the NAS is down, because the NameNode needs to hard-mount the NAS (for consistency reasons).
- We need both a NameNode and a Secondary NameNode. We read some documentation that suggested purchasing higher-end servers for these roles (e.g., dual power supplies). We only have 20 nodes in the cluster, so this represents a 15-20% hardware cost overhead with no real value (i.e., it doesn’t contribute to the overall capacity or throughput of the cluster).
- We have a significant number of files. Even though we have hundreds of nodes in the cluster, the NameNode keeps all its metadata in memory, so we are limited to a maximum of only 50-100M files in the entire cluster. While we can work around that by concatenating files into larger files, that adds tremendous complexity. (Imagine what it would be like if you had to start combining the documents on your laptop into zip files because there was a severe limit on how many files you could have.)
- We have a relatively small cluster, with only 10 nodes. Due to the DataNode-NameNode block report mechanism, we cannot exceed 100-200K blocks (or files) per node, thereby limiting our 10-node cluster to less than 2M files. While we can work around that by concatenating files into larger files, that adds tremendous complexity.
- We hired a new engineer who did not understand the architectural issues and ran a simple directory traversal (the equivalent of the find command). This created so much load on the NameNode that it simply crashed, and the entire cluster was down.
- We need much higher performance when creating and processing a large number of files (especially small files). Hadoop is extremely slow.
- We have had outages and latency spikes due to garbage collection on the NameNode. Although we are using the CMS (concurrent mark and sweep) garbage collector, the NameNode still freezes occasionally, causing the DataNodes to lose connectivity (i.e., become blacklisted).
- When we change permissions on a file (chmod 400 foo), the changes do not affect existing clients who have already opened the file. We have no way of knowing who the clients are. It’s impossible to know when the permission changes would really become effective, if at all.
- We have lost data due to various errors on the NameNode. In one case, the root partition ran out of space, and the NameNode crashed with a corrupted edit log.
When we looked at this list of NameNode-related problems, it was clear to all of us that the only viable solution was to eliminate the NameNode. Our engineering team spent two years re-architecting Hadoop’s storage layer (as well as advancing Hadoop’s MapReduce layer and developing the leading management suite for Hadoop).
The end result is that we have eliminated these 10 issues and many others. In my next blog post I’ll dive deeper into our no-NameNode architecture so that you can understand how it works, and why it really eliminates the issues with NameNode-based architectures (including all planned HDFS enhancements, such as HDFS Federation and HA NameNode). In the meantime, if you’ve run into other NameNode-related problems that I haven’t listed, let me know.