Handling Disk Failure in MapR FS – #WhiteboardWalkthrough

Editor's Note: In this week's Whiteboard Walkthrough, Abizer Adenwala, Technical Support Engineer at MapR, walks you through what a storage pool is, why disks are striped, the reasons a disk would be marked as failed, what happens when a disk is marked failed, what to watch out for before reformatting and re-adding a disk, and the best path to recover from disk failure.

Here's the transcription: 

Welcome, everyone, to a MapR Whiteboard Session. I'm going to talk about disk failures in the MapR file system. My name is Abizer Adenwala. I work in the MapR support group as a technical support engineer.

To begin with, what are storage pools? In the MapR-FS architecture, a node consists of multiple disks, and all these disks are divided into storage pools. Why do we need storage pools? If you have, say, 10 disks, you can have all of them in one storage pool, or you can have multiple storage pools. By default, each storage pool (SP) has 3 disks, and all the disks in a storage pool are striped with RAID 0. Why do we need striping? We stripe the disks to get better read and write performance. That's the reason the data is striped across the whole SP.
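
The number of disks per storage pool is chosen when the disks are formatted for MapR-FS. As a rough sketch only (the device names are placeholders, and the -W stripe-width flag of the disksetup utility should be verified against your MapR version), formatting a set of disks into storage pools of three might look like this:

    # Create a file listing the raw disks to give to MapR-FS (example device names).
    printf '%s\n' /dev/sdb /dev/sdc /dev/sdd > /tmp/disks.txt

    # Format the disks into storage pools of 3 disks each (striped with RAID 0 inside each SP).
    # -F forces the format; -W sets the stripe width, i.e. the number of disks per SP.
    sudo /opt/mapr/server/disksetup -F -W 3 /tmp/disks.txt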

Now, what are the reasons a disk would be marked as failed, and how do I know why it failed? Under the MapR logs, there's a faileddisk.log where, if a disk is taken offline, you will see the reason why it was taken offline. There are common scenarios where disks are taken offline. One is a CRC error, which means a block on that disk has been corrupted, so the MapR software has taken it offline. It could also be an I/O error, or an I/O timeout. An I/O timeout happens when the disk is slower than what the MapR software expects it to be. There is a property, mfs.io.disk.timeout, under mfs.conf, where you set how slow an I/O is allowed to be before we stop tolerating it. By default, it's 60 seconds: if an I/O takes more than 60 seconds, the disk will be taken offline. Subsequently, the whole SP goes offline.
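
As a quick illustration of the checks described above, assuming the default MapR install paths (adjust if your installation differs):

    # Why was the disk marked failed? The reason (CRC error, I/O error, I/O timeout, ...)
    # is logged on the affected node.
    cat /opt/mapr/logs/faileddisk.log

    # The I/O timeout that causes a slow disk to be taken offline is set in mfs.conf
    # (60 seconds by default).
    grep -i "mfs.io.disk.timeout" /opt/mapr/conf/mfs.conf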

Another reason you would see a disk go offline is if, for any reason, the disk disappears from the OS; you will see "No disk found" or "Device not found." If the disk disappears from the OS because of a hardware fault or a disk controller fault, the MapR software doesn't see that disk either, so the disk is taken offline and the SP itself goes offline.
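
If you suspect the disk has disappeared from the OS, standard Linux tools can confirm whether the kernel still sees it. This is only an illustration, with /dev/sdc as a placeholder device name:

    # Is the device still visible to the OS at all?
    lsblk | grep sdc

    # Look for controller or drive errors reported by the kernel.
    dmesg | grep -i "sdc"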

Now, what happens when a disk fails, given that all the disks are part of a storage pool? If one of the disks in a storage pool fails, all of the disks in that SP are taken offline. The reason is that your data is striped across the whole storage pool, so when one disk fails, it's like a hole in the SP and the data is no longer consistent. That's why the whole SP is taken offline, along with all the disks in it. Once the whole SP is offline, you will see a disk failure alarm, and if you run MapR utilities such as mrconfig sp list, you will see the SP as offline.
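
For example, to check the state of the storage pools on the node with the failed disk (assuming the default install path):

    # List the storage pools on this node; the SP containing the failed disk
    # will be reported as offline.
    sudo /opt/mapr/server/mrconfig sp list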

When a disk is taken offline, what do you expect, apart from a failed disk alarm? You may also see a volume under-replicated alarm, or a volume data unavailable alarm. If you see a data unavailable alarm, that's a red flag: it means at least one container of that volume doesn't have a valid master copy. What does that mean? Basically, if there was one disk that had some data on it, and it was the only disk that had that data, and that disk has gone offline, then you don't have that data. The possibility of that is very low, because by default all data in the file system has three replicas. But if, say, three disks on different nodes fail at the same time, then you will see a data unavailable alarm, which is risky. That's a red flag, and we should stop there.
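
To see which of these alarms are actually raised, you can query the cluster. The alarm names and the -entity option shown here follow the usual maprcli conventions, so confirm them against your version's documentation; the volume name is just a placeholder:

    # List all alarms currently raised in the cluster.
    maprcli alarm list

    # Narrow down to one volume and look for the under-replicated and red-flag cases.
    maprcli alarm list -entity myvolume | grep -iE "DATA_UNAVAILABLE|DATA_UNDER_REPLICATED"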

Another alarm you might see is data under replicated, which means that if I had three copies of my data and one copy goes offline, I now have two copies. Because my replication count of three is not met, that's the reason we see that alarm. Basically, what I will do is nothing. I just have to wait until this data gets re-replicated somewhere else, on some other SP on some other node.

OK…the path to recovery: how do I recover if I see a data unavailable alarm? Say I have one disk which has failed for some reason. I first have to go and check my faileddisk.log to see why my disk failed, whether it was an I/O error or a CRC error. There are ways to bring that SP back online. The first thing I want to do on that node is run mrconfig sp list and find the SP which is offline. Now that I have the SP name, because the whole SP is offline, I want to run the fsck utility on it. The reason is that when this SP was taken offline due to the failed disk, it was marked with an error flag, so MapR-FS doesn't bring this SP back online on its own, because we know there was some problem or inconsistency with it. So, depending on what error I see in faileddisk.log, I have to run fsck with different flags.

In most cases, if I run fsck with the repair option, it will check each block across the SP for consistency, and it will clear that error flag once fsck has completed successfully. After that, all I have to do is run mrconfig sp online with that SP name to bring the SP back online. That's it. Once I see the SP is online, my data is back. Now, I want to wait until my alarm clears, because if I had only one copy of that data, we want to wait until it replicates to some other nodes as well before we do any maintenance on this disk.
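
Putting the recovery path together, the sequence described above looks roughly like this. The SP name (SP1) is a placeholder taken from the sp list output, the paths assume a default install, and the exact fsck flags depend on the error reported in faileddisk.log and on your MapR version; depending on the version, mrconfig sp online may also expect the path of the SP's primary disk rather than the SP name, so treat this as a sketch rather than a recipe:

    # 1. Find out why the disk failed (CRC error, I/O error, I/O timeout, ...).
    cat /opt/mapr/logs/faileddisk.log

    # 2. Identify the offline storage pool on this node.
    sudo /opt/mapr/server/mrconfig sp list

    # 3. Run fsck against that SP with the repair option; it checks block consistency
    #    across the SP and clears the error flag when it completes successfully.
    sudo /opt/mapr/server/fsck -n SP1 -r

    # 4. Bring the SP back online, then wait for the alarms to clear before doing
    #    any hardware maintenance on the disk.
    sudo /opt/mapr/server/mrconfig sp online SP1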

What is the other scenario? The other scenario is when you have data under replicated. With data under replicated, there is no system admin activity that is urgent. All you have to do is wait for the MapR file system to self-heal. What that means is, if one of the disks failed and one SP was taken offline because of it, you know the same data exists on some other SP on some other node. Once the MapR software realizes that one copy of the data is not available, after one hour by default, it will go ahead and re-replicate that data on some other node, on some other SP. After it completes that re-replication, you won't see those alarms anymore; all you will see is the failed disk alarm. After that, you can take the disk out, do some hardware tests, and if you find there really was a problem with the disk, you can replace it, reformat it, add it back, and get it into MapR-FS again.
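
Once the failed disk alarm is the only one left and the hardware has been tested or replaced, the disk can be removed from and added back to MapR-FS. A minimal sketch using the maprcli disk commands, with the hostname and device name as placeholders (adding the disk back formats it and creates a new storage pool from it):

    # Remove the failed disk from MapR-FS on the affected node.
    maprcli disk remove -host node1.example.com -disks /dev/sdc

    # After replacing or testing the hardware, add the disk back so that
    # MapR-FS formats it and builds a new SP.
    maprcli disk add -host node1.example.com -disks /dev/sdc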

That's it for my whiteboard walkthrough. You can comment or post any questions to the link below. You can also follow us on Twitter, @MapR, #WhiteboardWalkthrough. Thanks for watching.

