Securing Your Data: How to Audit a MapR Cluster – Whiteboard Walkthrough

In this week's Whiteboard Walkthrough, Mitesh Shah, Product Management at MapR, describes how you can track administrative operations and data accesses in your MapR Converged Data Platform in a seamless and efficient way with the built-in auditing feature.

Here is the unedited transcription: 

Hello everybody, this is Mitesh Shah, Product Manager for Security and Data Governance at MapR. Today I'd like to talk to you about auditing in the MapR Converged Data Platform. Auditing, as you're probably aware, is one of the four key pillars of security, the other three being authentication, authorization, and data protection. Auditing can be useful for a variety of reasons. One is security: maybe there's a breach, perhaps something bad happened in the cluster, and you want to go back in time and review what happened. Well, auditing and logs can certainly help with that.

Another reason could be regulatory compliance. Some compliance frameworks, like HIPAA or PCI, for example, may require that you actually log certain information, and MapR auditing can help with that. The third reason could be simply building a data heat map of sorts: understanding what data is being used or not used in your cluster, so that you can start moving data that isn't used off to, perhaps, less dense disks.

What I'd like to talk to you about today is a little bit of detail around what an audit log actually looks like, so I'll walk through an example specifically around data access. There are other types of auditing that can be performed in a cluster, and we'll talk about those briefly. Then we'll move into configuration options for MapR auditing, things like the coalesce interval, max size, or retention periods. These settings really help with specifying what is logged and how it is logged, and in some cases they can help with performance as well.

Let's walk through an example of data access, where we've got a user named J. Smith; that's his username. He's trying to read a file. It's a simple operation: J. Smith reading a file. Well, with that single operation quite a bit of information is actually logged. In this case, you see there's a timestamp, so it's basically telling you the who, what, when, where, everything but the why. In the timestamp, you see J. Smith has come in on February 16th at 11:15. You see there's a Z at the very end, indicating the Zulu time zone. It can also be called GMT or UTC, for example; they're all the same thing, the same time zone.

You can see that J. Smith is actually performing a read operation on the file. You see two things here on the next line: one is his username, J. Smith, telling you who performed the action, and the other is his user ID. In fact, in the raw format you're only going to see the user ID, and we'll get to this in a second, and you'll see the source file ID (FID) and the volume ID as well. You don't actually see the full source path and the raw file names; you'll have to go in and use a tool called the expandaudit utility to expand out some of this information and see what you're seeing here.

Getting back to this example, you'll see J. Smith is performing the operation, and he's coming in from a particular source IP address. You also see his source path. Again, this is a translated source path from the FID. Now, the reason we log things like UIDs, FIDs, and volume IDs rather than the source path is, again, for performance reasons, but once they're translated you can see that the file being read is QM Report.csv and the volume name is Finance.
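To make the raw-versus-expanded distinction concrete, here is a small sketch of a data-access record and the kind of ID-to-name translation the expandaudit utility performs. The field names, IDs, IP address, and paths below are all illustrative, not MapR's exact schema, which can vary by release:

```python
import json

# An illustrative raw data-access audit record, logged as JSON.
# At log time MapR records IDs (uid, srcFid, volumeId) for performance.
raw_record = json.loads("""
{
  "timestamp": "2016-02-16T11:15:01.640Z",
  "operation": "READ",
  "uid": 10001,
  "ipAddress": "10.10.30.140",
  "srcFid": "2049.51.266582",
  "volumeId": 68048396,
  "status": 0
}
""")

# Hypothetical lookup tables of the kind an expansion tool builds to
# translate IDs back into human-readable names and paths.
uid_to_name = {10001: "jsmith"}
volid_to_name = {68048396: "finance"}
fid_to_path = {"2049.51.266582": "/finance/QMReport.csv"}

# Expand the raw record into the human-readable form described above.
expanded = dict(raw_record)
expanded["user"] = uid_to_name[raw_record["uid"]]
expanded["volumeName"] = volid_to_name[raw_record["volumeId"]]
expanded["srcPath"] = fid_to_path[raw_record["srcFid"]]

print(expanded["user"], expanded["operation"], expanded["srcPath"])
```

The who (user), what (operation), when (timestamp), and where (IP, path, volume) are all present; only the why is missing.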

Now, one interesting bit of information here is the status code. This status code is zero. That might throw some people off, but zero means success. These status codes map to Linux error codes and, in this case, zero does mean that J. Smith was successfully able to read this file. So, that's data access in a nutshell. Obviously there are other types of auditing operations that can get logged. I mentioned two of them here: admin actions, which are basically any administrative actions performed on your cluster, as well as authentication requests. Any time somebody authenticates from the command line or via MCS, the GUI portal, those actions get logged, whether they are successful or failed attempts.
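Because the status codes follow the Linux errno convention, you can translate them with the standard library. A minimal sketch (the `describe_status` helper is my own, not part of MapR):

```python
import errno
import os

def describe_status(status: int) -> str:
    """Translate an audit-log status code (Linux errno convention) to text."""
    if status == 0:
        return "success"  # zero means the operation succeeded
    name = errno.errorcode.get(status, "UNKNOWN")
    return f"{name}: {os.strerror(status)}"

print(describe_status(0))   # the READ in our example: success
print(describe_status(13))  # EACCES, i.e. the user lacked permission
```

A nonzero status is just as useful as a zero one: a run of EACCES entries against a sensitive file is exactly the kind of thing a security review looks for.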

Moving now to configuration options for auditing. Well, the first thing you need to know and do is actually turn it on. Auditing is available in MapR by default, but you actually do need to turn it on; we don't turn it on by default, and that's for performance reasons. Now, you can go all the way down to the file level to specify exactly which files you want audited or don't want audited, the idea being that some of your files may be more sensitive, and for those files you want to actually log accesses. Alternatively, you may want everything to be logged, and that's fine too. What you need to do in that case is turn on auditing at the volume level and then turn on auditing at the root path of that volume. Then any new files that go into that volume will automatically get audited.
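As a rough sketch, enabling auditing at each level looks something like the commands below. The volume name and path are hypothetical, and maprcli syntax varies a bit between releases, so check the documentation for yours:

```shell
# Enable auditing of cluster-level admin operations.
maprcli audit cluster -enabled true

# Enable data-access auditing cluster-wide (a prerequisite for
# volume- and file-level auditing).
maprcli audit data -enabled true

# Turn on auditing for a specific volume...
maprcli volume audit -name finance -enabled true

# ...and then on the volume's root path, so that new files created
# under it are audited automatically.
hadoop mfs -setaudit on /finance
```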

The coalesce interval is a parameter that can help with performance. What it means is that if I specify a coalesce interval of 30 minutes, for example, and J. Smith comes in and does a read operation against the same file from the same source IP address within that same 30-minute period, that access is logged just once. Logging it just once reduces writes to disk, so this really is a parameter that can help with performance and reduce the amount of redundant information in your audit logs.
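The coalescing rule can be sketched in a few lines. This is purely an illustration of the behavior described above, not MapR's implementation; the dedup key (user, operation, file, source IP) follows the description in the paragraph:

```python
from datetime import datetime, timedelta

COALESCE_INTERVAL = timedelta(minutes=30)

# Last time each (uid, operation, fid, ip) combination was logged.
last_logged = {}

def should_log(uid, operation, fid, ip, now):
    """Emit a record only if this exact access wasn't logged recently."""
    key = (uid, operation, fid, ip)
    prev = last_logged.get(key)
    if prev is not None and now - prev < COALESCE_INTERVAL:
        return False  # duplicate within the window: coalesced away
    last_logged[key] = now
    return True

t0 = datetime(2016, 2, 16, 11, 15)
print(should_log(10001, "READ", "2049.51.266582", "10.10.30.140", t0))                          # True
print(should_log(10001, "READ", "2049.51.266582", "10.10.30.140", t0 + timedelta(minutes=10)))  # False
print(should_log(10001, "READ", "2049.51.266582", "10.10.30.140", t0 + timedelta(minutes=40)))  # True
```

The middle read is suppressed because it repeats an access already logged within the window; once the interval has elapsed, the next identical access is logged again.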

The third parameter I want to talk about is max size. It's simply specified in gigabytes, and it indicates when you'll see certain alarms go off in MCS. For example, if you set it to 32 gigabytes, once your audit log reaches 32 gigabytes in size, you'll start getting alarms in MapR. The retention period is exactly what it sounds like: if you set your retention period to 30 days, then once log files reach that 30-day age they start getting purged off the system. In other words, log files that are older than 30 days are moved off the system and deleted.
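Both settings are passed when enabling data auditing. A hedged sketch, assuming the 32 GB alarm threshold and 30-day retention from the example above (flag names may differ in your release, so verify against the maprcli documentation):

```shell
# maxsize is in gigabytes: raise an alarm in MCS once the audit log
# reaches 32 GB. retention is in days: purge logs older than 30 days.
maprcli audit data -enabled true -maxsize 32 -retention 30
```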

Last but not least is the selective auditing feature, which can also help reduce the size of your audit logs and the amount of information you're seeing in them. Now, there could be a lot of reasons that your audit logs grow to be quite large. One of those reasons could be that you've got a lot of users in the cluster; potentially you've got a lot of files in the cluster. All of that can make for very voluminous audit logs. One thing that can really help reduce their size is selective auditing.

You see on the "before" side here that basically everything is logged, whether it's a lookup of permissions or a make-directory, a read, a write; every action is logged. But you may not want that. You may only want to know when something is read or when something is written. In that case, you can use selective auditing to specify, "I only want to see reads and writes," and so you see the "after" picture here of just reads and writes being logged.
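The before/after effect can be illustrated with a simple filter. This is not how MapR implements selective auditing, which suppresses the operations at the source rather than after the fact; it only shows the resulting difference in log volume. Operation names are illustrative:

```python
# "Before": every operation type appears in the audit stream.
records = [
    {"operation": "LOOKUP", "user": "jsmith"},
    {"operation": "MKDIR",  "user": "jsmith"},
    {"operation": "READ",   "user": "jsmith"},
    {"operation": "WRITE",  "user": "jsmith"},
]

# "After": selective auditing keeps only the operations you care about.
WANTED_OPS = {"READ", "WRITE"}
selected = [r for r in records if r["operation"] in WANTED_OPS]

for r in selected:
    print(r["user"], r["operation"])  # only the READ and WRITE survive
```

Even this toy stream shrinks by half; on a cluster with many users and files, restricting the audited operations makes a much larger difference.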

That's auditing in a nutshell. The next thing to do is just give it a shot and try it. If you have any questions, please indicate them in the comment section below. If you have any ideas for additional Whiteboard Walkthroughs, write them in the comments below as well.


