Security and Big Data Governance
Securing data has been a daunting requirement for decades, and the explosion of big data only makes the objective harder. The challenge is not only about scale, but also about the shift from structured data to unstructured data. While implementing security measures remains a complex process, the stakes are continually raised as the ways to defeat security controls become more sophisticated.
Key advantages of the MapR Converged Data Platform include:
- Extensible authentication – Support for Linux Pluggable Authentication Modules (PAM) gives you the widest registry support for authenticating to your MapR cluster. (MapR also provides Kerberos integration.)
- Granular access controls – Access Control Expressions (ACEs) provide fine-grained permissions on MapR-DB at the table, column family, and column levels, using flexible Boolean expressions.
- Comprehensive auditing – Audit logs record which users took which actions and are written in JSON format, so you can understand user behavior and demonstrate compliance with regulations. You can query and analyze your audit data with Apache Drill, BI tools such as Tableau, or your existing security information and event management (SIEM) system.
- Volumes and snapshots – Volumes are logical partitions of your data sets, and snapshots create immutable views of them. Together they let you track the transformation history of your data and help support your data lineage, auditing, and retention/purging requirements.
Security in the MapR Converged Data Platform
To help you secure your data, MapR offers two broad approaches. First, at the product level, MapR adds capabilities to the MapR Platform to help you properly secure your Hadoop and NoSQL data to prevent unauthorized access. Second, at the solution level, MapR offers the capabilities that let you deploy a large-scale anomaly detection solution that alerts you to network intrusion, phishing, and other cyberattacks. For more information, please see our webpage on Security and Risk Management.
MapR aims to make security easier to implement while retaining the power and flexibility you need to secure your big data. MapR provides the following:
Authentication
Authentication (i.e., verifying the identity of users) in MapR ensures all platform operations are secured, including:
- User operations such as file reads and writes, database manipulations, and MapReduce job submissions
- Intra-cluster node-node interactions including remote procedure calls
- Inter-cluster operations such as mirroring
MapR provides two primary options for authentication:
Kerberos is a commonly used protocol for authenticating users on a computer system, including Hadoop clusters. Kerberos is a ticket-based system: the user first requests a ticket from the Kerberos server, and the issued ticket then serves as a trusted identifier for all services covered by that Kerberos server. The Kerberos integration in MapR lets you leverage your existing Kerberos infrastructure for authenticating users on your MapR cluster.
For customers who do not use Kerberos, MapR provides a native authentication mechanism that operates equivalently to Kerberos and offers simplified configuration. Because it leverages Linux Pluggable Authentication Modules (PAM), you get the widest registry support, including validation of username and password against /etc/passwd, LDAP, Kerberos, etc.
Authorization
Authorization entails the configuration of permissions for users. MapR provides sophisticated authorization controls to ensure that users can perform only the activities for which they have permissions, such as data access, job submission, cluster administration, etc. These permissions can be granted by an administrator via the browser-based MapR Control System (MCS) management and monitoring interface or via command line utilities.
Access Control Expressions
Access Control Expressions (ACEs) are a powerful and flexible mechanism to grant permissions on structured data stored in MapR-DB, the integrated NoSQL database in the MapR Converged Data Platform. ACEs give you more flexibility than standard access control lists (ACLs) because they are Boolean expressions that can combine AND, OR, and NOT logic when defining permissions. This flexibility lets you specify fine-grained access control at the column and/or column-family level in MapR-DB. Examples of ways you can grant permissions include:
- OR-based permissions found in standard ACLs
  - “Sales department” OR “marketing department”
- AND-based permissions
  - “VP level” AND “marketing department”
- Granular permissions
  - (“VP level” OR “director level”) AND (“sales department” OR “marketing department”) AND !(“John Doe”)
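The granular permission above can be sketched as a small Boolean evaluator. This is only an illustration of how AND, OR, and NOT compose into one access decision; it is not MapR's ACE syntax or API. The `User` class and the helper names (`in_group`, `is_user`, `any_of`, `all_of`, `negate`) are hypothetical and were chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record: a name plus the groups the user belongs to."""
    name: str
    groups: frozenset = frozenset()


# Each helper returns a predicate: a function that takes a User and returns a bool.
def in_group(group):
    return lambda user: group in user.groups


def is_user(name):
    return lambda user: user.name == name


def any_of(*predicates):  # Boolean OR
    return lambda user: any(p(user) for p in predicates)


def all_of(*predicates):  # Boolean AND
    return lambda user: all(p(user) for p in predicates)


def negate(predicate):  # Boolean NOT
    return lambda user: not predicate(user)


# ("VP level" OR "director level") AND
# ("sales department" OR "marketing department") AND !("John Doe")
granular = all_of(
    any_of(in_group("VP level"), in_group("director level")),
    any_of(in_group("sales department"), in_group("marketing department")),
    negate(is_user("John Doe")),
)

jane = User("Jane Roe", frozenset({"director level", "sales department"}))
john = User("John Doe", frozenset({"VP level", "marketing department"}))
sam = User("Sam Poe", frozenset({"marketing department"}))

print(granular(jane))  # True: director in sales, and not John Doe
print(granular(john))  # False: explicitly excluded by the NOT clause
print(granular(sam))   # False: in marketing but neither VP nor director
```

A plain ACL corresponds to a single `any_of(...)` over users and groups; the extra expressiveness comes from nesting `all_of` and `negate` around it.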
Unix File Permissions
For files and directories in MapR-FS, you can leverage standard Unix-style permissions to grant access to authorized users. Since MapR-FS is a POSIX file system with full read/write capabilities, it can be accessed the same way that Linux file systems are accessed. This means existing file-based Linux applications can access files in MapR without any code changes or recompilation.
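Because access goes through ordinary POSIX calls, a script can inspect and set permissions with standard library functions. The sketch below uses a temporary local file as a stand-in; on a real cluster the path would point into wherever MapR-FS is mounted, which is an assumption not shown here. The function names are hypothetical.

```python
import os
import stat
import tempfile


def describe_access(path):
    """Return the permission string and a few individual permission bits."""
    mode = os.stat(path).st_mode
    return {
        "mode": stat.filemode(mode),
        "owner_can_write": bool(mode & stat.S_IWUSR),
        "group_can_read": bool(mode & stat.S_IRGRP),
        "others_can_read": bool(mode & stat.S_IROTH),
    }


def restrict_to_owner_and_group(path):
    """Owner may read and write, group may read, others get nothing (0640)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)


# Stand-in for a file on a mounted file system.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "report.csv")
    with open(sample, "w") as handle:
        handle.write("id,amount\n")
    restrict_to_owner_and_group(sample)
    print(describe_access(sample))
```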
Access Control Lists
MapR supports access control lists (ACLs) to grant permissions for performing administration tasks at both the cluster and the volume level. Examples of tasks include starting/stopping services, creating volumes, creating mirrors, and changing mirror properties. MapR ACLs also control which users and groups can perform specified tasks on specified job queues, including the ability to submit, kill, or reprioritize jobs.
Auditing
The auditing capabilities in MapR are critical for regulatory compliance, as well as for understanding user behavior in the system. Regulations often require the ability to prove which user accessed which data, and logging user behavior helps in several situations, including identifying suspicious activities on sensitive data.
MapR records accesses to, and operations on, data objects (files, directories, and MapR-DB tables) that are enabled for auditing. It also records commands run on the command line (maprcli), including those that modify the configuration of a MapR cluster. Log entries are written in JSON format and can be analyzed with Apache Drill, your security information and event management (SIEM) solution, or other third-party tools. Log files are retained for as long as you specify.
There are four types of auditing in MapR:
- maprcli commands that are related to cluster management
- Authentications to the MapR Control System (MCS)
- Operations on directories and files
- Operations on MapR-DB tables
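Since audit entries are JSON, even a short script can summarize them. The sketch below is a minimal illustration, not a description of the actual MapR audit schema: the field names (`timestamp`, `user`, `operation`, `path`, `status`), the one-object-per-line layout, and the convention that a nonzero `status` means a failed operation are all assumptions made for this example.

```python
import json
from collections import Counter

# Hypothetical audit entries, one JSON object per line.
SAMPLE_LOG = """
{"timestamp": "2016-01-05T10:15:00Z", "user": "alice", "operation": "READ", "path": "/finance/q1.csv", "status": 0}
{"timestamp": "2016-01-05T10:16:12Z", "user": "bob", "operation": "WRITE", "path": "/finance/q1.csv", "status": 13}
{"timestamp": "2016-01-05T10:16:40Z", "user": "bob", "operation": "DELETE", "path": "/finance/q1.csv", "status": 13}
{"timestamp": "2016-01-05T10:20:03Z", "user": "carol", "operation": "READ", "path": "/hr/salaries.csv", "status": 13}
"""


def parse_audit_lines(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def failed_operations_by_user(records):
    """Count entries with a nonzero status per user (assumed to mean failure)."""
    return Counter(r["user"] for r in records if r["status"] != 0)


records = parse_audit_lines(SAMPLE_LOG)
print(failed_operations_by_user(records).most_common())
# [('bob', 2), ('carol', 1)]
```

Repeated failures by one user on a sensitive path are the kind of pattern a SIEM rule or a Drill query would surface at scale.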
Encryption
MapR supports data encryption as an additional means of preventing unauthorized access to sensitive data. Encryption helps protect against breaches such as packet sniffing and theft of storage devices.
To avoid data theft by packet sniffing, over-the-wire encryption is available between MapR nodes, between applications and a MapR cluster, as well as across an NFS connection from an edge node to a MapR cluster using the MapR POSIX client.
Encryption at Rest
Encryption at rest not only prevents unauthorized users from accessing sensitive data, but it also protects against data theft via sector-level disk accesses. This type of theft might occur when storage devices are physically stolen, or when storage in cloud environments is eventually reallocated to another user. Encryption is often done on specific subsets of data, and MapR volumes enable the logical partitioning of data to allow clear delineation of secure data from open data. Encrypting drives and encryption device drivers are easy and proven ways to implement encryption at rest on MapR, since its file system component, MapR-FS, behaves identically to regular file systems. Alternatively, encrypting and decrypting data at rest, including the use of key management solutions, can be handled by MapR Advantage Partners specializing in data security.
Field-level encryption enables securing specific sections of data residing in files. This capability logically behaves like access controls on a structured data set in a database management system. Some data elements in the files will remain open, while the secured data elements will be encrypted, and can be decrypted by authorized users when used in conjunction with key management technologies. Field-level encryption is provided by MapR Advantage Partners specializing in data security.
Format-preserving Encryption and Masking
Format-preserving encryption (FPE) is a mechanism for encrypting data so that the format generally remains the same. This allows applications to access data that looks legitimate, instead of the typically garbled text that encryption outputs. This technique is particularly useful for analytical tasks that require readability in the encrypted data elements. Masking is similar in that it replaces sensitive data elements with an unidentifiable value, but it is not truly an encryption technique, so the original value cannot be recovered from the masked value.
A significant benefit of these techniques is that the cost of securing a big data deployment is reduced. As secure data is migrated from a secure source into your platform, FPE or masking reduces the need for applying additional security controls on that data while it resides in your platform.
Both of these techniques are available from MapR Advantage Partners specializing in data security.
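To make the masking idea concrete, here is a minimal sketch of one common variant: digits are replaced with a fixed character while separators stay in place, so the value keeps its shape. Keeping the last few digits visible is a design choice for this example, and the function name and parameters are hypothetical. This is not format-preserving encryption, and it does not represent any partner product; the masked digits cannot be recovered.

```python
def mask_digits(value, keep_last=4, mask_char="X"):
    """Replace all but the last `keep_last` digits with `mask_char`.

    Non-digit characters such as hyphens and spaces are left untouched,
    so the masked value has the same length and layout as the input.
    """
    total_digits = sum(ch.isdigit() for ch in value)
    remaining_to_mask = max(total_digits - keep_last, 0)
    masked = []
    for ch in value:
        if ch.isdigit() and remaining_to_mask > 0:
            masked.append(mask_char)
            remaining_to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


print(mask_digits("4111-1111-1111-1234"))      # XXXX-XXXX-XXXX-1234
print(mask_digits("123-45-6789"))              # XXX-XX-6789
print(mask_digits("123-45-6789", keep_last=0)) # XXX-XX-XXXX
```

Setting `keep_last=0` masks every digit, which is closer to the fully unidentifiable value described above.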
Data Governance in the MapR Converged Data Platform
Data governance is an important discipline for MapR customers. Requirements can vary greatly from company to company, so it’s important for stakeholders to define the people and processes that drive governance, and use technology as an enabler.
While data governance is a practice that covers your entire data architecture, some of its components apply directly to data in the MapR Platform. One framework for data governance in MapR that you can follow is shown below:
Note that this is not an all-encompassing framework, but it highlights the priorities for big data deployments. The components listed above are addressed by both native MapR capabilities as well as value-adding partner technologies. Examples of how MapR can address the above requirements include:
Data Integration – MapR support for data integration includes Apache open source projects such as Sqoop and Flume. Support for POSIX NFS on Hadoop, unique to MapR, lets you quickly and easily load data into MapR as if it were a network attached storage (NAS) device, speeding your time to value on imported data.
Security – As mentioned earlier, MapR provides extensible authentication, powerful authorization controls, comprehensive auditing, and data encryption to deploy a secure big data cluster.
Data Lineage – MapR customers track lineage by creating staging areas defined by MapR volumes. An example configuration might include a landing zone where raw incoming data is first imported, followed by a data cleansing/preparation stage, followed by processing by data scientists, and finally a trusted and query-ready data zone. MapR snapshots can capture exact, immutable views of staging areas so that the history of the data transformations is preserved. When business stakeholders raise questions about data veracity, the lineage can be traced back through the staging areas to identify how the final data sets arose. (See image below.)
Information Lifecycle Management – MapR customers can set policies around data retention, archiving, and purging via MapR volumes, which provide a context for those policies on associated data sets. Mirroring enables disaster recovery strategies with low recovery point objectives (RPOs) and low recovery time objectives (RTOs) to handle your stringent business continuity requirements.
Auditing – As mentioned earlier, the comprehensive auditing capabilities in MapR let you track which users take what actions in a MapR cluster.