Organizations seek to share IT resources cost-efficiently and securely among multiple applications, data sets, and user groups. Platforms that support this architecture are commonly known as multi-tenancy technologies. Big data platforms are increasingly expected to support multi-tenancy out-of-the-box. The key to multi-tenancy is isolating the distinct tenants from one another, both in the data stored on the platform and in the compute resources they use.
Use Cases Overview
Multi-tenancy is useful and critical in a range of cases; two typical use cases are:
- Enterprise Data Hub. Often, organizations start using MapR in a specific area, for example, data warehouse optimization for the accounting department. Soon, the marketing department wants to also run an app for churn prediction on MapR. Ultimately, questions arise as to who has access to which data and for what purposes, also taking into account regulatory issues (e.g., the Sarbanes–Oxley Act in banking or data protection legislation throughout the industry).
- Software/Platform/Infrastructure-as-a-Service. Some organizations provide IT services—such as Hadoop-as-a-Service—to internal or external customers. A basic requirement of such service providers is to isolate customers while achieving guaranteed SLAs, be it in terms of availability or latency. In addition to this, a service provider may require the flexibility to be able to run parts of the multi-tenancy deployment in a hybrid cloud setup, for example, to benefit from the elasticity of public clouds.
The MapR Offering
The MapR Distribution for Apache™ Hadoop® offers multi-tenancy out-of-the-box. It provides powerful features to logically partition a physical cluster to provide separate administrative control, data placement, job execution, and network access. Volumes, a feature unique to MapR, are the foundation of multi-tenancy. They provide a way to organize data and apply different policies to different data sets, applications, and users/groups. A single cluster can have many volumes, up to hundreds of thousands.
In a typical deployment, the data for each user, group, application or business unit is grouped into a single volume so that it can be managed separately from the data of other users, groups, applications, and business units.
Other Hadoop distributions do not support volumes, so policies can only be defined at the file or directory level (too granular) or at the cluster level (too coarse). As a workaround, organizations using other Hadoop distributions create separate physical clusters for each tenant, which adds architectural complexity and thus a higher risk of error and failure.
Multi-tenancy in MapR also has significant total cost of ownership (TCO) advantages, allowing organizations to leverage a single cluster for multiple use cases rather than maintaining a large number of isolated clusters. This reduces overall administrative overhead and enables higher utilization of a common resource pool.
Data Placement Control
With MapR, you can restrict a volume to a subset of a cluster’s nodes. This provides the ability to isolate sensitive data/applications, as well as the ability to leverage heterogeneous hardware. For example, data placement control can be used to keep personally identifiable information (PII) data on separate nodes that have self-encrypting drives, or to keep Apache™ HBase® data on nodes that have SSDs.
Further, it can be used for more advanced storage tiering policies, such as keeping old data on nodes that have a higher storage capacity and less compute power, and hence a lower cost per TB storage. In combination with MapR pluggable services, this feature also enables administrators to designate specific nodes for a given application/service, such as Spark/Shark, effectively creating a “mini-cluster” within the larger cluster for the purpose of guaranteeing SLAs and resource availability.
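The placement idea above can be sketched as a small model. This is an illustration only, not MapR code or the MapR API: the `Node` and `Volume` classes, the topology path labels such as "/data/ssd", and the node names are all hypothetical. It assumes each node carries a hierarchical label and that a volume may only be placed on nodes at or beneath the label assigned to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    topology: str  # hypothetical hierarchical label, e.g. "/data/ssd"


@dataclass(frozen=True)
class Volume:
    name: str
    topology: str  # subtree of nodes allowed to hold this volume's data


def under(path: str, prefix: str) -> bool:
    """Return True if path equals prefix or lies beneath it in the hierarchy."""
    prefix = prefix.rstrip("/")
    return prefix == "" or path == prefix or path.startswith(prefix + "/")


def eligible_nodes(volume: Volume, nodes: list[Node]) -> list[str]:
    """Names of the nodes on which this volume's data may be placed."""
    return [n.name for n in nodes if under(n.topology, volume.topology)]


# Hypothetical cluster: two SSD nodes, one node with self-encrypting
# drives, and one high-capacity archive node.
NODES = [
    Node("node1", "/data/ssd"),
    Node("node2", "/data/ssd"),
    Node("node3", "/data/encrypted"),
    Node("node4", "/data/archive"),
]

hbase = Volume("hbase-tables", "/data/ssd")
pii = Volume("customer-pii", "/data/encrypted")
scratch = Volume("scratch", "/data")

print(eligible_nodes(hbase, NODES))
print(eligible_nodes(pii, NODES))
print(eligible_nodes(scratch, NODES))
```

In this sketch, a volume assigned to a broad label such as "/data" may use every node, while a volume assigned to "/data/encrypted" is confined to the nodes with self-encrypting drives.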
Job Placement Control
MapR provides the ability to restrict a specific job, or jobs from a specific user or group, to a subset of the nodes in the cluster. This enables administrators to guarantee SLAs for specific applications, and to create separation between different applications or business units. This also allows administrators to designate a small subset of the nodes for low-priority jobs (such as experiments) or jobs that require access to external systems through the corporate firewall.
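As a rough illustration of this kind of restriction, the sketch below resolves the set of nodes a job may run on. It is a toy model under stated assumptions, not the MapR scheduler: the node labels, the `USER_LABEL` and `GROUP_LABEL` tables, and the precedence order (job, then user, then group) are all hypothetical choices made for the example.

```python
from typing import Optional

# Hypothetical labels attached to each node by an administrator.
NODE_LABELS = {
    "node1": {"production"},
    "node2": {"production"},
    "node3": {"experimental"},
    "node4": {"experimental", "gateway"},
}

# Hypothetical policy tables mapping users and groups to a node label.
USER_LABEL = {"alice": "production"}
GROUP_LABEL = {"data-science": "experimental"}


def candidate_nodes(user: str, group: str, job_label: Optional[str] = None) -> list[str]:
    """Return the nodes a job may run on.

    Assumed precedence: a label set on the job itself wins, then the
    user's label, then the group's label. With no label at all, the
    job may run anywhere in the cluster.
    """
    label = job_label or USER_LABEL.get(user) or GROUP_LABEL.get(group)
    if label is None:
        return sorted(NODE_LABELS)
    return sorted(n for n, labels in NODE_LABELS.items() if label in labels)


print(candidate_nodes("alice", "finance"))
print(candidate_nodes("bob", "data-science"))
print(candidate_nodes("bob", "data-science", job_label="gateway"))
```

Here the "gateway" label stands in for the small subset of nodes allowed to reach external systems through the firewall, and "experimental" for the nodes set aside for low-priority jobs.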
Access Control and Security
MapR provides fine-grained access control based on Access Control Expressions (ACEs) for tables, column families, and columns; POSIX access control lists for files; and strong, role-based access control (RBAC) for tables, column families, and columns.
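To make the idea of an access control expression concrete, the sketch below evaluates a boolean expression over users and groups. The syntax used here (`u:name`, `g:name`, with `!`, `&`, `|`, and parentheses) is a simplified form assumed for illustration, and the `allows` function is hypothetical; it is not MapR's implementation.

```python
import re

# One token: a parenthesis, an operator, or a u:/g: principal.
TOKEN = re.compile(r"\s*([()!&|]|[ug]:[\w.-]+)")


def tokenize(expr: str) -> list[str]:
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def allows(expr: str, user: str, groups: set[str]) -> bool:
    """Evaluate expr for a user; precedence is ! over & over |."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of expression")
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok.startswith("u:"):
            return tok[2:] == user
        if tok.startswith("g:"):
            return tok[2:] in groups
        raise ValueError(f"unexpected token {tok!r}")

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


print(allows("u:alice | g:marketing", "bob", {"marketing"}))
print(allows("g:accounting & !u:mallory", "mallory", {"accounting"}))
```

An expression such as "g:accounting & !u:mallory" grants access to every member of one group except a named user, which is the kind of rule that is awkward to express with owner/group/other permissions alone.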
MapR provides cryptographically secure wire-level authentication and encryption. Organizations that have a Kerberos infrastructure can leverage it for authentication, while organizations that do not have a Kerberos infrastructure can leverage an integrated, key-based scheme that provides the same security without the complexity associated with deploying and managing Kerberos.
Administration and Reporting
From an administrative perspective, MapR allows organizations to define and enforce storage, CPU, and memory quotas at the volume, user, and group levels. MapR also reports resource usage on over 60 different metrics, which is especially relevant for service providers that need accurate usage and billing information. These reports are available through the browser-based MapR Control System (MCS) user interface and, for upstream integration, through the command-line interface and the REST API.
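The way quotas at several levels combine can be sketched as follows. This is a simplified model that covers storage only and is not MapR's enforcement logic: the `QUOTA_GB` and `USED_GB` tables, the tenant names, and the figures are hypothetical. It assumes a write is admitted only if it stays within every quota that applies to it.

```python
# Hypothetical storage quotas in GB, keyed by (level, name).
QUOTA_GB = {
    ("volume", "marketing"): 500,
    ("user", "alice"): 100,
    ("group", "analysts"): 300,
}

# Hypothetical current usage in GB for the same keys.
USED_GB = {
    ("volume", "marketing"): 450,
    ("user", "alice"): 20,
    ("group", "analysts"): 280,
}


def violated_quotas(request_gb: int, volume: str, user: str, group: str) -> list[tuple[str, str]]:
    """Return every quota that a write of request_gb would exceed.

    An empty list means the write fits within all applicable quotas.
    Levels without a configured quota are treated as unlimited.
    """
    scopes = [("volume", volume), ("user", user), ("group", group)]
    return [
        scope
        for scope in scopes
        if scope in QUOTA_GB and USED_GB.get(scope, 0) + request_gb > QUOTA_GB[scope]
    ]


print(violated_quotas(10, "marketing", "alice", "analysts"))
print(violated_quotas(30, "marketing", "alice", "analysts"))
```

Returning the list of exceeded quotas, rather than a single yes or no, also gives a service provider something concrete to show in a usage or billing report.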
Optimal enterprise architectures support distinct applications, data, and user groups in the same cluster, and the MapR Distribution for Hadoop offers multi-tenancy support out-of-the-box. Volumes make this possible by isolating both compute and storage, complemented by the security and reporting features described above. Many MapR customers across different verticals have deployed multi-tenant applications into production since 2011.