Get Real with Hadoop: True Multi-tenancy

In this blog series, we’re showcasing the top 10 reasons customers are turning to MapR in order to create new insights and optimize their data-driven strategies. Here’s reason #4: MapR provides true multi-tenancy with job isolation, volumes, quotas, data and job placement control, including for YARN.

Multi-tenancy is the ability of a single instance of software to serve multiple tenants. A tenant is a group of users that have the same view of the system. Hadoop, as an enterprise data hub, naturally demands multi-tenancy. Creating different instances of Hadoop for various users or functions is not acceptable as it makes it harder to share data across departments and creates silos.

From an administrator’s perspective, the requirements for multi-tenancy are to:

  • Ensure SLAs are met
  • Guarantee isolation
  • Enforce quotas
  • Establish security and delegation
  • Ensure low-cost operations and simple manageability

The MapR multi-tenant architecture provides a way for you to address these requirements using industry-leading capabilities.

  • Protecting the system
    • MapR includes several mechanisms to protect against runaway jobs. Many of you may have experienced situations in which the tasks of a poorly designed job consume too much memory and, as a result, the nodes start swapping and quickly become unavailable. In MapR, each task has an upper bound on memory usage, and tasks that exceed this limit are automatically killed with an out-of-memory exception (a generic sketch of this idea follows this list).
    • Quotas on disk usage can be set on a per-user as well as a per-volume basis. MapR also provides flow control on each node: once a certain number of entries are queued, client-side RPCs are throttled back.
    • MapR reserves memory for critical system services and ensures that they are never starved out of minimum memory requirements in the event of a runaway job.
    • Critical system services are also run at a higher priority (lower nice level) to ensure that they get their share of CPU.
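To make the value of a hard per-task memory ceiling concrete, here is a small, generic OS-level sketch in Python (Unix-like systems only). It is not MapR’s enforcement mechanism, which is handled by the platform itself, and the 512 MB limit is an arbitrary value chosen for illustration: a process with a hard ceiling fails fast with an out-of-memory error instead of dragging the whole node into swap.

    import resource

    # Arbitrary ceiling for illustration; MapR enforces task limits at the
    # platform level, not in application code like this.
    LIMIT_BYTES = 512 * 1024 * 1024

    def run_worker():
        # Cap the process address space so a runaway allocation raises
        # MemoryError quickly instead of pushing the node into swap.
        resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))
        data = []
        try:
            while True:
                data.append(bytearray(10 * 1024 * 1024))  # grab 10 MB chunks
        except MemoryError:
            print("worker hit its memory ceiling and stopped cleanly")

    if __name__ == "__main__":
        run_worker()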
  • Per tenant policy controls
    • MapR provides volumes. Volumes are logical storage and policy management constructs that contain a MapR cluster’s data. Volumes are typically distributed over several nodes in the cluster.
    • Typical use cases include volumes for specific users, projects, departments, and development and production environments. For example, if you need to organize data for a special project, you can create a specific volume for the project. The figure below shows two lines of business (retail and trading), each having its own volumes. Additionally, each retail and trading user could have their own volume as well. You can mount volumes under other volumes to build a structure that reflects the needs of your organization. The volume structure defines how data is distributed across the nodes.
    • Volumes are great at providing policy management at a logical level. Volumes can be used to:
      • Establish ownership and accountability. Specific permissions can then be granted to other users or groups.
      • Enforce Quotas. You can associate a standard volume with an accountable entity and set quotas, which can be advisory or enforced (see the volume-creation sketch after this list).
      • Data Placement Control. You can specify which rack or nodes the volume will occupy by selecting a topology for the volume.
      • Disaster Recovery. Each volume can have its own mirror schedule. Data that requires a lower RPO can have a more aggressive schedule than data that can accept a higher RPO.
      • Data Protection. Each volume can have its own snapshot schedule. Every department has its own data protection requirement and volumes can be used to achieve that.
    • Volumes are easy to create and administer through the MapR Control System dashboard. The figure below illustrates how to create a volume and specify many multi-tenancy properties.
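As a rough sketch of what per-tenant volume provisioning can look like when automated, the Python snippet below calls the MapR Control System REST API to create a volume with a hard quota, an advisory quota, and a topology. The host name, credentials, endpoint, and parameter names are assumptions modeled on the maprcli volume create command; check the REST API documentation for your MapR release before relying on them.

    import requests

    MCS = "https://mcs-node.example.com:8443"   # MapR Control System node (assumed)
    AUTH = ("mapr", "mapr-password")            # cluster admin credentials (assumed)

    params = {
        "name": "retail-analytics",     # tenant-specific volume
        "path": "/retail/analytics",    # mount point in the global namespace
        "quota": "500G",                # hard quota: writes are rejected beyond this
        "advisoryquota": "400G",        # advisory quota: raises an alert only
        "topology": "/data/rack1",      # data placement: pin the volume to a rack
    }

    # Endpoint and parameter names mirror "maprcli volume create" and are
    # assumptions here; verify against your MapR version's REST API docs.
    resp = requests.post(f"{MCS}/rest/volume/create", params=params,
                         auth=AUTH, verify=False)
    resp.raise_for_status()
    print(resp.json())

Scripting volume creation this way makes it easy to stamp out a consistent volume-per-tenant layout as new users, projects, or departments are onboarded.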
  • Meeting SLAs
    • Ability to run workloads with different SLAs on the same cluster
      • MapR enables running both analytical and operational workloads on a single cluster. The common recommendation is to run HBase applications on a separate cluster from the Hadoop cluster, because a large batch analytics job often interferes with the operational SLAs of HBase applications. With MapR-DB, however, customers can run both types of workloads on a single cluster.
    • Job placement control to meet SLAs
      • Label-based scheduling provides job placement control on a multi-tenant Hadoop cluster. Using label-based scheduling, you can control exactly which nodes are chosen to run jobs submitted by different users and groups (a toy sketch follows this list).
      • Label-based scheduling is supported for classic MapReduce (JobTracker, TaskTracker) as well as for YARN.
      • Label-based scheduling supports Fair Scheduler (with preemption) and Capacity Scheduler.
    • Better control of scheduling jobs in YARN
      • MapR provides the ability to use RAM, CPU, and disk when making resource calculations to help place jobs on nodes.
    • The ExpressLane feature allows small jobs to get ahead when the cluster is extremely busy.
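The following toy Python sketch (not MapR’s actual scheduler) illustrates the combination of label-based placement and multi-resource fit described above: nodes advertise labels plus free memory, CPU, and disk, and a job is placed only on a node that carries the requested label and has headroom in all three resources. Node names, labels, and numbers are made up for illustration.

    # Toy illustration only (not MapR's scheduler). Nodes carry labels and
    # free resources; the job asks for a label plus memory, CPU, and disk.
    nodes = {
        "node1": {"labels": {"production", "ssd"}, "mem_gb": 64, "vcores": 16, "disk_gb": 800},
        "node2": {"labels": {"production"},        "mem_gb": 16, "vcores": 32, "disk_gb": 200},
        "node3": {"labels": {"dev"},               "mem_gb": 96, "vcores": 24, "disk_gb": 900},
    }
    job = {"label": "production", "mem_gb": 12, "vcores": 4, "disk_gb": 300}

    def eligible(node, job):
        # The node must carry the requested label and satisfy every resource ask.
        needs = {k: v for k, v in job.items() if k != "label"}
        return job["label"] in node["labels"] and all(node[r] >= v for r, v in needs.items())

    def headroom(node, job):
        # Smallest remaining fraction across resources; higher means a roomier fit.
        needs = {k: v for k, v in job.items() if k != "label"}
        return min((node[r] - v) / node[r] for r, v in needs.items())

    candidates = [n for n, info in nodes.items() if eligible(info, job)]
    best = max(candidates, key=lambda n: headroom(nodes[n], job))
    print("placing job on", best)  # node1: node3 lacks the label, node2 lacks the disk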
  • Quotas
    • Storage quotas are available by volume, user and group.
    • CPU and memory quotas are available by queue/user/group.
  • Bringing in existing enterprise applications without rewriting them to HDFS-specific APIs or the YARN framework. MapR supports the YARN and HDFS APIs, but it does not mandate them as the only way to interact with the cluster. This allows you to bring in your existing applications and have them access cluster data simply by using NFS (see the sketch below). Groups are allowed to mount only the volumes they are permitted to access.
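For example, an existing Python (or any POSIX) application can read and write cluster data through the NFS mount with ordinary file I/O, without touching the HDFS API. The mount point /mapr/my.cluster.com and the volume path below are placeholder assumptions for this sketch.

    import csv

    # Placeholder paths: the NFS mount point and volume layout depend on your cluster.
    VOLUME = "/mapr/my.cluster.com/retail/analytics"

    # Write a file into the cluster exactly as if it were local disk.
    with open(f"{VOLUME}/daily_totals.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["store_id", "total"])
        writer.writerow(["042", "18250.75"])

    # Read it back with the same standard library calls.
    with open(f"{VOLUME}/daily_totals.csv") as f:
        for row in csv.reader(f):
            print(row)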
  • Cluster high availability and scalability – with multiple tenants, downtime is difficult to schedule and best avoided altogether. MapR provides no single point of failure, rolling upgrades, and the ability to scale to thousands of nodes with hundreds of billions of files in a single cluster.
  • Ensuring a multi-tenant cluster that can keep up with the speed of business and open-source innovation
    • MapR provides monthly updates to open source packages, so you can use the latest cutting edge open source offerings.
    • MapR uniquely supports multiple versions of different packages, enabling backward compatibility. This allows different tenants to pick the version suited for them.
    • With MapR, you can upgrade core Hadoop packages without upgrading ecosystem packages, or you can upgrade ecosystem packages without upgrading the core. This also means that you can perform rolling upgrades to your Hadoop platform without requiring downtime.

In summary, multi-tenancy on Hadoop cannot be bolted on; it has to be built into the foundation. These capabilities allow you to run a multi-tenant, multi-service cluster as a shared services infrastructure that can be a foundation of your competitive advantage.

Be sure to check out the complete top 10 list here.

