Cisco and MapR Deliver Performance and Multi-tenancy to Help Tame Big Data
Big data provides an enormous wealth of information to your organization. But to gain the most benefit, you need to manage it efficiently. And you must make sure that all this data is separated and isolated so that each set of users can see and work on only the data that they are authorized to use.
Challenges of Multi-tenancy for Big Data
Organizations seek to share IT resources cost efficiently and securely among multiple applications, data, and user groups. Platforms that support this architecture are commonly known as multitenant technologies.
Multi-tenancy is the capability of a single instance of software to serve multiple tenants. A tenant is a group of users that have the same view of the system. Hadoop is an enterprise data hub, and it demands multi-tenancy. Big data platforms are increasingly expected to support multi-tenancy by default. Multi-tenancy requires isolating the tenants from one another in both the data stored on the platform and the computing resources used to process it.
To support multi-tenancy, solutions need to:
- Help ensure that service-level agreements (SLAs) are met
- Help guarantee data and compute isolation
- Enforce quotas
- Establish security and delegation
- Help ensure low-cost operations and simpler manageability
The Solution: Cisco UCS Integrated Infrastructure for Big Data with MapR
The Cisco UCS® Integrated Infrastructure for Big Data solution includes computing, storage, connectivity, and unified management capabilities to help companies manage the dramatically increasing data that they must cope with today. It is built on Cisco Unified Computing System™ (Cisco UCS) infrastructure using Cisco UCS 6200 Series Fabric Interconnects, (optional) Cisco Nexus® 2200 platform fabric extenders, and Cisco UCS C-Series Rack Servers. Installed in pairs, the fabric interconnects offer redundant, active-active connectivity and embedded management using Cisco UCS Manager.
MapR is a complete distribution for Apache Hadoop that packages more than a dozen projects from the Hadoop ecosystem to provide you with a broad set of big data capabilities. The MapR platform provides enterprise-class features such as high availability, disaster recovery, security, and full data protection. It also allows Hadoop to be easily accessed as traditional network-attached storage (NAS) with read-write capabilities and multi-tenancy.
The MapR Distribution offers multi-tenancy from the start. It provides powerful features to logically partition a physical cluster to provide separate administrative control, data placement, job processing, user quotas, and network access. Volumes, a unique feature in MapR, are the foundation of multi-tenancy. Volumes provide a way to organize data and apply different policies to different data sets, applications, and users and groups. A single cluster can have many volumes: up to hundreds of thousands.
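As a rough illustration of how a volume can act as the unit of policy, the sketch below models volumes as named subtrees of the cluster namespace and resolves a file path to the volume that governs it. This is a conceptual model only, not MapR code: the `Volume` class, its fields, the tenant names, and the `volume_for_path` helper are all hypothetical and were chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Volume:
    """Hypothetical stand-in for a volume: a named subtree with its own policy."""
    name: str
    mount_path: str        # where the volume appears in the cluster namespace
    owner_group: str       # tenant that administers the volume (assumed field)
    replication: int = 3   # illustrative per-volume policy value


def volume_for_path(volumes: List[Volume], path: str) -> Optional[Volume]:
    """Return the volume whose mount path is the longest prefix of `path`."""
    best: Optional[Volume] = None
    best_len = -1
    for vol in volumes:
        mount = vol.mount_path.rstrip("/")
        # Match whole path components only, so /tenants/fin does not
        # accidentally claim /tenants/finance.
        if path == mount or path.startswith(mount + "/"):
            if len(mount) > best_len:
                best, best_len = vol, len(mount)
    return best


# Illustrative tenants; the names and paths are made up.
volumes = [
    Volume("finance", "/tenants/finance", "finance-admins"),
    Volume("marketing", "/tenants/marketing", "marketing-admins", replication=2),
]

governing = volume_for_path(volumes, "/tenants/marketing/campaigns/2015.csv")
print(governing.name if governing else "no volume")
```

Because every path resolves to exactly one volume in this model, a policy attached to the volume applies to all of a tenant's data without being repeated on each file or directory.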
Together, Cisco and MapR provide enterprises with transparent, simplified data integration and management integration with an enterprise application ecosystem. The two work together to provide a uniquely capable, industry-leading architectural platform for Hadoop-based applications.
Cisco UCS Solution for MapR
The Cisco UCS solution for MapR is based on Cisco UCS Integrated Infrastructure for Big Data, a highly scalable architecture that includes computing, storage, connectivity, and unified management capabilities and is designed to meet a variety of scale-out application demands. It achieves this with transparent data integration and management integration capabilities built using the components described here, shown in Figure 1.
Cisco UCS 6200 Series Fabric Interconnects
Fabric interconnects establish a single point of connectivity and management for the entire system. They provide high-bandwidth, low-latency connectivity for servers, with integrated, unified management for all connected devices provided by Cisco UCS Manager. Deployed in redundant pairs, the interconnects offer the full active-active redundancy, performance, and exceptional scalability needed to support the large number of nodes that are typical in clusters serving big data applications. Cisco UCS Manager enables rapid and consistent server configuration using service profiles, automating ongoing system maintenance activities such as firmware updates across the entire cluster as a single operation. It also offers advanced monitoring with options to raise alarms and send notifications about the health of the entire cluster.
Cisco UCS C240 M4 Rack Server
The rack server supports a wide range of computing, I/O, and storage-capacity demands in a compact design. The server is based on the Intel® Xeon® E5 v3 family of processors and supports 12-Gbps SAS throughput, delivering significant performance and efficiency gains over the previous generation of servers. The server uses dual Intel Xeon processor E5-2600 v3 series CPUs and supports up to 768 GB of main memory (128 or 256 GB is typical for big data applications) and a range of disk drive and SSD options. Twenty-four small-form-factor (SFF) disk drives are supported in the performance-optimized option, and 12 large-form-factor (LFF) disk drives are supported in the capacity-optimized option, along with two 1 Gigabit Ethernet embedded LAN-on-motherboard (LOM) ports. The Cisco UCS Virtual Interface Card (VIC) 1227 is designed for the M4 generation of Cisco UCS C-Series Rack Servers. The VIC is optimized for high-bandwidth and low-latency cluster connectivity, with support for up to 256 virtual devices that are configured on demand through Cisco UCS Manager.
MapR Distribution Including Apache Hadoop: Complete Hadoop Platform
As one of the technology leaders in Hadoop, MapR provides an enterprise-class Hadoop solution that can be quickly developed and easily administered. With significant investment in critical technologies, MapR offers a comprehensive Hadoop platform fully optimized for performance and scalability. The MapR Distribution includes over 20 tested and validated Hadoop software modules on an advanced data platform, offering exceptional ease of use, reliability, and performance for Hadoop deployments (see Figure 2).
The benefits of the MapR Distribution include:
- Performance: Ultra-fast throughput
- Scalability: Up to a trillion files, with no restrictions on the number of nodes in a cluster
- Standards-based APIs and tools: Standard Hadoop APIs, along with Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), Lightweight Directory Access Protocol (LDAP), and Linux Pluggable Authentication Modules (PAM)
- MapR Direct Access Network File System (NFS): High-speed random read-write operations, real-time data flows, and transparent support for existing non-Java applications
- Manageability: Advanced management console, rolling upgrades, and support for Representational State Transfer (REST) API
- Integrated security: Kerberos and non-Kerberos options with wire-level encryption
- Advanced multi-tenancy: Volumes, data placement control, job placement control, and queues
- Consistent snapshots: Full data protection with point-in-time recovery
- High availability: Ubiquitous high availability with no-NameNode architecture, YARN high availability, and NFS high availability
- Disaster recovery: Cross-site replication with mirroring
- MapR-DB: Integrated enterprise-class NoSQL database
Main Benefits of Multi-tenancy in MapR with UCS
Volumes (unique to MapR) form the foundation of multi-tenancy as offered by MapR.
In a typical deployment, the data for each user, group, application, or business unit is placed in a single volume so that it can be managed separately from the data of other users, groups, applications, and business units.
Other Hadoop distributions do not support volumes, so policies can be defined only at the file or directory level (too detailed) or at the cluster level (not detailed enough). As a workaround, organizations using other Hadoop distributions create separate physical clusters for each tenant, which adds architectural complexity and thus a higher risk of errors and failure. Multi-tenancy in MapR also has significant total cost of ownership (TCO) advantages. It allows organizations to use a single cluster for multiple use cases rather than having to maintain a large number of isolated clusters. This approach reduces overall administrative overhead. It also enables the higher efficiency of a common resource pool.
Here are some of the unique features of multi-tenancy in Cisco UCS Integrated Infrastructure for Big Data with MapR:
- Data placement control: MapR provides the ability to restrict a volume to a subset of a cluster’s nodes. This feature allows administrators to isolate sensitive data and applications and to use heterogeneous hardware. For example, data placement control can be used to keep specific data on separate nodes with different configurations, or to keep Apache Spark data on nodes that have SSDs. It can also be used for more advanced storage tiering policies, such as keeping old data on nodes that have a higher storage capacity and less computing power (such as Cisco UCS C3160 servers), and hence a lower cost per terabyte (TB) of storage. In combination with the MapR warden pluggable services, data placement control also enables administrators to designate specific nodes for a given application or service, such as Spark, effectively creating a minicluster within the larger cluster to help guarantee SLAs and resource availability.
- Job placement control: MapR provides the ability to restrict a specific job or jobs from a specific user or group to a subset of the nodes in the cluster. This feature enables administrators to help guarantee SLAs for specific applications and to create separation between different applications or business units. This feature also allows administrators to designate a small subset of the nodes for low-priority jobs or jobs that require access to external systems through the corporate firewall.
- Access control and security: MapR provides fine-grained, role-based access controls (RBAC) with access control expressions (ACEs) for tables, column families, and columns in MapR-DB; Unix permissions for files; and field-level access control via Apache Drill views. MapR also provides cryptographically secure wire-level authentication and encryption. Organizations that have a Kerberos infrastructure can use it for authentication. Organizations that do not have a Kerberos infrastructure can use an integrated and simpler scheme that provides the same security without the complexity associated with Kerberos deployment and management. This scheme uses Linux Pluggable Authentication Modules (PAM) to enable integration with any PAM-supported registry.
- Administration and reporting: MapR allows organizations to define and enforce storage, CPU, and memory quotas at the volume, user, and group levels. To help service providers provide accurate usage and billing information, MapR offers resource usage reports encompassing more than 60 different metrics. These metrics are available through the MapR Control System (MCS) browser-based user interface and, for upstream integration, through the command-line interface (CLI) and the REST API.
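The data placement and quota features in the list above can be pictured with a small model. The following is a minimal sketch, not MapR's implementation: nodes carry labels, a volume policy names the labels a node must have to hold that volume's data, and a write is rejected once it would push the volume past its storage quota. The `Node` and `VolumePolicy` classes, the label names, and the quota figures are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Node:
    name: str
    labels: Set[str]            # hypothetical hardware or role tags


@dataclass
class VolumePolicy:
    name: str
    required_labels: Set[str]   # a node must carry all of these labels
    quota_gb: float             # storage quota for the volume
    used_gb: float = 0.0


def eligible_nodes(nodes: List[Node], policy: VolumePolicy) -> List[str]:
    """Names of the nodes allowed to store data for this volume."""
    return [n.name for n in nodes if policy.required_labels <= n.labels]


def try_write(policy: VolumePolicy, size_gb: float) -> bool:
    """Accept the write only if the volume stays within its quota."""
    if policy.used_gb + size_gb > policy.quota_gb:
        return False
    policy.used_gb += size_gb
    return True


# Illustrative cluster: two SSD nodes and one dense-storage node.
cluster = [
    Node("node1", {"ssd", "compute"}),
    Node("node2", {"ssd", "compute"}),
    Node("node3", {"dense-storage"}),
]
spark_volume = VolumePolicy("spark-scratch", {"ssd"}, quota_gb=100.0)
archive_volume = VolumePolicy("archive", {"dense-storage"}, quota_gb=5000.0)

print(eligible_nodes(cluster, spark_volume))
print(eligible_nodes(cluster, archive_volume))
```

In this model the Spark volume is confined to the SSD nodes and the archive volume to the dense-storage node, which mirrors the tiering example in the data placement bullet. A real deployment would express the same intent through the platform's own administration tools rather than through application code.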
The current version of the Cisco UCS Integrated Infrastructure for Big Data offers the configurations listed in Table 1. The configuration used depends on the computing and storage requirements of Hadoop.
For More Information
For more information about Cisco UCS big data solutions, please visit www.cisco.com/go/bigdata_design.
For more information about Cisco UCS Integrated Infrastructure for Big Data, please visit blogs.cisco.com/datacenter/cpav3/.
For more information about MapR, please visit www.mapr.com.
For more information about the Cisco® SmartPlay program, please visit www.cisco.com/go/smartplay.
For more information on the Cisco Validated Design (CVD) for the solution, please visit www.cisco.com/c/dam/en/us/td/docs/unified_computing/ucs/UCS_CVDs/Cisco_UCS_Integrated_Infrastructure_for_Big_Data_with_MapR.pdf.