Making the Most of Big Data
To seize the invaluable opportunities of Big Data – and meet the IT challenges it presents – companies like yours need to stay on the cutting edge of database technology and support mission-critical business intelligence, analytics, and data warehousing.
With software from SAP and MapR, you can implement a column-store database and data platform that offer advanced analytics with superior scale-out. Using the multiplex grid option of SAP® IQ database software and the MapR data platform to store your data files, you can realize significant performance gains compared to traditional file systems.
Pairing SAP IQ with the underlying storage technology from MapR enables SAP IQ server clusters to exhibit near-linear scalability in storage input/output throughput as more SAP IQ server nodes are added to the cluster. Working together, SAP and MapR conducted tests to demonstrate the high-speed performance of the software. Details of the test results and system architecture configurations are presented throughout this paper.
High-speed performance and powerful analytics
SAP IQ provides excellent data compression, fully parallel data loading, fast ad hoc queries, a rich dialect of structured query language (SQL), built-in full text search, a wide variety of database access protocols, and an extensibility framework for user-defined functions. Sophisticated indexing technology and a powerful optimizer distribute queries across a multiplex grid for massively parallel operation (see the figure below).
The MapR data platform includes a file system called MapR-FS, a more powerful alternative to the Apache Hadoop distributed file system (HDFS). Unlike Apache HDFS, which is a layer that runs on top of other file systems, MapR-FS is a true distributed file system that manages direct disk access for Apache Hadoop and other software with demanding input/output requirements.
MapR complies with the portable operating system interface (POSIX) standard and provides industry-standard network file system (NFS) access. With MapR you can perform random reads and writes and simultaneously read and write to a file. You get automatic and transparent data compression and integrated multitenant functionality. You can stream data directly to Apache Hadoop clusters and use thousands of existing tools and applications. MapR works well with non-Java programming languages and eliminates the need for most proprietary or specialized Hadoop connectors.
Comparing MapR functionality to the competition
The MapR data platform and its included file system, MapR-FS, are compatible with Apache HDFS and serve the same role in Hadoop while providing substantial additional functionality. With the MapR data platform and MapR-FS, you get:
- Full random read/write access
- True NFS access
- Consistent snapshots
- Data placement control
- Enterprise-grade high availability and disaster recovery
With MapR-FS, as with a standard file system, you can read and write data to any part of an existing file. Apache HDFS, originally designed to index Web pages, allows only appending to existing files, not rewriting data in place.
You can mount MapR-FS using NFS for fast read/write access to data, and the software is scalable to handle extremely large volumes of data. Apache HDFS, on the other hand, requires a staging area for loading data, which limits its usefulness as a large-scale platform.
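To make the random read/write point concrete, here is a minimal Python sketch of in-place access through ordinary POSIX file calls. It assumes nothing about MapR itself: a temporary local file stands in for a file on an NFS-mounted volume, and the helper names (`overwrite_in_place`, `read_at`, `demo`) are hypothetical.

```python
import os
import tempfile


def overwrite_in_place(path: str, offset: int, payload: bytes) -> None:
    """Overwrite bytes at an arbitrary offset without rewriting the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)


def read_at(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at an arbitrary offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo() -> bytes:
    # Assumption: a local temp file stands in for a file on an NFS mount.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"A" * 1024)
        path = tmp.name
    try:
        overwrite_in_place(path, 512, b"HELLO")  # write into the middle
        return read_at(path, 510, 9)             # read across the change
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

On a file system that supports only appends, the `overwrite_in_place` step would not be possible; the same calls work unchanged on any POSIX-style mount.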
MapR-FS helps protect your data with consistent snapshots that instantly capture an exact view of data at a specific point in time, so you can recover data accidentally deleted or corrupted by user or application error. Apache HDFS snapshots are not consistent, and they often include data written to open files well after the time the snapshot was actually taken.
You can implement policies such as quotas, security, and disaster recovery configurations because the MapR-FS file system supports logical partitioning of the distributed disk space with volumes. Apache HDFS has no such partitioning construct for enabling granular policies.
Data placement control functionality in MapR-FS lets you isolate separate data sets by putting data on specific servers in the cluster. Apache HDFS does not allow you to specify data placement in this manner.
MapR-FS automatically supports high availability with no manual configuration required. Disaster recovery functionality includes scheduled mirroring that sends block-level differentials to a remote replica site. Getting high availability with Apache HDFS, however, requires a specific, complex configuration that is prone to error and failure, and there is no true remote replication or mirroring capability. See the figure below.
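The idea of sending only block-level differentials to a replica can be sketched in a few lines of Python. This is a conceptual illustration under simple assumptions (fixed-size blocks compared by SHA-256 digest), not a description of how MapR mirroring is implemented; all function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def block_digests(data: bytes, block_size: int) -> List[bytes]:
    """Digest each fixed-size block of the data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int) -> Dict[int, bytes]:
    """Return only the blocks of `new` that differ from `old`."""
    old_digests = block_digests(old, block_size)
    diff = {}
    for idx, digest in enumerate(block_digests(new, block_size)):
        if idx >= len(old_digests) or old_digests[idx] != digest:
            diff[idx] = new[idx * block_size:(idx + 1) * block_size]
    return diff


def apply_diff(replica: bytes, diff: Dict[int, bytes],
               block_size: int, new_length: int) -> bytes:
    """Bring a replica up to date using only the changed blocks."""
    buf = bytearray(replica[:new_length].ljust(new_length, b"\0"))
    for idx, block in diff.items():
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


if __name__ == "__main__":
    old = b"a" * 32
    new = old[:8] + b"X" * 8 + old[16:] + b"tail"
    diff = changed_blocks(old, new, block_size=8)
    print(sorted(diff), apply_diff(old, diff, 8, len(new)) == new)
```

Only the modified and newly added blocks travel to the replica, which is why differential mirroring costs far less than copying the full data set each time.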
Test parameters and architecture configuration
Working together, SAP and MapR conducted tests to demonstrate the performance of SAP IQ running on the MapR data platform. The test system configuration is shown in the figure.
For the test, MapR-FS was set up on eight hosts, each with the following configuration:
- Forty-eight HGST 4TB 7200 RPM SAS hard disk drives
- 4U SuperMicro server
- Two Intel Xeon Ivy Bridge E5-2603V2 1.8GHz CPUs, each with 10 MB cache and four cores (eight cores total)
- Sixteen 16 GB DDR3-1600 RAM modules (256 GB total)
- One Mellanox 40GbE two-port adapter
A multiplex cluster using SAP IQ was set up on four hosts with the same configuration.
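The per-host figures above can be rolled up into tier totals with simple arithmetic. The sketch below uses only the numbers listed in the configuration; the `HostSpec` class and `cluster_totals` function are hypothetical names, and the storage figure is raw capacity before any replication or file system overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSpec:
    """Per-host hardware taken from the configuration list."""
    drives: int = 48
    drive_tb: int = 4
    cpus: int = 2
    cores_per_cpu: int = 4
    dimms: int = 16
    dimm_gb: int = 16

    def raw_storage_tb(self) -> int:
        return self.drives * self.drive_tb

    def cores(self) -> int:
        return self.cpus * self.cores_per_cpu

    def ram_gb(self) -> int:
        return self.dimms * self.dimm_gb


def cluster_totals(host: HostSpec, hosts: int) -> dict:
    """Multiply per-host resources by the number of hosts in a tier."""
    return {
        "raw_storage_tb": host.raw_storage_tb() * hosts,
        "cores": host.cores() * hosts,
        "ram_gb": host.ram_gb() * hosts,
    }


storage_tier = cluster_totals(HostSpec(), hosts=8)   # MapR-FS hosts
database_tier = cluster_totals(HostSpec(), hosts=4)  # SAP IQ multiplex hosts

if __name__ == "__main__":
    print(storage_tier)
    print(database_tier)
```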
Test results show scalability
The performance tests incorporated a variety of random and sequential block access writes and reads, executed from the database servers for SAP IQ to data files residing on the MapR-FS file system. Tests were run on a three- and four-node multiplex grid to measure scalability.
The first scale-out performance test was done with a combined workload of 80% write and 20% read activity, because this workload is similar to a typical application profile for SAP IQ. The test used a block size of 256K, as SAP IQ executes input/output operations with large block sizes.
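For readers who want a feel for this workload shape, the following Python sketch issues random 256K block operations against a file with an 80/20 write/read mix. It is not the test harness SAP and MapR used, and the function names are hypothetical. Because it runs against a small local temporary file and goes through the operating system page cache, its throughput number says nothing about real storage performance.

```python
import os
import random
import tempfile
import time

BLOCK_SIZE = 256 * 1024  # 256K, the block size cited in the text


def run_mixed_workload(path, file_blocks, operations,
                       write_fraction=0.8, seed=0):
    """Issue random block-aligned writes and reads in the given ratio."""
    rng = random.Random(seed)
    payload = os.urandom(BLOCK_SIZE)
    writes = reads = 0
    start = time.perf_counter()
    with open(path, "r+b") as f:
        for _ in range(operations):
            f.seek(rng.randrange(file_blocks) * BLOCK_SIZE)
            if rng.random() < write_fraction:
                f.write(payload)
                writes += 1
            else:
                f.read(BLOCK_SIZE)
                reads += 1
        f.flush()
        os.fsync(f.fileno())  # push buffered writes to the device
    elapsed = max(time.perf_counter() - start, 1e-9)
    megabytes = operations * BLOCK_SIZE / (1024 * 1024)
    return {"writes": writes, "reads": reads,
            "mb_per_sec": megabytes / elapsed}


def demo(file_blocks=16, operations=200):
    # Assumption: a small local temp file stands in for a database file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (file_blocks * BLOCK_SIZE))
        path = tmp.name
    try:
        return run_mixed_workload(path, file_blocks, operations)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

To approximate a 100% read test, the same function can be called with `write_fraction=0.0`.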
The three-node multiplex averaged 551 MB per second (MB/sec) for each database server for SAP IQ. With an additional server node, each host still sustained more than 500 MB/sec (see the figure on the next page). This is a significant improvement over observed performance with a Fibre Channel array, in which the maximum throughput is split across hosts.
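The contrast between the two models can be worked through numerically. The sketch below uses the per-host figures quoted above for the scale-out case. For the shared-array case it assumes, purely for illustration, an array whose total throughput is capped at the three-node aggregate; that ceiling is a hypothetical value, not a measurement from the tests.

```python
def aggregate_scale_out(per_host_mb_s: float, hosts: int) -> float:
    """Scale-out model: each host keeps its own throughput."""
    return per_host_mb_s * hosts


def per_host_shared(array_ceiling_mb_s: float, hosts: int) -> float:
    """Shared-array model: a fixed ceiling is split across hosts."""
    return array_ceiling_mb_s / hosts


# Figures from the text: 551 MB/sec per host at three nodes,
# more than 500 MB/sec per host at four nodes.
three_node_aggregate = aggregate_scale_out(551, 3)
four_node_floor = aggregate_scale_out(500, 4)  # lower bound only

# Hypothetical: an array that saturates at the three-node aggregate.
HYPOTHETICAL_ARRAY_CEILING = three_node_aggregate
shared_per_host_at_four = per_host_shared(HYPOTHETICAL_ARRAY_CEILING, 4)

if __name__ == "__main__":
    print(three_node_aggregate)     # aggregate MB/sec at three nodes
    print(four_node_floor)          # minimum aggregate MB/sec at four nodes
    print(shared_per_host_at_four)  # per-host MB/sec under the fixed ceiling
```

Under the fixed-ceiling assumption, adding a fourth host lowers each host's share, whereas in the scale-out results the aggregate keeps growing.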
Test results show read performance scales linearly
Another scale-out performance test was done with 100% read activity, as is typically seen on a data warehouse query, with a 256K block size. As with the 80% write workload, the 100% read performance scaled near-linearly when another host was added. The average read input/output for three nodes was 1,583 MB/sec per node. With another node added to the multiplex, the per-node average was still above 1,500 MB/sec, which is consistent with the 6,064 MB/sec reported for the four-node multiplex as a whole, or about 1,516 MB/sec per node (see the figure below).
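A short calculation shows how close to linear the read results are. It rests on two assumptions that the text does not state outright: that 1,583 MB/sec is a per-node average for the three-node run, and that 6,064 MB/sec is the aggregate for the four-node run. The function name is hypothetical.

```python
def scaling_efficiency(base_per_node: float, new_nodes: int,
                       new_aggregate: float) -> float:
    """Ratio of observed aggregate to perfectly linear scaling."""
    ideal_aggregate = base_per_node * new_nodes
    return new_aggregate / ideal_aggregate


THREE_NODE_PER_NODE = 1583   # MB/sec, assumed to be a per-node average
FOUR_NODE_AGGREGATE = 6064   # MB/sec, assumed to be the four-node total

three_node_aggregate = THREE_NODE_PER_NODE * 3
four_node_per_node = FOUR_NODE_AGGREGATE / 4
efficiency = scaling_efficiency(THREE_NODE_PER_NODE, 4, FOUR_NODE_AGGREGATE)

if __name__ == "__main__":
    print(three_node_aggregate)   # aggregate MB/sec at three nodes
    print(four_node_per_node)     # per-node MB/sec at four nodes
    print(round(efficiency, 3))   # 1.0 would be perfectly linear
```

Under those assumptions, the fourth node delivers roughly 96% of what perfectly linear scaling would predict.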
Your IT and business benefits
SAP IQ and the MapR data platform (along with the included MapR-FS file system) deliver a flexible, high-powered solution that can grow along with your business.
With SAP IQ, you can perform concurrent queries against large amounts of data and turn intelligence into insights to support better decision making across the enterprise. The software’s open architecture, application services layer, excellent performance characteristics, and low administrative overhead give it the flexibility, efficiency, and power to meet your needs.
MapR gives you Big Data scale-out capacity on commodity hardware for a database cluster (using the multiplex grid option of SAP IQ). At the same time, you can gain significant cost savings over traditional network attached storage or storage area network systems. What’s more, the MapR data platform has built-in high availability, data protection, and disaster recovery capabilities, making it an ideal industrial-strength storage solution.
Software from SAP and MapR gives you a cutting-edge analytics solution. When you use a multiplex grid based on the SAP® IQ software running on top of a MapR data platform to store your data files, you will see near-linear scalability in storage input/output throughput as more server nodes are added for SAP IQ. The upshot is high performance and cost savings compared to typical file systems – made clear by the performance test results in this paper.
- Rise to the opportunities and challenges of Big Data
- Support mission-critical business intelligence, analytics, and data warehousing
- Enable high-performance input/output throughput and scalability
- Disk-backed, column-store database with superior functionality for data compression, fast data loading, and ad hoc queries with SAP IQ
- Industry-standard distributed file system from MapR
To find out more, call your SAP representative today or visit us online at http://scn.sap.com/community/developer-center/analytic-server.