Big Data is emerging as an important tool to help organizations learn more about their business operations, product performance, and customer purchasing behavior. It is often misunderstood by the media, however, which makes it difficult for organizations to determine whether investing in this tool will bring results: improved efficiency, better products and services, or a better understanding of customer requirements.
This paper, which is based upon Kusnetzky Group Analyst's research, insight and opinion, is designed to examine the following topics: the promise of Big Data, the challenges of Big Data and a review of MapR Technologies' M7 distribution for Hadoop.
The Promise of Big Data
In the simplest terms, “Big Data” refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage extremely large data sets. Does this mean terabytes, petabytes or even larger collections of data? The answer offered by suppliers of these tools is "yes."
Sometimes "Big Data" is described using the "3 Vs." They are Volume, Variety and Velocity.
In terms of volume, Big Data tools and services are designed to manage extremely large and growing sources of data that require capabilities beyond those found in traditional database engines.
Big Data tools manage an extensive variety of data as well. This means having the capability to manage structured data, very much like the capabilities offered by a database engine. They go beyond supporting structured data to working with non-structured data, such as documents, spreadsheets, presentation decks and the like, as well as log data coming from operating systems, database engines, application frameworks, retail point of sale systems, mobile communications systems and more.
One of the key features that distinguishes Big Data tools from more traditional relational database engines is the ability to gather, analyze and report on rapidly changing sets of data. In some cases, this means having the capability to manage data that changes so rapidly that the updated data cannot be saved to traditional disk drives before it is changed again.
Riding the Technology Shift
Many of the previous attempts to address the need to gather useful information from the rapidly growing, rapidly changing and broad types of data have been based upon the use of special-purpose, complex and highly expensive computing systems. Today's Big Data solutions are built upon a different foundation.
Rather than relying on a very powerful, dedicated database machine, today's solutions harness clusters of inexpensive, powerful, industry-standard (x86) systems to attack these very same problems.
The clustered approach uses commodity systems, storage, and memory. It can also be more reliable, because the failure of any single system in the cluster need not stop processing.
How Does Big Data Differ from Traditional Transactional Systems?
Traditional transactional systems are designed and implemented to track information whose format and use are known ahead of time. Big Data systems are deployed when the questions to be asked and the data formats to be examined aren't known ahead of time.
The goal of Big Data systems is to allow analysts and decision-makers to sift through massive amounts of data to learn something new rather than tracking known outputs of operational systems. These systems also make it possible to gather data from new sources, including social media and unstructured data sources such as documents, presentations and spreadsheets.
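The difference described above can be illustrated with a small sketch. It is not taken from any particular Big Data product. It simply assumes that raw events arrive as lines of JSON text whose fields are not agreed upon in advance, and it applies structure only when a question is asked. The sample records and the function names (read_events, total_sales_by_store, fields_seen) are hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw events; the fields differ from record to record
# and one line is not valid JSON at all.
RAW_EVENTS = [
    '{"type": "sale", "store": "A", "amount": 19.99}',
    '{"type": "tweet", "user": "sam", "text": "love the new model"}',
    '{"type": "sale", "store": "B", "amount": 5.00, "coupon": "SPRING"}',
    'not valid json at all',
]


def read_events(raw_lines):
    """Parse each line when it is read; skip lines that cannot be parsed."""
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed input instead of rejecting the load
        if isinstance(event, dict):
            yield event


def total_sales_by_store(raw_lines):
    """Answer one question; only the fields this question needs are touched."""
    totals = {}
    for event in read_events(raw_lines):
        if event.get("type") == "sale":
            store = event.get("store", "unknown")
            totals[store] = totals.get(store, 0.0) + float(event.get("amount", 0.0))
    return totals


def fields_seen(raw_lines):
    """Discover which fields exist, since the format was not fixed up front."""
    seen = set()
    for event in read_events(raw_lines):
        seen.update(event.keys())
    return seen


if __name__ == "__main__":
    print(total_sales_by_store(RAW_EVENTS))
    print(sorted(fields_seen(RAW_EVENTS)))
```

A traditional transactional system would typically reject the tweet record and the malformed line when the data is loaded, because neither fits a predefined table. In this sketch they are kept or skipped at read time, and a new question can be asked later without reloading the data.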
New Tools Make Big Data Much Easier to Use
Several open source communities have developed and now offer projects that make working with Big Data much simpler.
Many Big Data distributions include tools from the Apache Software Foundation, including the following:
- Hadoop — Distributed processing framework designed to harness together the power of many computers, each having its own processing and storage, and provide the capability to quickly process large, distributed data sets.
- Hadoop Distributed File System (HDFS) — A distributed file system designed to support large data sets made up of rapidly changing structured and non-structured data.
- MapReduce — A programming model and processing engine designed to allow analysts and developers to rapidly sift through massive amounts of data in parallel, examining only those data items that match a specified set of criteria.
- HBase — A distributed database that makes it possible to deal with HDFS data as if it were a structured set of very large tables. It presents the data as large, column-oriented tables and supports NoSQL database solutions. This is often seen as an alternative to the use of MapReduce. In the past, HBase has offered inconsistent performance due to its distributed design and has had some availability and reliability issues as well.
- Other projects including Hive, Mahout, Pig and ZooKeeper.
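To make the MapReduce idea more concrete, the following is a minimal, single-process sketch of the map, shuffle and reduce steps. It does not use Hadoop or any of its APIs. The function names, the sample log lines and the severity filter are all assumptions made for illustration. In a real cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the values of each key into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def severity_mapper(line):
    """Hypothetical mapper: emit a pair only for lines matching the criteria."""
    severity = line.split(" ", 1)[0]
    if severity in {"ERROR", "WARN"}:
        return [(severity, 1)]
    return []  # lines that do not match are dropped


def count_reducer(key, values):
    """Hypothetical reducer: count the matching lines for one severity."""
    return sum(values)


def run_job(records):
    """Chain the three steps in one process."""
    pairs = map_phase(records, severity_mapper)
    return reduce_phase(shuffle_phase(pairs), count_reducer)


if __name__ == "__main__":
    sample_log = [
        "INFO service started",
        "ERROR disk full",
        "WARN latency high",
        "ERROR timeout",
    ]
    print(run_job(sample_log))
```

The mapper is where the "specified set of criteria" is applied: records that do not match emit nothing and never reach the reducer. Because each record is mapped independently, the work can be spread across the systems in a cluster.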
It is clear that a Big Data solution has many moving parts, each of which must be properly installed, configured and optimized for the organization's application.
Gathering, installing and integrating all of these individual open source projects is beyond the capabilities of some organizations that wish to use Hadoop.
There are tools designed to help organizations create distributed computing solutions that can process large data sets using simple programming models.
These tools are only beginning to be packaged commercially. This means that training, documentation, support, examples or templates may not be available for each open source project. This makes them somewhat difficult to learn and use.
Commercial developers are rushing in to package these tools and make them easy enough for most organizations to use.
The Challenge of Big Data
The promises of Big Data are sometimes difficult to realize. Although Big Data has been in use for quite some time, the new tools have only recently appeared on the scene. They can be difficult to learn and may not be well documented. These tools often lack key features and capabilities needed to support mission critical, real time, or multi-user environments.
Organizations need a flexible platform that can support a broad range of usage patterns from batch to real-time. This platform must make it easy to address organizations' need for predictive analysis or lightweight transaction processing.
If one examines the Apache Software Foundation's Big Data projects, it becomes clear that it is often necessary to use many different tools to achieve an organization's goals. While these tools are extremely powerful, setting each of them up, integrating them and keeping them working in harmony can be challenging. The open source projects are not packaged as complete solutions. It is necessary to go directly to the open source community to obtain support when it is needed. Commercial support is beginning to appear and these issues are likely to be quickly resolved.
Another significant challenge is that the analysts and decision makers who would be best served by the use of these tools aren't aware of them.
While these analysts and decision-makers may be very familiar with the use of tools such as spreadsheets, SPSS or SAS, they are not technologists and would find installing, integrating and operating the open source software difficult. They also might find speaking the language of IT developers difficult.
IT analysts, on the other hand, might be very familiar with working with traditional relational database engines and development methodologies. They may find working with open source developers of highly distributed, Big Data clusters confusing as well.
MapR Technologies has gone to the effort of putting all of the Big Data tools together into a well-integrated, easy-to-use package that targets enterprise-class projects. The company understands that many organizations want the ability to gather, analyze and report on huge amounts of rapidly changing data even if they don't have the staff expertise to take on an integration project.
MapR Technologies has been involved with Apache Hadoop and Big Data since its inception. The company has addressed itself to removing the challenges organizations face when trying to use Big Data to learn more about themselves, their customers and their markets. The company has been working to move the state of the art forward and improve the Apache Hadoop family of open source projects.
The company has successfully helped organizations of all sizes and in many markets adopt Hadoop and put it to work solving their problems. MapR Technologies has worked to add "enterprise-grade" features to Hadoop and make the technology ready to take on critical tasks. The company has done this by:
- Making Hadoop and related projects easier to use.
- Increasing the performance of Hadoop.
- Making the Hadoop Distributed File System more scalable, more extensible and faster.
- Adding disaster recovery and disaster avoidance features to Hadoop to make it more reliable.
- Simplifying the integration process.
- Enhancing HBase in a number of ways:
  - Increasing both its performance and reliability while maintaining compatibility with existing HBase applications.
  - Removing the requirement for Region Servers. This simplifies HBase applications and deployments considerably.
  - Eliminating garbage collection and compactions to assure high levels of performance.
  - Adding common data management by processing tables and files together in the same data layer and with a common namespace.
  - Simplifying development by using a single path for both files and tables.
  - Making HBase more dependable by enhancing data protection.
  - Enabling snapshots of files and tables and providing an "Instant Backup" capability.
Rapid changes in both the regulatory environment and competitive market are forcing organizations to know more about their internal operations and learn something new from every customer and partner engagement. Big Data is a vital response to these pressures. It is more than a passing fad.
Big Data tools, such as the M7 Edition from MapR Technologies, can help organizations grow and thrive in a highly competitive and rapidly changing world.
It is clear that MapR Technologies is creating a single platform for all Big Data needs and is moving the use of Big Data from a complex computer science project to a systematic and reliable tool. MapR Technologies has made this technology ready for developers in labs, research and production environments.
For more information, please visit MapR Technologies.