The Hadoop market is a fast-growing and exciting ecosystem, but that growth can also be accompanied by confusion. I thought I’d take a stab at addressing some of the Big Misconceptions about Big Data.

First of all, the term Big Data is approaching Cloud in its utter lack of descriptiveness. That said, Big Data is not simply about massive amounts of data (petabytes and beyond). Big Data represents a paradigm shift. It’s about new, unstructured data sources. It’s about avoiding schema definitions and transformations: there’s no need to structure data before you can derive benefits. It’s about keeping data and compute together to perform better and faster analysis. Through Hadoop, organizations can benefit even with relatively small amounts of data.

Because Hadoop has a funny name and is still somewhat new to people, they assume it must be risky. Huge amounts of investment and work have addressed these concerns. Hadoop has emerged as a standard. The rich ecosystem around Hadoop has provided a lot of flexibility, choice, and trained professionals. There are production-grade distributions available, such as MapR, that provide full data protection, automatic stateful failover, and business continuity. The deployed footprint, complementary products, and available technical resources all contribute to Hadoop adoption. And with that, the number and breadth of deployed Hadoop applications have expanded rapidly.

Another misconception about Hadoop is that it is limited to batch processing. This is an artifact of the HDFS implementation and not a limitation of Hadoop per se. MapR, for example, provides full support for streaming analytics and real-time processing.

Perhaps the biggest misconception is that Hadoop is a single, monolithic component. Hadoop is a framework: a complete stack for distributing applications and data. Hadoop supports multiple programming paradigms and includes packages such as Pig, Hive, and others. There are packages for data ingress/egress, ETL, and data integration, as well as specific components for machine learning. Most distributions integrate, test, and harden these packages along with some proprietary extensions.

With respect to open source, the question about a distribution is not a simple binary “open” or “closed”. The question is which components are open and which areas the proprietary value-added components address. In the case of Cloudera, the proprietary extensions are in the management tools. MapR has chosen to innovate in the areas that provide the most benefits to customers while also being the most difficult for the community to address effectively. These also happen to be the areas that customers have the least desire to modify, such as the underlying storage services. MapR’s distribution includes value-added improvements along with all of the open source programming, data access, and machine learning packages.

These are some of the top misconceptions. Let me know what other areas you’d like us to address.

Posted in MapR Technologies Blog | Leave a comment

After joining MapR back in 2009, I spent many months meeting with early Hadoop users and listening to their pain points. In many of these meetings, users described problems related to the HDFS architecture, and the NameNode in particular. In this blog post I want to share 10 NameNode-related issues that came up frequently in these meetings:

  1. We want HA, but the NameNode is a single point of failure. This results in downtime due to hardware failures and user errors. In addition, it is often non-trivial to recover from a NameNode failure, so our Hadoop administrators always need to be on call.
  2. We want to run Hadoop with 100% commodity hardware. To run HDFS in production and not lose all our data in the event of a power outage, HDFS requires us to deploy a commercial NAS to which the NameNode can write a copy of its edit log. In addition to the prohibitive cost of a commercial NAS, the entire cluster goes down any time the NAS is down, because the NameNode needs to hard-mount the NAS (for consistency reasons).
  3. We need both a NameNode and a Secondary NameNode. We read some documentation that suggested purchasing higher-end servers for these roles (e.g., dual power supplies). We only have 20 nodes in the cluster, so this represents a 15-20% hardware cost overhead with no real value (i.e., it doesn’t contribute to the overall capacity or throughput of the cluster).
  4. We have a significant number of files. Even though we have hundreds of nodes in the cluster, the NameNode keeps all its metadata in memory, so we are limited to a maximum of only 50-100M files in the entire cluster. While we can work around that by concatenating files into larger files, that adds tremendous complexity. (Imagine what it would be like if you had to start combining the documents on your laptop into zip files because there was a severe limit on how many files you could have.)
  5. We have a relatively small cluster, with only 10 nodes. Due to the DataNode-NameNode block report mechanism, we cannot exceed 100-200K blocks (or files) per node, thereby limiting our 10-node cluster to less than 2M files. While we can work around that by concatenating files into larger files, that adds tremendous complexity.
  6. We hired a new engineer who did not understand the architectural issues and ran a simple directory traversal (the equivalent of the find command). This created so much load on the NameNode that it simply crashed, and the entire cluster was down.
  7. We need much higher performance when creating and processing a large number of files (especially small files). Hadoop is extremely slow.
  8. We have had outages and latency spikes due to garbage collection on the NameNode. Although we are using the CMS (concurrent mark and sweep) garbage collector, the NameNode still freezes occasionally, causing the DataNodes to lose connectivity (i.e., become blacklisted).
  9. When we change permissions on a file (chmod 400 foo), the changes do not affect existing clients who have already opened the file. We have no way of knowing who the clients are. It’s impossible to know when the permission changes would really become effective, if at all.
  10. We have lost data due to various errors on the NameNode. In one case, the root partition ran out of space, and the NameNode crashed with a corrupted edit log.
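
The file-count ceilings in items 4 and 5 are simple multiplication, and a short sketch makes them easy to rerun with your own numbers. The function names below are hypothetical, the figures are the ones users quoted above, and the model assumes one block per file and ignores replication (which would only lower the ceiling).

```python
# Back-of-the-envelope limits based on items 4 and 5 above.
# Assumptions: every file occupies at least one block, replication is ignored,
# and the per-node block budget is the 100-200K range users reported.

def max_files_from_block_reports(nodes: int, blocks_per_node: int) -> int:
    """Upper bound on file count when each file needs at least one block."""
    return nodes * blocks_per_node


def nodes_needed(target_files: int, blocks_per_node: int) -> int:
    """Smallest node count whose block budget covers target_files."""
    return -(-target_files // blocks_per_node)  # ceiling division


if __name__ == "__main__":
    low = max_files_from_block_reports(10, 100_000)
    high = max_files_from_block_reports(10, 200_000)
    print(f"10-node cluster: {low:,} to {high:,} files")
    print(f"Nodes needed for 50M files: {nodes_needed(50_000_000, 200_000)}")
```

Under these assumptions a 10-node cluster tops out between 1M and 2M files, which matches the complaint in item 5.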

When we looked at this list of NameNode-related problems, it was clear to all of us that the only viable solution was to eliminate the NameNode. Our engineering team spent two years re-architecting Hadoop’s storage layer (as well as advancing Hadoop’s MapReduce layer and developing the leading management suite for Hadoop).

The end result is that we have eliminated these 10 issues and many others. In my next blog post I’ll dive deeper into our no-NameNode architecture so that you can understand how it works, and why it really eliminates the issues with NameNode-based architectures (including all planned HDFS enhancements, such as HDFS Federation and HA NameNode). In the meantime, if you’ve run into other NameNode-related problems that I haven’t listed, let me know.


Having worked in hyper-growth companies (3Com for 8 years, where we grew from a modest base to $5.5 billion, and VMware for the past 7 years), I’ve learned the keys to success. The formula for success I saw in these high-growth companies includes technology leadership, a great management team, and a solid strategy. In our industry, the formula is relatively simple. Putting that formula to work is the difficult part. To me, MapR has the keys to success and is on the hyper-growth journey we all want to be a part of.

Once the foundational formula is in place, I believe another critical element is a commitment to partnering to achieve this growth. If you are a consulting partner, MapR will work with you to position your business advisory services, POCs or pilots, and other services. If you are a technology partner, we’ll work with you to differentiate our joint offering to customers. We do that with benchmarking, interoperability statements, reference architectures, joint promotions, event marketing, and more.

In the short time I have been with MapR, I’ve been impressed by the involvement of partners with our customers and by MapR’s commitment to the channel. We want our partners to profit from services. Hadoop expertise is scarce today, and that is a big opportunity for our partners. The demand is there; we are seeing it from our customers. Our partners will profit from that demand on both the consulting and training fronts.

Joining MapR was an easy decision for me. If you are a consulting or technology partner, working with MapR should be an easy decision for you as well. I invite you to ask yourself a few simple questions: Who is the best company to partner with? Who provides the easiest to use and the most dependable product in the Big Data space? Who isn’t going to compete for training or consulting dollars in your accounts? Whose management team understands and is committed to partnering? The answer is MapR. To hear more, view my video and learn more about partnering with MapR.


Our CEO, John Schroeder, was recently interviewed in the press and asked about his predictions for Hadoop in 2012. Simply put, he sees a Big year for Big Data. It’s not just the scale of data growth. John shared his view that the ability to process and analyze Big Data is changing the game for companies in every aspect of their business.

Many enterprise IT vendors are wrapping themselves in some sort of “Big Data” cloak. The storage vendors talk about how they’ve always understood Big Data and cite the petabytes of storage that customers manage with their technology. The database and data warehouse providers make similar claims. Virtually every vendor has some sort of Big Data presentation, and by virtually every vendor I am of course including the virtualization providers. When MapR refers to Big Data, we are talking about the MapReduce framework. This is the game-changing approach popularized by Google.

We tend to take Google’s dominance for granted, but when the Google search beta debuted in 1998 it was the 19th search engine on the market. The market was already well served by Yahoo, Excite, Infoseek, AltaVista, and a host of others. Within two short years, Google was the dominant player. The reason? MapReduce enabled Google to index much more data, much more quickly, and much more cheaply than any other provider. MapReduce is a paradigm shift, a new architecture that trumps existing approaches and gives any organization the same power to change its own competitive landscape. Google published a white paper on MapReduce in 2004. A Yahoo engineer named Doug Cutting read the paper, and the result was Hadoop. We’ve seen Hadoop emerge as a robust ecosystem with innovations happening across the Hadoop stack.

This is shaping up to be a big year. John’s predictions for 2012 encompass five major developments in Big Data:
• Hadoop emerges as the safe platform choice for Big Data. The deployed footprint, complementary products, and available technical resources all reinforce the adoption of Hadoop.
• Real-time analytics take off. Analyzing streaming data, from application logs to messages, augments existing batch applications.
• Hadoop applications move from experimental to mission critical. The number and breadth of deployed Hadoop applications also expand.
• Consulting firms augment their offerings with Hadoop-specific consulting services, expanding the number of available services vendors. Organizations benefit from the large and growing pool of education and consultancy services.
• Big Data is no longer limited to companies that can ‘roll their own’ as the application ecosystem expands rapidly. In addition to the rapid expansion of Hadoop applications, 2012 sees the emergence of applications and services that leverage an underlying Hadoop engine.

We’re looking forward to a big year. We hope you join us.


Today we announced version 1.2 of the MapR Distribution for Apache Hadoop. With this release, MapR continues to push the envelope by making Hadoop more accessible to more users, more languages, and more platforms. This release includes numerous features and capabilities, including:

  • Ability to take advantage of the next-generation resource management framework: MapR users will be able to take advantage of MapReduce 2.0 once it is ready for production use. Although the community is expected to take several months to stabilize Hadoop 0.23, users will be able to combine the benefits of MapReduce 2.0, such as backward compatibility and scalability, with MapR’s unique capabilities, such as HA (no lost tasks or jobs during a JobTracker or ApplicationMaster failure) and the high-performance shuffle.
  • High-performance native access library: With Version 1.2, MapR provides a libhdfs implementation that bypasses Java altogether and provides high-performance access to the distributed file system from C/C++ applications and other compatible scripting languages. There is no need to recompile applications that use libhdfs, since the API (header file) is identical.
  • Upgrade of various packages including HBase, Hive and Pig: The HBase package in the MapR distribution has been upgraded to release 0.90.4. In addition, MapR has identified several critical stability and data corruption issues in 0.90.4, which we have addressed by backporting 15 fixes from future HBase releases. Versions of Hive and Pig have also been upgraded in the MapR distribution, so users can leverage the latest bug fixes and features available from these Apache projects.
  • MapR Virtual Machine (VM). MapR now provides a VMware virtual machine that allows users to experiment with the MapR distribution. Although this environment is not suitable for any performance or scale testing, it makes it easy to experiment with some of MapR’s unique capabilities, such as NFS and snapshots. The VM is also a great asset if you are new to Hadoop, because you can be up and running in any environment (e.g., your laptop) within minutes.
  • Additional performance improvements. The MapR distribution is already 2-5x faster than other distributions on typical Hadoop workloads, including the standard DFSIO and Terasort benchmarks, resulting in a significant hardware cost reduction. The 1.2 release continues to push the envelope, with a number of performance improvements in the platform (file system and MapReduce layers).

It’s no surprise to hear that data is growing quickly. An IDC study earlier this year confirmed that data is growing faster than Moore’s Law. This means that however you’re processing data today, tomorrow you’re going to be doing it with many more servers. Clusters will continue to expand within your environment.

Put another way, the rate of data growth has changed the bottleneck. The network is now the bottleneck, not the disk. The sheer amount of data to analyze makes it unwieldy to drag across the network. It’s much more efficient to run the computation where the data lives and send only the results over the network.
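
To make the idea concrete, here is a minimal sketch in plain Python, assuming two hypothetical nodes that each hold a few log lines. It is not Hadoop’s API; the names `map_local`, `reduce_counts`, and `node_data` are made up for illustration. Each node counts words in its own data, and only the small per-node summaries are combined.

```python
from collections import Counter


def map_local(records):
    """Runs where the data lives: count the words in one node's records."""
    counts = Counter()
    for line in records:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Runs once, on the small per-node summaries sent over the network."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical data: each key stands for a node holding its own log lines.
node_data = {
    "node-a": ["error disk full", "info job started"],
    "node-b": ["error disk full", "error timeout"],
}

# Only the Counter summaries cross the (simulated) network, not the raw lines.
partials = [map_local(records) for records in node_data.values()]
totals = reduce_counts(partials)
print(totals.most_common(3))
```

The raw lines never leave their node in this model; what travels is a summary whose size depends on the number of distinct words, not on the volume of data.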

This introduces a new computing paradigm, and it is the driver for MapReduce. A poster child for this is Google. We now take Google’s dominance for granted, but when Google launched their beta in 1998 they were late. They were the 19th search engine to enter the market. Yahoo was dominant, and there were Infoseek, Excite, Lycos, Ask Jeeves, AltaVista, and a host of others. Within two years Google was the leader. It wasn’t until Google published a paper in 2003 that we got a glimpse of their back-end architecture. Google was able to reach dominance because they recognized the paradigm shift early on and were able to index more data, get better results, and do it much, much more efficiently and cost-effectively than their competitors. They went from 19th to first in a few short years because of MapReduce.

A Yahoo engineer by the name of Doug Cutting read that same paper in 2003 and developed a Java implementation of MapReduce. Named after his son’s stuffed elephant, it became the basis for the open source Hadoop project. Hadoop has grown to include a robust ecosystem. MapR is dedicated to expanding the capabilities of Hadoop to bring the full promise of MapReduce to all organizations. Given the incredible power of MapReduce, it’s important for your organization to realize the benefits before the 19th player in your market moves to dominance. We’re here to help.


At Hadoop World last week we announced the MapR Academy. This is our free training resource, with videos and documents to help administrators, developers, and business users get the information they need to be effective and get the most out of their Big Data. We had several MapR Virtual Trainers attend the conference with iPads loaded with our training videos and access to the website to show the full complement of our training resources. The response was fantastic. Our Virtual Trainers had a great time interacting with other attendees. We had a video crew collecting feedback on which training topics Hadoop users find useful and what they’d like to see in the future. We will continue to add materials, and if there are any training topics you would like to see covered, let us know.


Recently a world record was claimed for a Hadoop benchmark. MapR has run numerous benchmarks in which it performs 2 to 5 times faster than other distributions, and has published these results. So a world record was quite a claim. We were surprised to see that this world record was for a TeraSort benchmark on 100GB of data.

TeraSort is a standard benchmark, and the name is derived from “sorting a terabyte”. Any record claim for sorting a 100GB dataset across a 20-node cluster with 10 times as much memory is comical. The test is named TeraSort, not GigaSort. The world-record claim is like someone splashing across a kiddie pool and announcing a swimming record. It doesn’t tell you anything about how fast they swim, and it’s quite possible that this “world record holder” may drown before reaching the deep end of a real pool.

Hadoop is about Big Data and maintaining performance while scaling. TeraSort is a benchmark for Big Data, not Big Memory. Modern machines have a 1000-to-1 ratio of disk to memory, and any benchmark for Big Data should reflect that fact by ensuring that the data processed is indeed Big.

It’s more than a little disingenuous to consider the world record claim representative of Big Data. A representative benchmark should involve data that’s at least 10X the memory (and not 1/10th, as in the kiddie pool test). In particular, the test doesn’t show how well their code handles disk operations, which are the slowest part of a TeraSort test. Similarly, speed claims based on an HBase test where everything fits into memory aren’t valid.
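
The 10X rule of thumb can be written as a one-line check. This is a minimal sketch with hypothetical function names; the 1,000GB memory figure is derived from the “10 times as much memory” description above, not from a published cluster spec.

```python
def data_to_memory_ratio(dataset_gb: float, cluster_memory_gb: float) -> float:
    """How many times larger the dataset is than total cluster memory."""
    return dataset_gb / cluster_memory_gb


def is_representative(dataset_gb: float, cluster_memory_gb: float,
                      min_ratio: float = 10.0) -> bool:
    """True when the dataset is at least min_ratio times the cluster memory."""
    return data_to_memory_ratio(dataset_gb, cluster_memory_gb) >= min_ratio


def min_dataset_gb(cluster_memory_gb: float, min_ratio: float = 10.0) -> float:
    """Smallest dataset that satisfies the rule of thumb for a given cluster."""
    return cluster_memory_gb * min_ratio


if __name__ == "__main__":
    # The "kiddie pool" run: 100GB of data, roughly 10x that in memory (assumed 1,000GB).
    print(data_to_memory_ratio(100, 1000))
    print(is_representative(100, 1000))
    print(min_dataset_gb(1000))
```

By this measure the cluster in question would need a dataset on the order of 10TB, a hundred times the one used, before the run said much about disk-bound performance.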

Just as it is silly to time a run across a kiddie pool and extrapolate that to a marathon swimming distance, so it is invalid to take results for the 100GB memory-only test and extrapolate that rate to a 1TB TeraSort. But even if we did perform a straight-line extrapolation on the 100GB sort of 130 seconds, it would take 22 minutes to perform a 1TB TeraSort. MapR has published a 1TB TeraSort result that took 22 minutes on a 10-node cluster with half the CPUs and 1/4th the memory. Given the differences in cluster capabilities, MapR is at least 2.5X faster, and that’s in comparison with a highly questionable extrapolation for the other distribution. The conclusion here is that this world record is not a splash, but is certainly all wet.
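
The straight-line extrapolation above is easy to reproduce. The sketch below assumes 1TB = 1,000GB and uses a hypothetical helper name. It covers only the linear step, since the 2.5X figure also depends on how the hardware differences are weighted.

```python
def extrapolate_seconds(measured_seconds: float, measured_gb: float,
                        target_gb: float) -> float:
    """Scale a measured sort time linearly with dataset size."""
    return measured_seconds * target_gb / measured_gb


# 130 seconds for 100GB, scaled to 1TB (assumed to be 1,000GB).
seconds = extrapolate_seconds(130, 100, 1000)
minutes = seconds / 60
print(f"{seconds:.0f} seconds is about {minutes:.1f} minutes")
```

The result, 1,300 seconds, rounds to the 22 minutes quoted above.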


Oracle announced an Oracle Big Data Appliance (BDA) including Hadoop at Oracle OpenWorld this week. Oracle is packaging the Apache code on the BDA appliance. The announcement didn’t include any important 3rd party partnerships or any important innovations for Hadoop.

According to published information, the BDA appliance will likely be a full rack with 18 2U servers, each with 12x2TB, 48GB RAM and 2×6 cores. In addition to Apache Hadoop, it includes a new Oracle NoSQL Database, several Oracle-Hadoop connectors and the R library. This product is not available yet, so the technical details are limited.
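
For a sense of scale, the rack totals implied by those figures can be worked out as below. This is a sketch under stated assumptions: “12x2TB” is read as twelve 2TB drives per server, “2×6 cores” as two six-core CPUs, 1TB is taken as 1,000GB, and the `Rack` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rack:
    """Rack configuration as read from the published figures (assumptions noted above)."""
    servers: int = 18
    drives_per_server: int = 12
    drive_tb: int = 2
    ram_gb_per_server: int = 48
    sockets_per_server: int = 2
    cores_per_socket: int = 6

    @property
    def raw_disk_tb(self) -> int:
        return self.servers * self.drives_per_server * self.drive_tb

    @property
    def total_ram_gb(self) -> int:
        return self.servers * self.ram_gb_per_server

    @property
    def total_cores(self) -> int:
        return self.servers * self.sockets_per_server * self.cores_per_socket

    @property
    def disk_to_memory_ratio(self) -> float:
        # Per-server raw disk in GB (1TB taken as 1,000GB) over per-server RAM.
        disk_gb = self.drives_per_server * self.drive_tb * 1000
        return disk_gb / self.ram_gb_per_server


if __name__ == "__main__":
    rack = Rack()
    print(f"Raw disk: {rack.raw_disk_tb}TB")
    print(f"RAM: {rack.total_ram_gb}GB")
    print(f"Cores: {rack.total_cores}")
    print(f"Disk-to-memory ratio: {rack.disk_to_memory_ratio:.0f}:1")
```

Under these readings a full rack comes to 432TB of raw disk, 864GB of RAM, and 216 cores, with roughly a 500-to-1 ratio of disk to memory per server. Usable capacity would be lower once replication is accounted for.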

From what we’ve seen, customers prefer to buy off-the-shelf servers and configure the software themselves, because this approach results in lower costs and more flexibility. From this view, Hadoop is more like a database than like storage. Storage is typically sold as an integrated appliance, whereas most databases are software only.

Oracle joins a growing list of commercial vendors providing Hadoop-related products and services. Hadoop has grown beyond an Apache project to include EMC, IBM, Oracle and focused Hadoop companies like MapR. Offering organizations a broad and deep set of products and services only fuels Hadoop growth and makes Hadoop an even safer big data platform choice for customers.

While we continue to welcome new Hadoop distributions into the market, MapR remains alone in its approach of providing a differentiated Hadoop distribution that includes deep innovations in performance, ease of use, and dependability. We’re focused on maintaining our technology lead and are continuing to innovate to provide customers real value and a unique choice.


The Strata Conference was held last week in New York (not to be confused with the Strata Summit, which took place the two days prior at the Marriott, or the Strata Jumpstart, which was held on Monday). Well, maybe it was confusing. The Strata Conference covered the entire data supply chain, from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. There was a lot of energy and buzz that wasn’t dampened by the rain or by the traffic caused by the UN being in session and the blockades for Obama.

Our own Ted Dunning presented a talk on “Benefiting from MapReduce without the Risk”. He started with a report card on Hadoop. Overall the report card was positive, but Ted highlighted two areas for closer review: Hadoop wasn’t working to its potential, and it didn’t play well with others. Part of not working to its potential is that underlying architectural issues result in downtime and data loss. At this point in the talk, Ted put on his red MapR baseball cap to denote that he was now representing MapR, and talked about MapR’s specific improvements that address these issues.

Not playing well with others refers to the difficulty in integrating Hadoop into environments, and getting data in and out of Hadoop. Ted again put on his hat and provided specifics on MapR.

The talk also went through vignettes drawn from real-life situations that exposed the challenges customers have faced, and described how large-scale analytical technologies can be deployed without disrupting existing applications. Organizations are beginning to analyze and derive business value from large amounts of data that, in many cases, were previously simply being discarded.

So here is one of the takeaways from the Strata Conference (or the Strata Summit, I can’t remember which): don’t discard data. It could be valuable with the right solution.
