Hadoop Summit 2014
San Jose, CA
Tuesday, June 3, 2014
Thursday, June 5, 2014

MapR Technologies is proud to be a platinum sponsor of Hadoop Summit 2014, the leading conference for the Apache Hadoop community. This event will feature many of the Apache Hadoop thought leaders who will showcase successful Hadoop use cases, share development and administration tips and tricks, and educate organizations about how best to leverage Apache Hadoop as a key component in their enterprise data architecture.


Delivering on the Hadoop/HBase Integrated Architecture

Dale Kim

June 3, 2014 4:35pm-5:15pm

Hadoop has gained tremendous traction in the past few years as a massively scalable, enterprise-wide analytical system. Its flexibility and cost advantage make it an attractive platform for running batch-oriented and interactive analytics on large data sets.

The growing popularity of in-Hadoop databases like HBase lets businesses consolidate their operational and analytical workloads into a single cluster. This overcomes the traditional model of copying data from the operational system to the analytical system, which often creates significant latency and introduces additional complexity and cost.

Dale will describe how businesses can begin to integrate operational data into their analytics clusters. He will describe the tools businesses should use, including SQL-on-Hadoop technologies and NoSQL databases like HBase. The combination of these tools offers the familiarity of SQL with the speed and scalability of NoSQL databases.

Architecting R into the Storm Application Development Process

Allen Day

June 3, 2014 5:25pm-6:05pm

The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business-critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.

In this presentation, Allen will build a bridge from basic real-time business goals to the technical design of solutions. He will take a real-world use case, compose an implementation of it as Storm components (spouts, bolts, etc.), and highlight how R can be an effective tool in prototyping a solution.

How to Find What You Didn't Know to Look For, Practical Anomaly Detection

Ted Dunning

June 4, 2014 2:35pm-3:15pm

Anomaly detection is the art of automating surprise. To do this, we have to be able to define what we mean by normal and recognize what it means to be different from that. The basic ideas of anomaly detection are simple. You build a model and you look for data points that don’t match that model. The mathematical underpinnings of this can be quite daunting, but modern approaches provide ways to solve the problem in many common situations.

Ted will describe these modern approaches with particular emphasis on several real use cases, including rate shifts in web traffic or purchases, time series, and topic spotting to determine when new topics appear in text.
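
As a rough illustration of the "build a model, then look for points that don't match it" idea in this abstract, here is a minimal Python sketch. It is not taken from the talk: the choice of a median plus median absolute deviation as the model of "normal", the function names fit_normal_model and find_anomalies, the threshold of 5.0, and the sample numbers are all assumptions made for the example.

```python
import statistics

def fit_normal_model(values):
    """Summarize 'normal' as a median and a median absolute deviation (MAD)."""
    center = statistics.median(values)
    spread = statistics.median([abs(v - center) for v in values])
    return center, spread

def find_anomalies(values, center, spread, threshold=5.0):
    """Return the values that sit more than `threshold` MADs from the center."""
    if spread == 0:
        # Degenerate model: every historical value was identical,
        # so anything different counts as surprising.
        return [v for v in values if v != center]
    return [v for v in values if abs(v - center) / spread > threshold]

# Hypothetical hourly page-view counts used to learn what "normal" looks like.
history = [100, 102, 98, 101, 99, 103, 97, 100]
center, spread = fit_normal_model(history)

# New observations scored against the learned model.
print(find_anomalies([101, 99, 250, 100, 12], center, spread))
```

Real deployments would typically need a model that accounts for trends and seasonality; this sketch only shows the basic shape of the approach.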

Hadoop and R Go to the Movies, Visualization in Motion

Ted Dunning

June 4, 2014 5:25pm-6:05pm

Apache Hadoop excels at large computations. R excels at static visualization. When used together, R and Hadoop can be used to produce compelling videos that can explain difficult relationships even more effectively than static pictures. This isn't easy, however, and even if a picture is worth a thousand words and a video a few more, you need to have tools that make it easier to produce a video than it is to write an essay.

Ted will demonstrate how standard tools like R can produce videos (slowly) and how Hadoop can be used to do this quickly. Sample code will be provided that illustrates just how this can be done. He will also demonstrate how you can adapt these examples to your own needs.

How to Determine Which Algorithms Really Matter

Ted Dunning

June 5, 2014 2:10pm-3:25pm

Figuring out what really matters in data science can be very hard. The set of algorithms that matter theoretically is very different from the ones that matter commercially. Commercial importance often hinges on ease of deployment, robustness against perverse data and conceptual simplicity. Often, even accuracy can be sacrificed against these other goals. Commercial systems also often live in a highly interacting environment so off-line evaluations may have only limited applicability.

In this talk, Ted will show how to tell which algorithms really matter. He will then describe several commercially important algorithms, such as Thompson sampling (aka Bayesian Bandits), result dithering, on-line clustering, and distribution sketches, and explain what makes them important in industrial settings.
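
For readers unfamiliar with Thompson sampling, the following Python sketch shows one common form of it, a Beta-Bernoulli bandit. It is an illustrative assumption, not code from the talk: the function names choose_arm and run_bandit, the uniform Beta(1, 1) prior, and the simulated conversion rates are all hypothetical.

```python
import random

def choose_arm(successes, failures, rng):
    """Draw one sample per arm from its Beta posterior and pick the largest."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1) prior, an assumed choice
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)

def run_bandit(true_rates, rounds=2000, seed=0):
    """Simulate a bandit whose arms pay off with the given (hidden) rates."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        arm = choose_arm(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

# Two hypothetical page variants with very different conversion rates.
successes, failures = run_bandit([0.05, 0.5])
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)  # most traffic should end up on the better variant
```

The appeal in a commercial setting is that the whole method is a few lines: exploration and exploitation both fall out of sampling from the posterior, with no separate tuning schedule.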


Dale Kim

Dale is the Sr. Director of Industry Solutions at MapR. His background includes a variety of technical and management roles at information technology companies. While his experience includes work with relational databases, much of his career pertains to non-relational data in the areas of search, content management, and NoSQL, and includes senior roles in technical marketing, sales engineering, and support engineering. Dale holds an MBA from Santa Clara University, and a BA in Computer Science from the University of California, Berkeley.

Allen Day

Allen is the Principal Data Scientist at MapR Technologies, where he leads interdisciplinary teams to deliver results in fast-paced, high-pressure environments across several verticals in industry. Previously, Allen founded TinyTube Networks, which provided the first mobile video discovery and transcoding proxy service, and Ion Flux, which provided a medical-grade, cloud-based human genome sequencing service.

Allen has contributed to a wide variety of open source projects: R (CRAN, Bioconductor), Perl (CPAN, BioPerl), FFmpeg, Cascading, Apache HBase, Apache Storm, and Apache Mahout. Overall, his unique background combines deep technical expertise in data science with a pragmatic understanding of real-world problems. He also pursues interests in linguistics and economics, and — if it hadn’t been obvious — he performs magic.

Ted Dunning

Ted Dunning is Chief Application Architect at MapR Technologies and a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. Ted has been very active in mentoring new Apache projects and is currently serving as vice president of incubation for the Apache Software Foundation. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (later purchased by LifeLock), and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.