Strata + Hadoop World 2015
San Jose, CA
Tuesday, February 17, 2015
Friday, February 20, 2015
Strata + Hadoop World brings together the best minds in strategy, science, and industry for the defining event of the data industry. Explore solutions to challenging problems, connect with the brightest minds in data, and find out what’s new in emerging technologies and Apache Hadoop.


Keynote: Impacting Business as it Happens

Anil Gadre

February 19, 2015 at 9:15am

To get value out of today’s big and fast data, organizations must evolve beyond traditional analytic cycles that are heavy with data transformation and schema management. The Hadoop revolution is about merging business analytics and production operations to create the 'as-it-happens' business. It’s not a matter of running a few queries to gain insight for the next business decision, but of changing the organization's fundamental metabolic rate. It is essential to take a data-centric approach to infrastructure to provide flexible, real-time data access, collapse data silos, and automate data-to-action for immediate operational benefits.

Real World Use Cases: Hadoop and NoSQL in Production

Ted Dunning

February 19, 2015 at 10:40am - co-presenting with Ellen Friedman, Solutions Consultant

What’s important about a technology is what you can use it to do. We’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and we’d like to relay what worked well for them and what did not. Drawing from real-world use cases, we show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Among the examples presented, we examine the use of a Hadoop and NoSQL foundation to detect security threats in financial settings, to optimize data warehouse utilization, to improve marketing efficiency in a cost-effective way, and to build a huge biometric database with the goal of changing society. The examples presented should be helpful to business audiences and developers alike.

There are consistent themes in successful projects that aren’t what most people expect. We will describe some of the themes we have found in our survey of projects.

Stream Processing Everywhere -- What to Use

Jim Scott

February 19, 2015 at 11:30am

Processing data from social media streams and sensor devices in real time is becoming increasingly prevalent, and there are plenty of open source solutions to choose from. To help practitioners decide what to use when, we compare three popular Apache projects that support stream processing: Apache Storm, Apache Spark, and Apache Samza.

Drill into Drill: How Providing Flexibility and Performance is Possible

Jacques Nadeau

February 19, 2015 at 2:20pm

This will be a deep technical talk on how Drill can achieve lightning-fast performance and provide ground-breaking flexibility and ease of use. Key things we’ll cover include:

• The nature of query planning and statistics in a first-read scenario.

• Switching between runtime and compile time code generation depending on workload nature.

• Extensive coverage of code optimization and planning techniques.

• Support for partial dynamic schema subsets and how they enable high performance.

• Advanced memory use, columnar in-memory execution, and moving between Java and C as necessary.

• Making a statically typed language appear dynamic through the use of the Any type and multi-phased planning.

Maintaining Low Latency While Maximizing Throughput on a Single Cluster

Yuliya Feldman

February 19, 2015 at 4:50pm

The good news: Hadoop has a lot of tools. The bad news: Hadoop has a lot of use cases with conflicting sensitivities. And a lot of tools. This talk will showcase how new advances in YARN and Mesos provide the ability to truly run multiple distinct workloads together. It covers how SLA and latency rules, combined with preemption within YARN and dummy resource allocations, allow you to maintain high cluster throughput while guaranteeing latency SLAs for interactive and low-latency applications such as Apache HBase and Drill. NOTE: the password to the video is “mapr2015”.

YARN vs. MESOS: Can’t We All Just Get Along?

Ted Dunning

February 20, 2015 at 2:20pm

In the battle for datacenter resource management, there are two heavyweights duking it out for the world championship. In the red corner is YARN, a big data contender and the successor to MapReduce 1. In the blue corner is Mesos, with its UC Berkeley pedigree and its proven performance at Twitter, Airbnb, and Netflix. This is a battle that Don King would be ecstatic to promote. But maybe we could build a more powerful fighter by combining the best of both. What if you didn’t have to choose? What if you could use both Mesos and YARN in concert, each for what it is especially good at? In this talk we will cover:

  • The differences between YARN and Mesos

  • How typical datacenters deploy both of these technologies in isolation

  • Why they are seen as competitors

  • How they can, instead, be used together

  • A demonstration of YARN and Mesos collaboratively sharing cluster resources

  • Case studies of actual production implementations



Hadoop as a Platform for Genomics

Allen Day

February 20, 2015 at 4:00pm

Personalized medicine holds much promise to improve the quality of human life. However, personalizing medicine depends on genome analysis software that does not scale well. Given the potential impact on society, genomics takes first place among fields of science that can benefit from Hadoop.

A single human genome contains about 3 billion base pairs. This is less than 1 gigabyte of data, but the intermediate data produced by a DNA sequencer, which is required to produce a sequenced human genome, is many hundreds of times larger. Beyond the huge storage requirement, deep genomic analysis across large populations of humans requires enormous computational capacity as well.
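
The "less than 1 gigabyte" figure can be checked with a quick back-of-the-envelope calculation. The sketch below is illustrative only and is not taken from the talk: it assumes each base (A, C, G, or T) is stored in 2 bits with no compression, quality scores, or metadata, and the function name is hypothetical.

```python
# Rough size of one human genome under a simple 2-bit-per-base encoding.
# Assumptions (not from the talk): 4 possible bases fit in 2 bits each,
# with no compression, quality scores, or metadata.

BASE_PAIRS = 3_000_000_000  # "about 3 billion base pairs", per the abstract
BITS_PER_BASE = 2           # assumption: A, C, G, T encoded in 2 bits


def genome_size_bytes(base_pairs: int = BASE_PAIRS,
                      bits_per_base: int = BITS_PER_BASE) -> int:
    """Return the raw storage size in bytes for the given number of bases."""
    return base_pairs * bits_per_base // 8


size = genome_size_bytes()
print(f"{size / 1e9:.2f} GB")  # prints "0.75 GB"
```

Under these assumptions the finished genome fits in roughly 750 MB, which is consistent with the claim above. The much larger storage burden comes from the intermediate sequencer output, which this sketch does not model.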

Interestingly enough, while genome scientists have adopted the concept of MapReduce for parallelizing I/O, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally.

Efforts exist for adapting existing genomics data structures to Hadoop, but these don’t support the full range of analytic requirements. Our approach is to implement an end-to-end analysis pipeline based on GATK and running on Hadoop. The benefit of combining GATK and Hadoop is two-fold. First, Hadoop provides a more cost-effective solution than a traditional HPC+SAN substrate. Second, Hadoop applications are much easier for software engineers to design and scale. In this work, we show what it took to run GATK on Hadoop and then we show example results and scaling characteristics.

Our solution is elegant and follows both the Hadoop and the GATK best practices. Results can be generated on easily available hardware and users can expect immediate ROI by moving existing GATK use cases to Hadoop.

Act Fast: Enabling a Real-time Unified Data Architecture

Steve Wooledge

February 19, 2015 at 3:10pm in Teradata's booth 1315

The value of big data is not only in decision support for a few big decisions, but in thousands of smaller decisions made within milliseconds. Come hear MapR talk about the growing role of Hadoop for streaming and operational applications in the Teradata Unified Data Architecture.

Example applications include fraud detection, ad platforms making constant improvements to content relevancy, and on-the-fly adjustments in manufacturing production or energy generation.


Anil Gadre

Anil Gadre is the SVP of Product Management at MapR. Prior to MapR, Anil was the EVP of Product Management at Silver Spring Networks, responsible for product strategy, planning, and marketing of networking and software products focused on the Smart Grid for the energy industry. Before that, Anil was with Sun Microsystems, a Fortune 200 technology leader, where he served as EVP of the Application Platform Software organization and had previously been the Chief Marketing Officer, leading global branding, demand creation, and an extensive developer ecosystem program. At Sun Microsystems his experience covered diverse product lines ranging from networked desktop and enterprise server systems to market-leading software products such as the Solaris operating system, Java, the MySQL database, and various middleware products. He has a BSEE from Stanford University and an MM degree from the Kellogg School at Northwestern University.

Ted Dunning

Ted Dunning is Chief Application Architect at MapR Technologies and a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. Ted has been very active in mentoring new Apache projects and is currently serving as vice president of incubation for the Apache Software Foundation. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (later purchased by LifeLock), and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Jim Scott

Jim Scott drives enterprise architecture and strategy at MapR. He is the cofounder of the Chicago Hadoop Users Group and, in that role, has helped build the Hadoop community in Chicago over the past four years. He has implemented Hadoop at three different companies, supporting a variety of enterprise use cases, from managing Points of Interest for mapping applications, to Online Transactional Processing in advertising, to full data center monitoring and general data processing. Prior to MapR, Jim was SVP of Information Technology and Operations at SPINS, the leading provider of retail consumer insights, analytics reporting, and consulting services for the Natural, Organic and Specialty Products industry. Additionally, he served as Lead Engineer/Architect for dotomi, one of the world’s largest and most diversified digital marketing companies. Prior to dotomi, Jim held several architect positions with companies such as aircell, NAVTEQ, Classified Ventures, Innobean, Imagitek, and Dow Chemical, where his work with high-throughput computing was a precursor to more standardized big data concepts like Hadoop.

Jacques Nadeau

Jacques Nadeau leads Apache Drill development efforts at MapR Technologies. He is an industry veteran with over 15 years of big data and analytics experience. Most recently, he was cofounder and CTO of search engine startup YapMap. Before that, he was director of new product engineering with Quigo (contextual advertising, acquired by AOL in 2007). He also built the Avenue A | Razorfish analytics data warehousing system and associated services practice (acquired by Microsoft).

Yuliya Feldman

Yuliya Feldman is Principal Software Engineer at MapR, where she is responsible for security initiatives, development, currency, and advancement of Open Source projects supported by MapR. She has over 20 years of software development experience, including seven years at eBay as a Staff Software Engineer and seven years working with Hadoop. Since joining MapR in 2010, Yuliya has started and participated in the development of a number of key MapR features. Yuliya holds an M.S. degree in Applied Mathematics from V.N. Karazin Kharkiv National University in Ukraine.

Allen Day

Allen is the Principal Data Scientist at MapR Technologies, where he leads interdisciplinary teams to deliver results in fast-paced, high-pressure environments across several industry verticals. Previously, Allen founded TinyTube Networks, which provided the first mobile video discovery and transcoding proxy service, and Ion Flux, which provided a medical-grade, cloud-based human genome sequencing service.

Allen has contributed to a wide variety of open source projects: R (CRAN, Bioconductor), Perl (CPAN, BioPerl), FFmpeg, Cascading, Apache HBase, Apache Storm, and Apache Mahout. Overall, his unique background combines deep technical expertise in data science with a pragmatic understanding of real-world problems. He also pursues interests in linguistics and economics, and — if it hadn’t been obvious — he performs magic.

Steve Wooledge

Steve is Vice President, Product Marketing at MapR, and is responsible for identifying new market opportunities and increasing awareness for MapR technical innovations and solutions for Hadoop. Steve was previously Vice President of Marketing for Teradata Unified Data Architecture, where he drove Big Data strategy and market awareness across the product line, including Apache Hadoop. Steve has also held various roles in product and corporate marketing at Aster Data, Interwoven, and Business Objects, as well as sales and engineering roles at Business Objects and Dow Chemical. When not working, Steve enjoys juggling activities for his 5 kids and sneaking in some cycling or ski trips when he can.