Strata + Hadoop World 2014
New York, NY
Wednesday, October 15, 2014
Friday, October 17, 2014

Strata + Hadoop World in New York is where big data's most influential business decision makers, strategists, architects, developers, and analysts gather to shape the future of their businesses and technologies. MapR is proud to be an Elite Sponsor of Strata + Hadoop World 2014.


Doing the Impossible (Almost): A Survey of Approximation Algorithms that Make Queries Vastly Faster

Ted Dunning

October 15, 2014 at 9:05am

Part of Hardcore Data Science Track

Computing quantities such as medians or the number of unique elements exactly requires a lot of time, a lot of memory, or both. It is, however, possible to get very close to the exact answer with much less time and much less memory. Some of these algorithms are much simpler than you might expect. Ted will describe a selection of these algorithms, including some not-yet-published results. Ted will also outline how these algorithms can be applied to practical problems like anomaly detection.
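As a rough illustration of the trade-off described above, the sketch below estimates the number of unique elements with a k-minimum-values estimator. This is one well-known approximation technique, chosen here as an assumption; it is not necessarily among the algorithms covered in the talk. The function names and the default k are hypothetical.

```python
import hashlib
import heapq


def _hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(items, k=1024):
    """Estimate the distinct count while keeping only the k smallest hashes."""
    heap = []        # max-heap via negation: holds the k smallest hash values
    members = set()  # hash values currently in the heap, to skip duplicates
    for item in items:
        h = _hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New value is smaller than the largest kept value: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: count is exact
    kth_smallest = -heap[0]
    # If n distinct hashes are uniform on [0, 1), the k-th smallest is near k/n.
    return int((k - 1) / kth_smallest)
```

Memory use is bounded by k regardless of input size, and the relative error shrinks roughly as 1/sqrt(k), so a larger k buys accuracy at the cost of memory.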

Getting Started with HBase Application Development (Tutorial)

Carol McDonald & Sridhar Reddy

October 15, 2014 at 1:30pm

Having problems scaling your SQL database?

HBase allows you to build big data applications for scaling your database needs, and this tutorial will help you get a jump start on HBase development. We’ll start with a quick overview of HBase, the HBase data model, and architecture, and then we’ll dive directly into code to help you understand how to build HBase applications. We will also offer guidelines for good schema design, and will cover a few advanced concepts such as using HBase for transactions.

This tutorial will cover:

• An introduction to the HBase data model and HBase architecture

• Setting up a Sandbox [one-node cluster on your laptop]

• Using the HBase shell to create HBase tables and insert data

• An introduction to the basic Java API used to perform CRUD operations on HBase tables

• Understanding how the data flows for writes and reads

• Schema design concepts for rowkey design

• Advanced Java APIs for performing scans and transactions

Just Enough Math (Tutorial)

Allen Day

October 15, 2014 at 1:30pm

This tutorial provides a hands-on programming intro to advanced math for business people — showing “just enough math” to take advantage of some popular open source frameworks, based on a new O’Reilly book by the authors.

The premise is that many people take university-level math, up until the “killing fields” of calculus. Most did not continue beyond that, but still have an interest. Meanwhile, math programs in many universities cling tenaciously to Cold War-era priorities, intent on weeding out people who would not pass requirements as engineers to build missiles, etc.

With the commercial successes of Machine Learning, Cloud Computing, etc., there are very good business cases for having “just enough math” to leverage new kinds of open source tools. These days people in business need to understand more about complex graphs, sparse matrices, Bayesian priors, optimization solvers, etc., which are not hard to learn but are typically placed far beyond calculus in the curriculum.

As a case in point: in preparation for their recent IPO, Twitter overhauled their revenue apps to emphasize applying semigroups, monoids, rings, algebraic graph theory, etc., to leverage functional programming for efficient parallel processing at scale. Those topics may sound obscure, but 50 lines of Python illustrate the math clearly.
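To make the monoid idea concrete, here is a minimal Python sketch. It is an illustration only, not Twitter's code and not taken from the tutorial, and the names lift, combine, and summarize are hypothetical. A monoid is a set of values with an associative combine operation and an identity element; because grouping does not matter, partial results computed on separate chunks can be merged in any order, which is what makes parallel aggregation safe.

```python
from functools import reduce

# Monoid over (count, total) pairs: combine is associative, IDENTITY is neutral.
IDENTITY = (0, 0.0)


def lift(x):
    """Turn a single number into a one-element summary."""
    return (1, float(x))


def combine(a, b):
    """Merge two summaries; (a + b) + c equals a + (b + c)."""
    return (a[0] + b[0], a[1] + b[1])


def summarize(values):
    """Fold a sequence of numbers into one summary."""
    return reduce(combine, map(lift, values), IDENTITY)


def mean(summary):
    count, total = summary
    return total / count if count else float("nan")


data = [3, 1, 4, 1, 5, 9, 2, 6]

# Pretend each chunk was summarized on a different machine.
chunks = [data[:3], data[3:5], data[5:]]
partials = [summarize(chunk) for chunk in chunks]

# Merging the partial summaries gives the same result as one pass over the data.
merged = reduce(combine, partials, IDENTITY)
```

Here merged equals summarize(data), so mean(merged) matches the mean of the whole list even though no single step saw all of the data.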

The formula applied in the tutorial is simple: a series of sections build on each other, where each introduces a few clear math concepts, discusses the history and typical uses, along with a sample business use case illustrating how to leverage that math, followed by brief code examples in Python that show how to solve for the use case. Then we look at open source projects which get used in production for similar kinds of work.

What Would Google Do? Understanding the Future of Big Data

M.C. Srivas

October 16, 2014 at 9:10am

Google knows what it takes to build data-driven businesses. Over the years, Google has faced many technology scaling challenges and answered the call with innovations for big data, as well as big data center compute, big networking, and big storage. Google inspired Hadoop and many other innovations in the market. Luckily, we can understand "what's next" easily because Google sends us postcards from the future. If you want to know what's coming next in big data, just ask yourself, "what would Google do?"

Got the T-shirt: Real Experiences from a Hadoop Veteran

Jim Scott

October 16, 2014 at 11:00am

Jim Scott has "been there, done that, got the t-shirt" when it comes to Hadoop. A veteran of three successful Hadoop projects at companies in ad media and retail markets, Jim will share his experiences and best practices for deploying Hadoop in real-time, mission-critical environments. Come to this session, get a free, fun Hadoop t-shirt of your own, and get specific advice on architecture, administration, database operations, and time-series data, including how it relates to online environments and how it can be applied to the Internet of Things (IoT) and to use cases in data center operations, advertising, communication service providers (CSP), and the industrial Internet. Learn the critical success factors for organizational success with Hadoop, and how to build the right team and skill sets for high-performance Hadoop.

Get Real with Hadoop

Jim Scott

October 16, 2014 at 5:00pm

There is a lot of hype about what's possible with Hadoop. Get real and don't be fooled. Attend this lightning session on the top 10 misperceptions about Hadoop and the differences between Hadoop distributions. Get armed with the information you need to cut through the hype and choose the right solution for your business.


Renaissance in Medicine: Next-Generation Big Data Workloads

Allen Day

October 17, 2014 at 1:45pm

Instead of using 1s and 0s (base2), biological software is encoded as A, T, C, and G (base4). DNA sequencers are simply devices for converting information encoded in base4 to base2. Improvements in DNA sequencing technology are happening at a rate that outstrips even Moore's Law of Computing. As a result, the number of human genomes converted to base2 and uploaded for analysis is rapidly increasing.
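The base4-to-base2 conversion described above can be sketched in a few lines of Python. This is a minimal illustration under an assumed mapping of bases to 2-bit codes (any fixed assignment works); it is not a real sequencing file format, and the function names are hypothetical.

```python
# Assumed mapping: each of the four bases gets its own 2-bit code.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = {code: base for base, code in CODES.items()}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODES[base]
    return value


def decode(value, length):
    """Unpack an integer produced by encode back into a DNA string."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))
```

The length must be stored alongside the packed value, because leading A bases encode as zero bits and would otherwise be lost on decoding.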

Medicine is undergoing a renaissance that is made possible by analyzing and creating insights from this huge and growing number of genomes. Personalized medicine is simply the practical application of these insights.

In this session, Allen will show how ETL and MapReduce can be applied in a clinical setting. He will also show how NoSQL and advanced analytics can be used to "reverse engineer" the genetic causes of disease. Such information can be used to predict and prevent individual suffering, as well as to increase the overall health of a society.


Ted Dunning

Ted Dunning is Chief Application Architect at MapR Technologies and a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. Ted has been very active in mentoring new Apache projects and is currently serving as vice president of incubation for the Apache Software Foundation. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (later purchased by LifeLock) and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Carol McDonald & Sridhar Reddy

Carol is an HBase and Hadoop instructor at MapR Technologies. Carol has extensive experience as a developer and architect building complex, mission-critical applications in the banking, health insurance, and telecom industries. As a Java Technology Evangelist at Sun Microsystems, Carol traveled all over the world speaking at Sun Tech Days, JUGs, companies, and conferences. She is a recognized speaker in Java communities.

Sridhar is Director of Professional Services for MapR Technologies. Sridhar has over 20 years of experience working with Java and Java EE in many roles of the software development life cycle, including design, development, management, training, and technology evangelism. Prior to MapR, Sridhar managed the Java Platform development team at PayPal, where he led a team of Java developers to build the next generation of the Java platform. Prior to PayPal, Sridhar worked as a Technology Evangelist at Sun Microsystems for over 10 years, where he increased awareness and adoption of Java technology in the worldwide developer community. While at Sun, Sridhar also managed the JavaOne Hands-On Labs as well as Sun Tech Days, a worldwide developer conference. Sridhar holds an MS in Computer Science from the Florida Institute of Technology.

Allen Day

Allen is the Principal Data Scientist at MapR Technologies, where he leads interdisciplinary teams to deliver results in fast-paced, high-pressure environments across several verticals in industry. Previously, Allen founded TinyTube Networks which provided the first mobile video discovery and transcoding proxy service, and Ion Flux which provided a medical-grade, cloud-based human genome sequencing service.

Allen has contributed to a wide variety of open source projects: R (CRAN, Bioconductor), Perl (CPAN, BioPerl), FFmpeg, Cascading, Apache HBase, Apache Storm, and Apache Mahout. Overall, his unique background combines deep technical expertise in data science with a pragmatic understanding of real-world problems. He also pursues interests in linguistics and economics, and — if it hadn’t been obvious — he performs magic.

M.C. Srivas

Srivas is MapR's co-founder. Srivas ran one of the major search infrastructure teams at Google, where GFS, BigTable, and MapReduce were used extensively. He wanted to provide that powerful capability to everyone, and started MapR on his vision to build the next-generation platform for semi-structured big data. That vision is shared by all at MapR. Srivas brings to MapR his experiences at Google, Spinnaker Networks, and Transarc in building game-changing products that advance the state of the art.

Jim Scott

Jim drives enterprise architecture and strategy at MapR. Jim Scott is the cofounder of the Chicago Hadoop Users Group. As cofounder, Jim helped build the Hadoop community in Chicago for the past four years. He has implemented Hadoop at three different companies, supporting a variety of enterprise use cases from managing Points of Interest for mapping applications, to Online Transactional Processing in advertising, as well as full data center monitoring and general data processing. Prior to MapR, Jim was SVP of Information Technology and Operations at SPINS, the leading provider of retail consumer insights, analytics reporting and consulting services for the Natural, Organic and Specialty Products industry. Additionally, he served as Lead Engineer/Architect for dotomi, one of the world’s largest and most diversified digital marketing companies. Prior to dotomi, Jim held several architect positions with companies such as aircell, NAVTEQ, Classified Ventures, Innobean, Imagitek, and Dow Chemical, where his work with high-throughput computing was a precursor to more standardized big data concepts like Hadoop.