Berlin Buzzwords is a conference for developers and users of open source software projects, focussing on scalable search, data analytics in the cloud and NoSQL databases. MapR is proud to be delivering three key presentations during the two-day conference.
Ted Dunning, Chief Application Architect, and Michael Hausenblas, Chief Data Engineer
Apache Drill Implementation Deep Dive
Apache Drill is an exciting project that aims to provide SQL query capabilities for a wide variety of data sources in an extensible way.
But the technologies underneath the implementation are also very exciting even outside of the context of Drill itself. These ideas can be repurposed for a wide variety of other uses either by directly extracting code from Drill, or by using the philosophies and ideas in new forms.
I will talk about how Drill goes about several key tasks including:
Multi-modal Recommendation Algorithms
Classic collaborative filtering uses a single kind of behavior recorded as a relation between users and a single class of objects. The classic examples include people buying books, people watching movies or people listening to music.
In the real world, however, people interact in many ways with many kinds of things. They even interact with intangible entities such as musical styles or cuisines. They say things. They go places. These many kinds of behavior give strong clues about what else these people might like to do, and recommendation engines should use these multi-modal cues to make better recommendations.
I will describe how the basic mathematical structure of recommendation engines can be extended to account for these many kinds of behavior using a "pivot-set" representation. This leads to an extremely straightforward outline for how to design a multi-modal recommendation algorithm.
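The abstract does not define the "pivot-set" representation, so the sketch below does not attempt to reproduce it. It only illustrates the general idea of letting a second kind of behavior act as evidence for recommendations, using a plain cross-cooccurrence count. The function names, the toy users and the two behaviors (purchases and searches) are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from itertools import product


def cross_cooccurrence(primary, secondary):
    """Count, across users, how often each secondary-behavior item
    appears in the same user's history as each primary-behavior item.

    Both arguments map a user id to a set of items. Passing the same
    mapping twice gives the classic single-behavior cooccurrence.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for user, primary_items in primary.items():
        secondary_items = secondary.get(user, ())
        for cue, target in product(set(secondary_items), set(primary_items)):
            counts[cue][target] += 1
    return counts


def recommend(history, counts, already_seen=(), top_n=3):
    """Score primary items by summing the counts of every cue in `history`."""
    scores = defaultdict(int)
    for cue in history:
        for target, count in counts.get(cue, {}).items():
            if target not in already_seen:
                scores[target] += count
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Hypothetical toy data: two kinds of behavior for the same users.
purchases = {
    "ann": {"jazz_album", "blues_album"},
    "bob": {"jazz_album"},
    "cat": {"cookbook"},
}
searches = {
    "ann": {"saxophone"},
    "bob": {"saxophone", "trumpet"},
    "cat": {"pasta"},
}

# Evidence linking what people searched for to what they bought.
buy_given_search = cross_cooccurrence(purchases, searches)

# A new user who has only searched can still get purchase recommendations.
print(recommend({"saxophone"}, buy_given_search))
```

Raw counts are the crudest possible score. A real system would normally replace them with a statistical test of whether a cooccurrence is surprising, so that very popular items do not dominate every recommendation.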
Mathematics and algorithms are not enough, however. Real-world implementation is required.
Happily, multi-modal recommendation algorithms can be implemented in a remarkably simple fashion. I will provide a complete outline of how this can be done using standard tools like Pig, Mahout and Solr. I will also provide concrete examples of how this works in the real world.
Real-time Learning for Fun and Profit
I will describe how real-time Bayesian learning can optimize real processes. These processes can be websites, user interfaces, ad-targeting servers, back-end server farms, search engines or physical systems. Google, Microsoft and Yahoo have all adopted the algorithms described in this talk for ad targeting because they produce superior results, but you don't have to operate at that scale to benefit.
The positive impact of these new learning techniques can be massive. Conventional testing techniques are harder to implement, and they make it harder for the people who consume test results to understand them and to take correct actions. Worst of all, conventional techniques waste enormous amounts of precious user data, which makes it harder to react quickly.
Counter-intuitively, while Bayesian techniques for real-time learning are based on complex mathematical theories that have only recently been fully understood, these techniques are actually very simple conceptually, are much easier to implement correctly than conventional statistical approaches, and produce results that are much easier to understand, especially for non-statisticians.
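The abstract does not name a specific algorithm. As one illustration of how little code a Bayesian real-time learner can need, here is a minimal sketch of Thompson sampling over Beta posteriors, a well-known technique of this kind. Whether it matches what the talk presents is an assumption; the class name, the simulated conversion rates and the round count are hypothetical.

```python
import random


class BetaBandit:
    """Thompson sampling for options with success/failure outcomes.

    Each option keeps a Beta(successes + 1, failures + 1) posterior over
    its unknown success rate (a uniform prior before any data arrives).
    """

    def __init__(self, n_options, rng=None):
        self.successes = [0] * n_options
        self.failures = [0] * n_options
        self.rng = rng if rng is not None else random.Random()

    def choose(self):
        # Draw one plausible success rate per option from its posterior
        # and play the option whose draw is highest.
        draws = [
            self.rng.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, option, success):
        if success:
            self.successes[option] += 1
        else:
            self.failures[option] += 1


def simulate(true_rates, rounds=5000, seed=0):
    """Run the bandit against simulated users; return plays per option."""
    rng = random.Random(seed)
    bandit = BetaBandit(len(true_rates), rng)
    plays = [0] * len(true_rates)
    for _ in range(rounds):
        option = bandit.choose()
        plays[option] += 1
        bandit.update(option, rng.random() < true_rates[option])
    return plays


# Hypothetical conversion rates for three page variants.
plays = simulate([0.05, 0.10, 0.30])
print(plays)
```

Uncertain options still get tried occasionally because their posteriors are wide, while traffic shifts toward the better options as evidence accumulates. That is how an approach like this avoids spending a fixed share of users on variants that are already clearly worse.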
In this talk, the audience will not only learn the basic ideas behind effective real-time learning, but also see detailed implementation techniques and learn how to architect effective testing systems. I will also cover methods for starting gently without massive system upheavals and how to build consensus around real-time learning and optimization. The focus throughout the talk will be on practical methods that can be applied in real life.