Ted Dunning and Ellen Friedman discussed their latest publication, “Time Series Databases,” with Mike Hendrickson, Vice President for Content Strategy at O’Reilly Media, during Strata + Hadoop World in Barcelona. This third O’Reilly book by the authors is a guide to effective ways to collect, persist, and access large-scale time series data for analysis. It explores the theory behind time series databases as well as practical methods for implementing them using Hadoop and NoSQL tools.
Mike: So tell us, what’s in the book?
Ted: Recently, MapR contributed code for performance optimizations to OpenTSDB, and when run with MapR-DB, demonstrated ingestion throughput of over 100 million data points per second. But the technique to do this was not PhD-level work; it was relatively simple. With good foundations and good tools, anybody could do it. So that fit perfectly with this series of small books that we’ve been publishing. The goal is to write something digestible that anyone can consume. This book is the latest in a series on pragmatic, practical topics. The book is not so much about machine learning, but it’s certainly a foundation for that topic. Our other books have been foundational as well.
Ellen: There’s a growing interest, particularly in time series data and time series databases. I’m predicting that in 2015, there’s going to be an explosion of interest in this area. It’s not a new idea; it’s something that people have done before. But now you have to be able to do it at a larger scale, especially for sensor data that’s part of the Internet of Things. So people who haven’t been doing it before are now jumping into that, and that’s why we chose this topic.
A few people have asked us if it’s specific to MapR. It’s not—we advise people to use a NoSQL database, and so the things that we describe can be done on Apache HBase or on the MapR database, MapR-DB. The performance numbers that we quote—you would get those performance numbers from MapR. From HBase, you wouldn’t get quite the same performance. But everything else we describe can be done on both. The book is not about the analysis of time series data, but about how to collect the data and build a really efficient database, the design of the database, and how to use an open source tool like OpenTSDB to access the data in a very efficient way.
Mike: Your prior book was just two months ago—was that about anomaly detection?
Ted: Yes, we did a really fun book in the Practical Machine Learning series, called “A New Look at Anomaly Detection.” There have been a lot of developments recently in machine learning that make certain kinds of anomaly detection really easy to implement. In fact, I was able to write code for the examples and get them up and running in one afternoon. Anomaly detection is more and more important as people measure more and more things.
Mike: Prior to that book, did you have another one in the series as well?
Ellen: Yes, that book was titled “Innovations in Recommendation.” It was about how to build a simple but powerful recommender, using some straightforward techniques that Ted developed, and exploiting search technology for implementation. The ideas that are presented in the book are completely accessible, even if you don’t have deep experience in the field. It has a very playful introduction that starts with the premise that “I want a pony.” Most other people want a pony, too; what does that tell you about how to recommend things to people?
Mike: So you have one book coming in the future—are there more?
Ellen: Yes, there are actually about five other titles that we are potentially looking at to continue this Practical Machine Learning series. We think people would like to hear more about which algorithms really matter. We plan to look not just at the particular algorithms that fit this pragmatic approach, but also at the kinds of decisions you would want to make in order to get the best return for the effort.
Do you have a topic that you would like the authors to pursue, or any comments on the current books in the Practical Machine Learning series? Add your thoughts in the comments section below.