Data Science Machine Learning Forum
Saturday, November 7, 2015
The Science Association is a non-profit professional group that offers education, professional certification, a "Data Science Code of Professional Conduct" and conferences / meetups to discuss data science (e.g. predictive / prescriptive analytics, algorithm design and execution, applied machine learning, statistical modeling, and data visualization).


Deep Learning for High Performance Time-series Databases

Jim Bates

Recent developments in deep learning make it possible to improve time series databases. I will show how these methods work and how to implement them using Apache Mahout.

Systems such as the Open Time Series Database (OpenTSDB) make good use of the ability of HBase, MapR tables and related databases to store columns sparsely. This allows a single row to store many time samples and allows raw scans to retrieve a large number of samples very quickly for visualization or analysis. Typically, older data points are batched together and compressed to save space. At high insertion rates, this approach falters, largely because of the limited insert/update rate of HBase. In such situations, it is often better to collect short segments of data and insert batches that span short time ranges rather than inserting individual data points.

When inserting compressed batches in this fashion, there are a number of obvious strategies that can be used. General compression utilities such as gzip do not normally provide particularly high compression ratios. Custom-built compression systems may provide point solutions with high compression ratios, but they are generally fairly time-intensive to develop. I will describe how deep learning and sparse-coding techniques can be used to build systems that have very high compression levels (50x or more is typical) and that have the very interesting property that the resulting compressed data can often be queried or analyzed directly without ever decompressing the data. Moreover, it is possible to selectively decompress signals only from desired time ranges within a compressed batch.

These new techniques for building time series databases enable some exciting capabilities. The benefits include the ability to do query push-down into the time-series database from systems like Apache Drill, better visualization systems, and the ability to build an interesting form of anomaly detector on top of the time-series database.

I will describe how to build these systems using Apache Mahout and illustrate the results with several real examples.
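
As a rough illustration of two properties mentioned in the abstract (querying compressed data directly, and decompressing only a chosen time range within a batch), here is a minimal Python sketch. It is not the method presented in the talk: it swaps the learned deep-learning or sparse-coding dictionary for a fixed cosine (DCT-II) basis, uses plain Python rather than Apache Mahout, and runs on a synthetic batch constructed to be exactly sparse in that basis. All function and variable names are hypothetical.

```python
import math


def dct_basis(n):
    """Build an orthonormal DCT-II basis; basis[k][t] is atom k at sample t.

    Assumption: a fixed cosine dictionary stands in for a learned one.
    """
    basis = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append(
            [scale * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)]
        )
    return basis


def encode(samples, basis, keep):
    """Project the batch onto every atom and keep the `keep` largest coefficients."""
    coeffs = [sum(a * x for a, x in zip(atom, samples)) for atom in basis]
    top = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    return {k: coeffs[k] for k in top[:keep]}


def decode_range(code, basis, start, stop):
    """Reconstruct only samples[start:stop]; the rest of the batch is never rebuilt."""
    return [
        sum(c * basis[k][t] for k, c in code.items()) for t in range(start, stop)
    ]


def mean_from_code(code, n):
    """Read the batch mean off the constant atom without reconstructing anything."""
    return code.get(0, 0.0) / math.sqrt(n)


# Synthetic batch of 64 samples: a constant level plus two cosine components.
n = 64
basis = dct_basis(n)
samples = [
    20.0
    + 3.0 * math.cos(math.pi * (t + 0.5) * 3 / n)
    + 1.0 * math.cos(math.pi * (t + 0.5) * 7 / n)
    for t in range(n)
]

code = encode(samples, basis, keep=3)           # 64 samples stored as 3 coefficients
batch_mean = mean_from_code(code, n)            # query answered on the compressed form
window = decode_range(code, basis, 10, 20)      # selective decompression of one range
```

Because this coding is linear, the batch mean comes straight from the coefficient of the constant atom, and any sub-range can be rebuilt without touching the rest of the batch. The 64-to-3 reduction is an artifact of the synthetic signal and says nothing about the compression levels quoted in the abstract.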


Jim Bates

Jim brings over 12 years of experience in systems and support engineering to his role as a Senior Systems Engineer at MapR. Jim was previously a Senior Systems Engineer for WANdisco. Prior to that, Jim was a Senior Systems Engineer for Spirent Communications, a multinational telecommunications test company. Jim also held various roles in systems engineering at Mu Dynamics. Earlier in his career, Jim held engineering positions at Tellabs and Advanced Fibre Communications. He began his career as an Officer in the US Army, where he designed and implemented mobile routing networks and jumped out of airplanes with Cisco routers to get the internet working over single channel tactical satellite networks. In addition to his technical roles, Jim has worked as a house framer, a fireman, a janitor, an infantryman, and a paratrooper. His very first job was that of a farmer and rancher, working alongside his father, a time during which he learned his most important life lessons. When not working for MapR, Jim enjoys carpentry work, welding, kayaking, hiking, and spending time with his family.

Jim holds a BS in Electrical Engineering from Texas A&M University.