Atlanta Hadoop Users Group April 2015 - Secure Machine Learning Using Synthetic Data
Tuesday, April 21, 2015
The Atlanta Hadoop Users Group is a group of Hadoop and cloud computing technologists, enthusiasts, and curious people who discuss emerging technologies, Hadoop, and related software development (HBase, Hypertable, Pig, etc.).


How Fake Data Can Solve Real Problems and Enhance Security

Ted Dunning

Open source is great, but to work it has to be developed in the open. Privacy is great, but things have to be kept private. So what happens when you find a bug in open source software that only shows up when you run it with private data? How do you file the bug? Build a test case? Work with collaborators? Or what happens when you have a killer machine learning system, but you can't prove its worth to potential customers because you aren't allowed to see their confidential data?

Often the best answer is to make fake data that looks real enough to exhibit all of the bug-making or algorithm-confounding properties of the real data. That is, you have to make fake data that seems so real that it fools the bug. To do that you may need some pretty elaborate sleight of math.

I will describe log-synth, an open-source program for generating realistic fake data. Log-synth can make up names and addresses, or sample from realistically perverse numerical distributions. Using it, you can build data sets that join cleanly but have long-tailed frequency distributions, and you can build fairly realistic session histories. And if log-synth won't do what you need, it is very easy to extend. I will also describe physics-based approaches for emulating sensor data.

The first use of log-synth was to demonstrate a bug in Hive in which joining 10 billion facts against 32 dimensions caused the query optimizer to fail. I will describe what happened and how we found and fixed that bug by generating fake (but realistic) data to take the place of the customer's highly confidential dataset. In another case, log-synth's emulation of a merchant compromise scenario allowed open source development and testing of an algorithm that later worked without change on live data. And which, by the way, found some bad guys.

The audience for this talk will learn how to use this open source tool and innovative approach to solve a problem that arises in a wide variety of use cases.
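To give a flavor of how log-synth is driven, here is a minimal schema sketch in the JSON style used by the project. The specific sampler classes shown ("id", "name", "address", "date") are based on the samplers documented in the log-synth README; treat the exact names and options as assumptions to be checked against the project's current documentation rather than as verbatim material from this talk.

```json
[
    {"name": "id", "class": "id"},
    {"name": "customer", "class": "name"},
    {"name": "address", "class": "address"},
    {"name": "first_visit", "class": "date", "format": "MM/dd/yyyy"}
]
```

Given a schema file like this, the tool is invoked from the command line with options along the lines of `-count` (number of records), `-schema` (the schema file), and `-format` (e.g. JSON or CSV); again, check the README for the exact invocation.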


Ted Dunning

Ted Dunning is Chief Application Architect at MapR Technologies and a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. He has been very active in mentoring new Apache projects and currently serves as Vice President of Incubation for the Apache Software Foundation. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems, and he built fraud detection systems for ID Analytics (later purchased by LifeLock). He has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he's not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.