Banks are among the many businesses taking advantage of big data and IoT opportunities, including mobile payments, online banking, and smart kiosks, but the huge quantities of personally sensitive data these activities generate must be protected at every stage. Big data security – especially for data in motion – is a major challenge because fraudsters keep inventing new ways to attack valuable data. To stay ahead of the bad guys, organizations need to innovate constantly, designing and deploying new types of large-scale predictive models. And banks are not the only businesses facing this challenge – health care companies, insurance companies, and indeed any merchant that handles customers' personally identifiable information (PII) must keep up as well.
The good news is that there are many highly experienced machine learning experts and many new techniques and tools to support effective analytical models that identify potentially fraudulent transactions, identity theft, and phishing attacks. But not every organization has its own resident team of machine learning experts to do this work in house. It may be necessary to bring in outside experts to help build agile, effective models to catch the fraudsters. Yet you may not want to expose your most sensitive data, and anonymizing PII with guaranteed safety is very hard to do. How, then, can you take advantage of the help of expert outsiders without putting your most sensitive data at risk?
Figure 1: The difficulty in consulting help from outside experts to work with very sensitive data. (Image © 2015 Friedman and Dunning, used with permission).
At a recent Big Data Everywhere conference in New York City, MapR Chief Application Architect Ted Dunning described a powerful new method to do just this. He explained a technique used to break open a dramatic fraud case involving a common point of compromise, a technique that generalizes to a wide variety of other situations. Here’s how it works.
Finding the Compromised Merchant
One of the newer trends among fraudsters is to execute many small fraudulent transactions using stolen personal information from hundreds of thousands of customers. The result is to steal millions of dollars (or pounds or euros) in a very short time while flying under the radar. Fraudsters do this by gaining access to personal financial information for a huge number of customers through a compromised merchant or website. Rather than stealing a credit card and using it for big purchases – behavior likely to be detected by current security software – the fraudsters can set up fake businesses and execute many small purchases using the stolen card numbers. They count on these purchases going mostly unnoticed or ignored by the people whose accounts are being abused.
Faced with the challenge of detecting and stopping potential fraud resulting from a compromised merchant, a large financial institution that is a MapR customer consulted with MapR for help in building new models to detect this type of distributed attack. Their goals were to improve their fraud detection capabilities in order to (a) detect more suspect events and (b) detect them faster, thus limiting risk by closing affected accounts before much money had been lost. An additional benefit of this approach would be a trail of clues leading back to the merchant where the breach occurred.
The bank had large-scale behavioral data for merchant transactions at the level of individual users. Ted’s approach was to convert the transaction data into a timeline for each customer and then look at which merchants were involved before known fraudulent events occurred. That way he could estimate the relative likelihood that each merchant was the common point of compromise, assigning each a breach score. The problem was that the bank could not share this sensitive data with an outsider, even for this purpose.
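The timeline idea can be sketched in a few lines of code. The sketch below is illustrative only: the function and variable names (`breach_scores`, `timelines`, `fraud_dates`) are hypothetical, and the scoring rule shown here – the fraction of a merchant's traffic that falls in a window shortly before a known fraud event – is a simplified stand-in for whatever statistical scoring the actual model used.

```python
from collections import Counter
from datetime import datetime, timedelta

def breach_scores(timelines, fraud_dates, window_days=90):
    """Score each merchant by how often it appears in customer timelines
    shortly before a known fraud event, relative to its overall traffic.

    timelines:   dict of customer_id -> list of (datetime, merchant_id)
    fraud_dates: dict of customer_id -> datetime of first known fraud
    """
    pre_fraud = Counter()  # merchant visits inside a pre-fraud window
    overall = Counter()    # merchant visits across all customers
    for cust, events in timelines.items():
        fraud_at = fraud_dates.get(cust)
        for when, merchant in events:
            overall[merchant] += 1
            # Count visits that happened up to window_days BEFORE the fraud.
            if fraud_at and timedelta(0) <= fraud_at - when <= timedelta(days=window_days):
                pre_fraud[merchant] += 1
    # Simple ratio score; a production model would use a proper
    # statistical test rather than a raw fraction.
    return {m: pre_fraud[m] / overall[m] for m in overall}
```

A merchant that nearly every fraud victim visited shortly before their card was abused will score near 1.0, while merchants with ordinary traffic score much lower.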
To overcome this problem, Ted wrote customized extensions to log-synth, an open source data generation program he had previously developed. The extensions enabled log-synth to generate fake user histories with invented merchants in order to carry out common-point-of-compromise simulations. In experiments on the simulated data, the compromised merchant stood out against the background noise with a very high breach score.
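The essence of such a simulation is simple to sketch. Note that log-synth itself is a Java tool driven by a schema; the Python sketch below only mimics the idea and none of its names or parameters (`simulate_histories`, `exploit_prob`, and so on) come from the actual log-synth extensions.

```python
import random

def simulate_histories(n_customers=1000, n_merchants=50,
                       compromised=7, exploit_prob=0.3, seed=42):
    """Generate fake per-customer merchant histories in the spirit of a
    common-point-of-compromise simulation: customers shop at random
    merchants, and anyone who visits the compromised merchant later
    suffers fraud with some probability.

    Returns (histories, fraud_victims) where histories maps
    customer_id -> list of merchant_ids visited, and fraud_victims is
    the set of customers whose accounts were abused.
    """
    rng = random.Random(seed)
    histories, fraud_victims = {}, set()
    for cust in range(n_customers):
        visits = [rng.randrange(n_merchants) for _ in range(rng.randint(5, 30))]
        histories[cust] = visits
        if compromised in visits and rng.random() < exploit_prob:
            fraud_victims.add(cust)
    return histories, fraud_victims
```

Because the simulator knows exactly which merchant was compromised, it provides ground truth for tuning a detection model before that model ever touches real data.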
Once the model had been built and tuned, it was passed to the customer to run inside their security perimeter on real transactional data. The result of this real-world analysis was dramatic: as with the simulated data, one merchant in particular stood out against the rest, with a breach score over 80, as shown in Figure 2. The bank checked with the Secret Service, which determined that the merchant was indeed the source of a massive data breach.
Figure 2: A model for fraud detection developed and tuned using simulated data was then applied to real data with dramatic results. The financial institution doing this work had the advantage of help from outside experts without having to expose their sensitive data. (Image © 2015 Ted Dunning, used with permission).
A Better Method for Data Simulation
The use of synthetic data is not a new idea, although its advantages are often overlooked. The method Ted described at Big Data Everywhere NYC involves a new twist that makes it much easier to generate data appropriate for simulating a real-world situation. Ted found that instead of trying to match the characteristics of real data exactly – a particularly hard thing to do – it was only necessary to match key performance parameters between the fake and real data. The parameters must, of course, be well chosen to be realistic and important, but this twist makes the approach both powerful and easy.
Even better: this method works in cases beyond fraud detection. For details on how to work with the open source log-synth as well as other real-world use cases, read the new short book from Ted Dunning and Ellen Friedman, Sharing Big Data Safely: Managing Data Security. It’s available as a free download from MapR.