Hadoop Proved an Early Big Data Win for Morgan Stanley

Editor's Note: This entry has been updated since its original publication date of December 15, 2015.

Of the big data techniques, Hadoop-based approaches were among the first to be widely recognized and used, but Hadoop is just one part of modern big data solutions. Evolving technologies offer a wide range of capabilities, including distributed file storage, NoSQL databases, data stream transport and stream processing, search, SQL-on-big-data, machine learning, and more. The potential benefit of big data approaches is already clear: they have been delivering practical value for some time now.

To understand how Hadoop and other big data technologies provide value, it’s always useful to hear about real-world experiences that take us beyond the theoretical. That’s what came to light about Morgan Stanley security in an interview with Erwan Le Doeuff, then VP of Information Technology, Risk, and Security for Morgan Stanley in New York City. Was big data paying off? The answer was a resounding “yes.”

What is Morgan Stanley’s big data and Hadoop story?

The story starts with the security product team responsible for custom development and integration, formerly headed by Erwan Le Doeuff. He also worked with the security data repository at Morgan Stanley that runs on a Hadoop-based big data platform. The group started out by using the distributed storage layer of this big data platform as an archiving system so they could retain a large volume of structured and unstructured data in an efficient way.

Not only was there a need to persist large amounts of data such as event logs in this archive, but the team also needed to have convenient access to the data. With this need in mind, the system was built with efficient search capability. And given the sensitive nature of the data, strong security controls were in place from the start. Security included audit capabilities, being able to detect tampering, and being aware of who was looking at what.
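One common way to make an audit trail tamper-evident is to chain each entry to the hash of the one before it, so that altering any stored record breaks every later link. The following is a minimal sketch of that general idea, not a description of how Morgan Stanley's system worked. The record fields (user, action, resource), the function names, and the all-zero starting hash are hypothetical choices made for illustration.

```python
import hashlib
import json

# Hypothetical convention: the first entry chains to a fixed all-zero hash.
GENESIS = "0" * 64


def entry_hash(prev_hash, record):
    """Hash a record together with the hash of the previous entry."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append_entry(log, user, action, resource):
    """Record who did what to which resource, chained to the prior entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"user": user, "action": action, "resource": resource}
    log.append({"record": record, "hash": entry_hash(prev_hash, record)})


def first_tampered_index(log):
    """Return the index of the first entry whose hash no longer matches,
    or None if the whole chain verifies."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        if entry_hash(prev_hash, entry["record"]) != entry["hash"]:
            return i
        prev_hash = entry["hash"]
    return None
```

A verifier only needs to replay the chain from the start. In a real deployment the latest hash would also have to be stored somewhere an attacker cannot rewrite, which this sketch does not cover.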

Once the data repository was successfully established, other groups at Morgan Stanley began to see ways in which their work could benefit from using it. Over time, the uses for this system expanded to include more sophisticated computations and analytics including extracting and analyzing data for reports. A convenient aspect of the big data platform they selected is that operational applications, analytics, and data storage can all be carried out on the same cluster. Soon the team moved forward to making use of machine learning approaches as well. 

Figure: Big data use case evolution. When you first begin to store data from a variety of sources on the distributed storage layer of the big data platform, you may not yet know all the ways in which different groups will want to use that data. That’s one of the beauties of this big data system: the flexibility lets you expand uses easily. 

Big wins with Hadoop ecosystem tools on a big data platform

Win #1: Archive more data for longer term

There was an immediate win for Morgan Stanley based on Hadoop’s ability to scale, particularly in a cost-effective way. They needed not only to collect large amounts of data but also to save it for longer periods. This approach set them up to handle their current projects and to be ready for future needs. One reason to keep data longer is the advantage that big data offers in fighting cyber threats.

With a large volume of data from many sources and the option to look back over a period of three to five years, sophisticated anomaly detection models can identify both known and new cyber threats. This type of use case is not unique to Morgan Stanley. In an eSecurity Planet article titled “Using Hadoop to Reduce Security Risks,” MapR’s Chief Application Architect, Ted Dunning, explained how big data technologies improve an organization’s ability to detect attacks and reduce risk.
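To give a feel for what anomaly detection over historical data means in practice, here is a deliberately simple sketch that flags days whose event count sits far from the historical median. It is an illustration only and far simpler than the models the article refers to. The modified z-score technique, the threshold of 3.5, the function name, and the sample counts are all assumptions chosen for the example.

```python
from statistics import median


def flag_anomalies(counts, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a few extreme values do not distort the baseline.
    """
    med = median(counts)
    deviations = [abs(c - med) for c in counts]
    mad = median(deviations)
    if mad == 0:
        # No spread in the history, so this method cannot rank outliers.
        return []
    return [
        i for i, dev in enumerate(deviations)
        if 0.6745 * dev / mad > threshold
    ]


# Hypothetical daily counts of failed logins; the last day spikes.
daily_failed_logins = [100, 102, 98, 101, 99, 103, 97, 500]
suspicious_days = flag_anomalies(daily_failed_logins)
```

A longer history gives a steadier baseline, which is one reason a retention window of several years is useful for this kind of analysis.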

The benefit of efficient scaling is growing rapidly, both as the amount of security-related data grows and as a fresh range of cyber attacks continues to appear. Erwan reported that the amount of data of interest that is generated may increase by up to ten times in just two years. With this increasing pressure from data volume, it’s important to be able to scale out easily, quickly, and in a cost-efficient way.
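A tenfold increase over two years works out to roughly a 3.16x increase per year if the growth compounds steadily. The short sketch below shows that arithmetic. The function names and the 50 TB starting volume are hypothetical, the tenfold figure is the upper bound quoted above, and extending the same rate beyond two years is an assumption rather than something reported in the interview.

```python
import math


def annual_growth_factor(total_factor=10.0, period_years=2.0):
    """Per-year multiplier implied by total_factor growth over period_years."""
    return total_factor ** (1.0 / period_years)


def projected_volume(initial_tb, years, total_factor=10.0, period_years=2.0):
    """Projected volume in TB after `years` of steady compound growth."""
    return initial_tb * total_factor ** (years / period_years)


yearly = annual_growth_factor()          # square root of 10
after_two = projected_volume(50.0, 2.0)  # hypothetical 50 TB starting point
after_four = projected_volume(50.0, 4.0)
```

Under these assumptions, a 50 TB repository would reach about 500 TB after two years and about 5,000 TB after four, which is why being able to add nodes incrementally matters.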

Win #2: Multiple data centers

In addition to scalability, it’s also important for the data platform to support reliable work with high availability across multiple data centers, particularly when working with highly valuable data. This ensures that the data will always be available even if something happens to one center. The platform Morgan Stanley chose was well designed to meet these requirements.

Win #3: Flexibility

One of the wins that Erwan reported is the flexibility that big data can offer. This flexibility benefit shows up in several different ways. In terms of sizing the cluster, it is not necessary to know in advance exactly what size system you will need. It’s easy to scale as the need arises. Erwan explained that they started with a four-node cluster and expanded to twenty-four nodes for the security projects he led.

Flexibility is also a benefit in not having to know exactly what you will do with data in future applications. At Morgan Stanley they collected security data that needed very little data modeling before it was stored in a reliable way on the big data platform. With good options for search and extraction, the data could subsequently be used in a variety of ways.

This flexibility also paid off because their work was a mix of known roadmaps and new situations that cannot be defined in advance, which is natural when working with security and threat prevention. You don’t always know ahead of time what the bad guys are planning, so you need to be able to respond to threats quickly and in creative ways.

When is the right time to start?

The Morgan Stanley story reflects the experience of many other organizations that started early with Hadoop. Their ways of using big data and emerging technologies evolved, but as Erwan said, “…you have to start somewhere.” There is an advantage to getting started—you begin to build big data experience and to build your repository of critical data. That starts the clock on giving you the years of insight you may need for some situations. 

What is Morgan Stanley likely to encounter in the future? Most likely, this will include working with petabytes of data in real time, using a variety of big data ecosystem tools.

Additional Resources



Ebook: Getting Started with Apache Spark
Interested in Apache Spark? Experience our interactive ebook with real code, running in real time, to learn more about Spark.

Streaming Data Architecture: New Designs Using Apache Kafka and MapR Streams



