Mason is the vice president for Technology Research at IRI, a 30-year-old Chicago-based company that provides information, analytics, business intelligence and domain expertise for the world’s leading CPG, retail and healthcare companies.
“I’ve always had a love of mathematics and proved to be a natural when it came to computer science,” Mason says. “So I combined both disciplines and it has been my interest ever since. I joined IRI 20 years ago to work with Big Data (although it wasn’t called that back then). Today I head up a group that is responsible for forward looking research into tools and systems for processing, analyzing and managing massive amounts of data. Our mission is two-fold: keep technology costs as low as possible while providing our clients with the state-of-the-art analytic and intelligence tools they need to drive their insights.”
Recent challenges facing Mason and his team included a mix of business and technological issues. They wanted to realize significant cost reductions by reducing mainframe load, and to reduce a mainframe support risk that was growing because key mainframe support personnel were nearing retirement. At the same time, they wanted to build the foundations for a more cost-effective, flexible and expandable data processing and storage environment.
The technical problem was equally challenging. The team wanted to achieve random extraction rates averaging 600,000 records per second, peaking at over one million records per second, from a 15 TB fact table. This table feeds a large multi-TB downstream client-facing reporting farm. Given IRI’s emphasis on economy, the solution had to be very efficient, using only 16 to 24 nodes.
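To put those targets in per-node terms, the arithmetic can be sketched as follows. This is only an illustration that assumes the load is spread evenly across nodes, which the article does not state; the function name `per_node_rate` is made up for the example.

```python
def per_node_rate(total_records_per_sec: int, nodes: int) -> float:
    """Records per second each node must serve, assuming an even spread."""
    return total_records_per_sec / nodes


# Targets quoted in the article: 600,000 records/s on average,
# one million records/s at peak, on a cluster of 16 to 24 nodes.
targets = {"average": 600_000, "peak": 1_000_000}

for label, rate in targets.items():
    for nodes in (16, 24):
        share = per_node_rate(rate, nodes)
        print(f"{label}: {share:,.0f} records/s per node on {nodes} nodes")
```

On the smallest cluster considered, the peak target works out to 62,500 records per second per node under this even-spread assumption.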
“We looked at traditional warehouse technologies, but Hadoop was by far the most cost effective solution,” Mason says. “Within Hadoop we investigated all the main distributions and various hardware options before settling on MapR on a Cisco UCS (Unified Computing System) cluster.”
The fact table resides on the mainframe where it is updated and maintained daily. These functions are very complex and proved costly to migrate to the cluster. However, the extraction process, which represents the majority of the current mainframe load, is relatively simple, Mason says.
“The solution was to keep the update and maintenance processes on the mainframe and maintain a synchronized copy on the Hadoop cluster by using our mainframe change logging process,” he notes. “All extraction processes go against the Hadoop cluster, significantly reducing the mainframe load. This met our objective of maximum performance with minimal new development.”
The team chose MapR to maximize file system performance, facilitate the use of a large number of smaller files, and take full advantage of its NFS capability so files could be sent via FTP from the mainframe directly to the cluster.
They also gave their system a real workout. Recalls Mason, “To maximize efficiency we had to see how far we could push the hardware and software before it broke. After several months of pushing the system to its limits, we weeded out several issues, including a bad disk, a bad node, and incorrect OS, network and driver settings. We worked closely with our vendors to root out and correct these issues.”
Overall, he says, the development took about six months followed by two months of final testing and running in parallel with the regular production processes. He also stressed that “Much kudos go to the IRI engineering team and Zaloni consulting team who worked together to implement all the minute details needed to create the current fully functional production system in only six months.”
To accomplish their ambitious goals, the team took some unique approaches. For instance, the methods they used to organize the data and structure the extraction process allowed them to achieve extraction rates of between two million and three million records per second on a 16-node cluster.
They also developed a way to always have a consistent view of the data used in the extraction process while continuously updating it.
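The article does not describe how IRI achieved this. One common way to give readers a consistent view during continuous updates is to publish immutable snapshots and swap a single reference when a new one is ready. The sketch below illustrates that general idea in memory only; the class `SnapshotStore` and its methods are hypothetical, and copying the whole dataset on each update would not be practical at the 15 TB scale described here.

```python
import threading
from types import MappingProxyType


class SnapshotStore:
    """Hypothetical in-memory store: readers see immutable snapshots."""

    def __init__(self, initial):
        # Read-only view over a private copy of the starting data.
        self._current = MappingProxyType(dict(initial))
        self._lock = threading.Lock()  # serializes writers only

    def snapshot(self):
        """Return the current snapshot; it never changes after publication."""
        return self._current

    def apply_changes(self, upserts, deletes=()):
        """Build a new snapshot from a batch of changes, then publish it."""
        with self._lock:
            updated = dict(self._current)   # copy, so old views stay intact
            updated.update(upserts)
            for key in deletes:
                updated.pop(key, None)
            # Rebinding one reference is the publication step.
            self._current = MappingProxyType(updated)


store = SnapshotStore({"sku-1": 10})
view = store.snapshot()            # an extraction starts here
store.apply_changes({"sku-1": 12, "sku-2": 5})
print(view["sku-1"], store.snapshot()["sku-1"])
```

An extraction that took its snapshot before the update keeps reading the old values, while any extraction that starts afterwards sees the new ones.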
One of the most effective additions to the IRI IT infrastructure was the implementation of Hadoop. Before Hadoop, the technology team relied on the mainframe running 24×7 to process the data in accordance with their customers’ tight timelines. With Hadoop, they have been able to speed up the process while reducing mainframe load. The result: annual savings of more than $1.5 million.
Says Mason, “Hadoop is not only saving us money, it also provides a flexible platform that can easily scale to meet future corporate growth. We can do a lot more in terms of offering our customers unique analytic insights – the Hadoop platform and all its supporting tools allow us to work with large datasets in a highly parallel manner.
“IRI specialized in Big Data before the term became popular – this is not new to us,” he concludes. “Big Data has been our business now for more than 30 years. Our objective is to continue to find ways to collect, process and manage Big Data efficiently so we can provide our clients with leading insights to drive their business growth.”
And finally, when asked what advice he might have for others who would like to become Big Data All Stars, Mason is very clear: “Find and implement efficient and innovative ways to solve critical Big Data processing and management problems that result in tangible value to the company.”
Originally published in Datanami:
Trevor Mason and Big Data: Doing What Comes Naturally
October 20, 2014