By using the MapR distribution for Hadoop, comScore can easily manage and significantly scale its Hadoop cluster, create more files faster, process more data faster, and achieve better streaming and random I/O results than with other Hadoop distributions. It does so confidently, knowing it can count on data protection and disaster recovery functions when needed.
comScore is a global leader in digital media analytics and the preferred
source of digital marketing intelligence. comScore provides syndicated and
custom solutions in online audience measurement, e-commerce, advertising,
search, video and mobile. Advertising agencies, publishers, marketers
and financial analysts rely on comScore for the industry-leading solutions
needed to craft successful digital, marketing, sales, product development
and trading strategies.
comScore ingests over 20 terabytes of new data on a daily basis. In order to keep up with this data, comScore uses Hadoop to process over 1.7 trillion Internet and mobile events every month. The Hadoop jobs are run every hour, day, week, month and quarter, and once they’re done, data is normalized against the comScore URL data dictionary and then batch loaded into a relational database for analysis and reporting. comScore clients and analysts generate reports from this data; these reports enable comScore clients to gain behavioral insights into their mobile and online customer base.
The comScore engineering team processes a wide variety of Hadoop workloads and requires a Hadoop distribution that excels across multiple areas:
Performance
As comScore continues to expand, the Hadoop cluster needs to maintain performance integrity, deliver insights faster, and produce more with less to minimize costs.
Availability
comScore needs a Hadoop platform that provides data protection and high availability as the cluster grows in size.
Scalability
comScore’s Hadoop cluster has grown to process over 1.7 trillion events a month from across the world. In the past, comScore has seen month-over-month increases of over 100 billion events. Consequently, comScore needs a Hadoop platform that will enable it to maintain performance, ease of use and business continuity as it continues to scale.
Ease of Use
comScore needs things to just work, and operating the cluster at scale needs to be easy and intuitive.
MapR has been in continuous use at comScore for over two years. MapR has demonstrated superior performance, availability, scalability, ease of use, and significant cost savings over other distributions.
Performance
Across various benchmarks, MapR executes jobs 3 to 5 times faster than other Hadoop distributions and requires substantially less hardware.
Availability
MapR protects against cluster failures and data loss with its distributed NameNode and JobTracker HA. Rolling upgrades are also now possible with MapR.
Scalability
With architectural changes made possible by its no-NameNode architecture, MapR creates more files faster, processes more data faster, and produces better streaming and random I/O results than other distributions. comScore now runs more than 20,000 jobs each day on its production MapR cluster.
Ease of Use
comScore’s Vice President of Engineering, Will Duckworth, said, “With MapR, things that should just work, just work.” This means there is a lot less for comScore to manage with MapR. One of the advantages that Duckworth cites is that everything is a data node, which from his perspective results in much better hardware utilization. MapR also makes it easy to install and manage the cluster and to get data in and out of it.
comScore is also able to use the MapR advanced capabilities to enforce parallel data allocation patterns. This enables key analyses to be performed using map-side merge-joins that have guaranteed data locality, resulting in a 10x increase in computation speed. “The specific features of MapR, such as volumes, mirroring and snapshots, have allowed us to iterate much faster,” said Michael Brown, CTO of comScore.
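To illustrate why guaranteed data locality matters for a merge-join, here is a minimal Python sketch of the general technique, not comScore's or MapR's actual implementation. It assumes both inputs are already sorted by key and partitioned the same way, so each pair of matching partitions can be joined in a single streaming pass with no shuffle or re-sort. The function names `merge_join` and `join_partitions` and the sample records are hypothetical.

```python
from typing import Any, Iterable, Iterator, List, Tuple

Row = Tuple[Any, Any]  # (key, value); hypothetical record shape


def merge_join(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Tuple[Any, Any, Any]]:
    """Inner-join two inputs that are ALREADY sorted by key (assumption)."""
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_it, None)      # left key too small: advance left
        elif l[0] > r[0]:
            r = next(right_it, None)     # right key too small: advance right
        else:
            key = l[0]
            # Buffer every right-side value that shares this key.
            right_group: List[Any] = []
            while r is not None and r[0] == key:
                right_group.append(r[1])
                r = next(right_it, None)
            # Pair each left-side row for this key with the buffered values.
            while l is not None and l[0] == key:
                for right_value in right_group:
                    yield key, l[1], right_value
                l = next(left_it, None)


def join_partitions(left_parts, right_parts):
    """Join co-partitioned inputs: partition i on the left only ever needs
    partition i on the right, so each pair can be processed independently."""
    for left_part, right_part in zip(left_parts, right_parts):
        yield from merge_join(left_part, right_part)


if __name__ == "__main__":
    # Hypothetical sample data: two partitions per side, split on the same key ranges.
    visits = [[(1, "home"), (2, "search"), (2, "video")], [(7, "shop")]]
    users = [[(2, "US"), (3, "DE")], [(7, "UK"), (7, "FR")]]
    for row in join_partitions(visits, users):
        print(row)
```

Because each partition pair is consumed in one forward pass, the join work can happen entirely on the map side when matching partitions are stored on the same node, which is the property the parallel data allocation patterns are described as enforcing.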