Kenshoo optimized their data architecture and added new capabilities by integrating the MapR Distribution into their RDBMS-based installations. As a result, they gained multiple benefits from their MapR solution, including new service capabilities, increased performance, better reliability, and cost savings.
Kenshoo provides digital marketing software and predictive media optimization
technology for search marketing, social media and online advertising. Kenshoo
powers campaigns in more than 190 countries for nearly half of the Fortune 50
and all 10 top global ad agency networks.
Companies, agencies and developers direct more than $200 billion in annualized client sales revenue through the platform, which includes products like Kenshoo Search, Kenshoo Social, Kenshoo Local, Kenshoo SmartPath, and Kenshoo Halogen. Kenshoo clients include Expedia, Facebook, KAYAK, Omnicom Media Group, Sears, Starcom MediaVest Group, Tesco, Travelocity, Walgreens, and Zappos. Kenshoo has more than 24 international locations.
Kenshoo was managing over 100 TB of data and risked overwhelming their production systems whenever they ran the large analytics queries their business needed.
“We’re in a very competitive market. A lot of companies are working with machine learning and predictive models. We want to take it to the next level for our clients, to give them flexibility to run more sophisticated reports faster,” says Noam Hasson, team leader for Big Data at Kenshoo. “With the RDBMS we had in place, we were limited with the kind of queries we could run and the scaling costs were too expensive.”
Kenshoo had several goals for integrating Hadoop into their architecture. They wanted to let clients run heavy analytics queries without impacting production server performance, and to lower the response time for client analytics requests, especially the very large queries. For clients running campaigns on multiple installations, they wanted to provide an easier way to run comprehensive queries spanning those installations. On the Kenshoo side, they wanted to easily compare information across all installations and identify which information was fed where, so they could continue optimizing their system.
Hasson explains that they were able to use Apache Sqoop and Apache Hive to
help integrate the MapR Distribution including Hadoop into their company in
less than a week, without needing to write a single line of code.
Kenshoo’s solution was to use Sqoop to run a full table import into Hadoop, without complicated ETL tools, and then run Hive queries. This was easy because Sqoop imports the data in a Hive-friendly format and even loads it into Hive’s metastore. One of Hive’s biggest advantages is that it can manage the metastore for all of the data contained in Hadoop, which means that even plain MapReduce jobs run against an organized system of databases and tables.
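An import of this kind can be sketched as a single Sqoop command. The connection string, credentials, table, and database names below are illustrative placeholders, not Kenshoo's actual values:

```shell
# Full-table import straight into Hive: Sqoop reads the schema over JDBC,
# writes the rows in a Hive-friendly text format, and registers the new
# table in the Hive metastore.
sqoop import \
  --connect jdbc:mysql://prod-db.example.com/kenshoo \
  --username importer -P \
  --table campaign_events \
  --hive-import \
  --hive-database analytics \
  --create-hive-table
```

Once the import finishes, the table is immediately queryable from Hive with no further ETL step.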
Kenshoo saw Sqoop’s biggest advantage to be that it issues very fast queries that barely slow down the servers, since it splits the import on the table’s primary key. It also reads in bulk, which greatly improves I/O speed. In addition, it works over JDBC and doesn’t require structural changes, plug-in installation, NFS permissions, or copying. There was no added complexity; it was a very quick, simple solution for them. At the same time, they had the flexibility to easily adjust the system to use more resources to load faster, or to use fewer resources when the system was otherwise busy.
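That resource trade-off maps to two standard Sqoop flags: the split column and the number of parallel map tasks. A hedged sketch, reusing the placeholder names from above:

```shell
# Quiet window on the database: parallelize across 16 mappers,
# splitting the key range on the primary key column.
sqoop import \
  --connect jdbc:mysql://prod-db.example.com/kenshoo \
  --username importer -P \
  --table campaign_events \
  --split-by id \
  --num-mappers 16 \
  --hive-import

# Production is busy: throttle the same import down to a single mapper.
sqoop import \
  --connect jdbc:mysql://prod-db.example.com/kenshoo \
  --username importer -P \
  --table campaign_events \
  --num-mappers 1 \
  --hive-import
```

Because each mapper issues bounded-range queries against the indexed primary key, the load on the source RDBMS scales roughly with the mapper count.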
For some initial performance testing, Kenshoo used a single node with 2 CPUs, 12 cores, 32 GB of memory, and twelve 4-TB hard drives. They were able to import data from a large table with 300 million rows in hours, without overloading any of their systems. They also ran a “select count” Hive query on 5.5 billion rows, which took only 90 minutes on that single node. Considering that they could not run the same query on 5.5 billion rows on their existing RDBMS at all, this was a big validation. Finally, they ran a query with a “group by” clause on that 5.5 billion-row table, which returned in 18 hours. Though they were happy with the numbers, this was all pre-optimization, which meant they could still tune the system and add more servers to the cluster for larger loads and faster responsiveness.
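The two validation queries were of the simplest HiveQL shape; they might have looked something like the following, where the database, table, and column names are assumptions for illustration:

```shell
# "select count" over the full 5.5-billion-row table.
hive -e "SELECT COUNT(*) FROM analytics.campaign_events;"

# The "group by" variant, aggregating the same table by a key column.
hive -e "SELECT account_id, COUNT(*)
         FROM analytics.campaign_events
         GROUP BY account_id;"
```

Hive compiles each statement into MapReduce jobs that run across the cluster, which is why queries that were impossible on the RDBMS complete here, and why adding nodes shortens them further.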
New service capabilities
Kenshoo is able to provide entirely new services to their clients. “There is a benefit to every department,” says Hasson. “We can support bigger customers, perform new kinds of algorithms, and run more complicated models. Our research department can start asking whole different types of questions. There is not a question too complicated when you start working with Hadoop. The value of Hadoop is priceless because now we can do things that were impossible before.”
Increased performance and cost savings
Kenshoo is seeing dramatic improvements in performance. “Some reports that used to take a few days can now be run in Hadoop in a few minutes. It’s 10-15 times faster than the traditional RDBMS,” says Hasson. “And we are spending a few thousand dollars compared to hundreds of thousands for other analytic products. We get very good results for very low cost.”
Aggregating data across installations
Hadoop helps Kenshoo aggregate data from many sources. And with the volumes feature unique to MapR, Kenshoo was able to create distinct logical partitions in Hadoop that separated the data from each installation. This helped them manage the imports, so they could efficiently do a full import for each server and simultaneously run queries on them without interfering with other workloads.
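Creating one MapR volume per installation is a one-line `maprcli` command per volume; the volume and path names below are illustrative, not Kenshoo's:

```shell
# One logical partition per installation, each with its own mount path.
maprcli volume create -name installation_eu1 -path /kenshoo/eu1
maprcli volume create -name installation_us1 -path /kenshoo/us1

# Each installation's Sqoop import can then land in its own volume:
sqoop import \
  --connect jdbc:mysql://eu1-db.example.com/kenshoo \
  --username importer -P \
  --table campaign_events \
  --target-dir /kenshoo/eu1/campaign_events
```

Because each volume is an independent unit for placement, quotas, and snapshots, a full import into one installation's volume does not contend with queries running against another's.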
Better reliability
Kenshoo can also feel far more secure about the reliability and availability of their data with their MapR solution. “Hive can work with very long queries, and unlike an RDBMS that might crash mid-run, it can keep on going,” says Hasson. “Another big advantage is that by using MapR, even if Hive does crash, it picks up where it left off.”