Achieving Data Integration Success with Hadoop

For years, organizations have struggled with the critical performance and scalability shortcomings of conventional data integration, leading many to push heavy data integration workloads down to the data warehouse. As a result, core data integration experienced a shift from extract, transform, and load (ETL) to extract, load, and transform (ELT).

Although this worked in the short term, it has also created a whole new set of problems for the IT organization with the onset of Big Data. Nightly processing extends far beyond its window, causing resource contention and delaying operational reporting. Data retention periods are drastically cut, leading to missed analytical opportunities and lost operational insights. Even worse, database costs are spiraling out of control merely to keep the lights on. The need to analyze more data from a more diverse set of sources in less time, while keeping costs reasonable, is straining existing data integration architectures.

Not only is ELT a Band-Aid solution that costs more and offers less, but it is also a hindrance to creating an effective data governance strategy. In a Big Data world, this is no small matter. Organizations’ success or failure will be determined by how well they choose technologies and solutions to address data integration, security, data lineage, and auditing throughout the data lifecycle.

Hadoop is a promising alternative

Many of our customers are turning to Hadoop to relieve the tension between the evolving needs of the business and the growing costs of IT infrastructure. Hadoop is not only economically feasible, but it also provides the required levels of performance and massive scalability. For these organizations, Hadoop is quickly showing its potential as the ideal data hub to store and archive all structured and unstructured data, which can then be processed directly on Hadoop and distributed to the rest of the IT infrastructure. By effectively offloading data and ELT workloads from the data warehouse into Hadoop, organizations can significantly reduce nightly batch processing, retain data for as long as they need, and free up significant data warehouse capacity.
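To make the idea concrete, here is a minimal sketch of what offloading a single warehouse table into Hadoop might look like using PySpark. It is an illustration only, not Syncsort’s implementation; the JDBC URL, credentials, table, column, and HDFS path names are hypothetical, and a production pipeline (or a tool such as DMX-h) would add incremental extracts and error handling.

```python
# Minimal sketch: extract a warehouse table over JDBC and land it in HDFS as Parquet.
# All connection details, table names, and paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-offload").getOrCreate()

# Pull a table (or the result of an extract query) out of the warehouse over JDBC.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/edw")  # placeholder warehouse
    .option("dbtable", "sales.orders")                           # placeholder table
    .option("user", "etl_user")
    .option("password", "***")
    .option("numPartitions", 8)                                  # parallelize the extract
    .option("partitionColumn", "order_id")
    .option("lowerBound", 1)
    .option("upperBound", 10000000)
    .load()
)

# Land the data in Hadoop as Parquet, partitioned by date, so downstream jobs
# can process it without touching the warehouse again.
orders.write.mode("overwrite").partitionBy("order_date").parquet(
    "hdfs:///data/archive/sales/orders"
)
```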

There may be some challenges along the way

Out of the box, Hadoop offers powerful utilities and massive horizontal scalability, but it does not provide the functionality users need to deliver enterprise ETL capabilities. Offloading data and ELT workloads to Hadoop forces users to find the right tools to close the functional gaps between enterprise ETL and Hadoop. Where do you begin? How do you know which workloads to move? Do you have all the tools necessary to access and move your data and processing? How do you keep up with the ever-evolving Hadoop ecosystem? How can you optimize processing once it’s inside Hadoop? These challenges can become intimidating very quickly.

Three-step approach to offloading ELT to Hadoop

Now, I don’t think the data warehouse is going away anytime soon. The goal of offloading is to free up database resources to reduce costs, improve query response time, and use premium database resources more wisely. To that end, our customers have addressed the challenges noted above and found success by following a three-step approach.

  1. Identify which workloads to offload: We have heard from partners and customers alike that, in many cases, “cold” data consumes significant premium storage in your data warehouse while adding zero value. Similarly, heavy transformations such as change data capture (CDC), slowly changing dimensions (SCD), ranking functions, volatile tables, multiple merges, joins, cursors, and unions can waste up to 80 percent of CPU, I/O, and storage resources. Syncsort has not only the tools, but also the expertise to guide you along the way.
  2. Move the data and replicate the existing ELT workloads into Hadoop: Hadoop is the ideal place to store large amounts of archive data and perform large batch operations, thanks to its low cost per terabyte and massive scalability. However, many organizations have few Hadoop-savvy developers, making it crucial to use familiar tools that are Hadoop-ready. Syncsort DMX-h can seamlessly translate ELT workloads into Hadoop, and with built-in Intelligent Execution, it will evolve along with the Hadoop ecosystem, taking advantage of new technologies as they mature. A simplified sketch of what such a translated workload can look like appears after this list.
  3. Secure, optimize and manage Hadoop: To have a truly enterprise-ready solution, data integration must comply with your data governance strategy. Once you’ve offloaded data and workloads from the data warehouse, you will need enterprise-grade tools to manage, secure, and operationalize the new environment. Look for solutions that support common security standards such as Kerberos and LDAP, as well as monitoring and management tools. Hadoop vendors such as MapR can help with your data governance strategy, and Syncsort will automatically optimize the processing workload while conforming to that strategy.
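As a concrete illustration of step 2, the sketch below shows how one common warehouse-hogging pattern from step 1, keeping only the latest change-data-capture record per key, might look once it has been replicated as a Hadoop job over the offloaded data. It is a hand-written PySpark example with hypothetical column and path names, not output from DMX-h, and it assumes the CDC extracts were landed as Parquet as in the earlier sketch.

```python
# Simplified sketch: deduplicate CDC records to keep only the newest version of
# each row, a windowed ranking that previously ran inside the warehouse.
# Column names and HDFS paths are placeholders.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdc-dedup-offload").getOrCreate()

# CDC extracts landed in Hadoop (see the offload sketch above).
changes = spark.read.parquet("hdfs:///data/archive/sales/orders_cdc")

# Rank each key's change records by update time and keep only the most recent.
latest = (
    changes.withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
        ),
    )
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Write the transformed result back to Hadoop, ready to be distributed downstream
# or loaded into the warehouse as a much smaller, final dataset.
latest.write.mode("overwrite").parquet("hdfs:///data/curated/sales/orders_current")
```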

The right tools help remove barriers

Syncsort provides targeted solutions to address the challenges of offloading data and workloads from the data warehouse to Hadoop. Our DMX-h enterprise software is deployed with the MapR Enterprise Data Hub to:

  • Connect to all data across an organization – not just the data warehouse, but also relational databases, files, CRM systems, web logs, social media, mainframes and legacy systems, new file formats such as JSON, etc.
  • Prepare data for analytics – cleanse, filter, reformat, and translate your data, and load it directly into Avro and Parquet without staging
  • Blend data for new insights – enrich your data with in-flight blending
  • Transform data with “design once/deploy anywhere” technology – visually design transformations once and deploy them to any platform or framework
  • Distribute data for the fastest path to insight – with the fastest load speeds for the EDW, or direct delivery to data visualization tools

Syncsort is a MapR-certified technology partner, with joint customers such as comScore and Experian achieving success with Hadoop. To learn more about Syncsort and our solutions, visit us at


