This was originally posted on The HIVE on May 12, 2014.
Recently I happened to observe martial arts agility training at my son’s Taekwondo school. The ability to move quickly, change direction and still be coordinated enough to throw an effective strike or kick is the key to many martial arts, including Taekwondo.
It is interesting to observe how agility applies to the various components of an enterprise data architecture stack and BI/analytics environment, especially in the new world of big data.
Business Intelligence and analytics tools have gone through a dramatic transformation in the past five years. The old environment was highly centralized and governed: a few IT production reports answering known questions were pushed out to a broad range of business users to help with strategic decisions. The original BI tools provided some analytic capabilities such as parameterized reports and OLAP/ad hoc query, but they were still fairly static. If business questions changed, users had to wait weeks, if not months, for their BI counterparts to provide modified or new reports. The new wave of tools such as QlikView, Tableau, and Spotfire made this analytics process far more agile. Gartner’s Magic Quadrant for BI/Analytics systems highlighted that, since 2013, more and more organizations have displaced traditional BI platforms and moved toward business user-driven data discovery, where users can iterate on the data as questions change or new ones arise, as long as the IT department makes the data available to them.
On the data management side, the status quo of relational databases has been challenged as well by new types of applications such as social, mobile, cloud, and sensors from the IoT (Internet of Things). These applications not only had to scale to accommodate more users and more data than traditional transactional applications, they also had to be very iterative in nature. Soon enough, organizations started to complement their RDBMS environments with Hadoop as well as NoSQL systems such as MongoDB and HBase. In addition to significant cost benefits at scale, these systems provided the agility needed to support fast-changing applications by storing data without having to define schemas upfront (i.e., application-driven schemas). It is important to note that data in Hadoop/NoSQL systems is often self-describing and/or semi-structured in nature (e.g., JSON, Parquet, and key-value formats).
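To make "self-describing" concrete, here is a minimal Python sketch. The event records and their field names are hypothetical, invented purely for illustration. Each JSON record carries its own field names, so records of different shapes can sit side by side and the structure can be discovered at read time rather than declared upfront.

```python
import json

# Hypothetical application events; every field name here is an assumption
# made for illustration, not taken from any real system.
raw_events = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "purchase", "items": [{"sku": "A1", "qty": 2}]}',
    '{"user": "carol", "action": "login", "device": {"os": "android"}}',
]


def discover_fields(lines):
    """Return the sorted union of top-level field names across all records."""
    fields = set()
    for line in lines:
        record = json.loads(line)  # each record describes its own structure
        fields.update(record.keys())
    return sorted(fields)


# No schema was declared before the data was written; it is read off the data.
fields = discover_fields(raw_events)
print(fields)
```

Note that the second and third records introduce a nested list and a nested object without any change to the storage layer, which is the kind of agility the application side gains.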
So, what is missing in today’s environment? Hasn’t the agility problem been solved? Why do we still hear organizations talk about reports/analytics that take weeks or months to complete? Let’s connect the dots. The BI/analytics tools and data acquisition processes have become agile, but the missing piece of the puzzle is the layer that glues them together: the ETL process is not agile enough and has not changed in the past few decades.
ETL (Extract-Transform-Load) is the well-defined process that takes source data coming from operational application databases, transforms it, aggregates it, and moves it into an Enterprise Data Warehouse (EDW), where the BI/analytics tools then leverage the data. The fact that ETL can take too long or is difficult to change is not a new problem, given the well-known ‘ALTER TABLE’ complexity and the number of steps involved in physically moving the data. However, the problem worsens with big data. The semi-structured/nested data often found in big data applications is hard to model with relational paradigms, and managing centralized EDW schemas when data is changing fast can quickly become an overhead issue. Additionally, moving data while it is flowing in real time can be a challenge. Even if organizations are willing to commit resources to manage this process, using ETL for non-repetitive/ad hoc queries and short-term data exploration (as opposed to questions or KPIs that companies want to measure and monitor repeatedly) may be unnecessary. SQL-on-Hadoop technologies can help with the ETL problem to some extent by not having to move the data to another system, but the bigger pain point still remains: how do you model and manage schemas for big data?
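The ‘ALTER TABLE’ friction can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a warehouse, which is a simplifying assumption: a real EDW change would also touch ETL jobs, downstream reports, and governance steps. The table and column names are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_name TEXT, event_type TEXT)")


def insert(connection, record):
    """Insert a dict as one row; fails if a key has no matching column."""
    cols = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    connection.execute(
        f"INSERT INTO events ({cols}) VALUES ({placeholders})",
        tuple(record.values()),
    )


insert(conn, {"user_name": "alice", "event_type": "login"})

# The source application starts emitting a new attribute.
new_record = {"user_name": "bob", "event_type": "login", "device_os": "android"}

try:
    insert(conn, new_record)
    rejected = False
except sqlite3.OperationalError:
    # The fixed schema has no device_os column, so the load fails.
    rejected = True

# Only after the schema is changed centrally can the new data be loaded.
conn.execute("ALTER TABLE events ADD COLUMN device_os TEXT")
insert(conn, new_record)

rows = conn.execute(
    "SELECT user_name, event_type, device_os FROM events ORDER BY user_name"
).fetchall()
print(rejected, rows)
```

In this toy case the fix is one statement. The point of the sketch is only that every new attribute forces a schema change before any data can land, and that step is what becomes an overhead when the data model changes frequently.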
Overall, a new way of thinking is needed to bring end-to-end agility to BI/analytics environments that utilize Hadoop/NoSQL systems. In addition to retaining the table-stakes requirements needed to support the broad ecosystem of SQL tools, close attention must be paid to new analytics requirements, such as working with fast-changing data models and semi-structured data, and achieving low latencies at big data scale.
Please join us for an informative talk on May 14th, where MC Srivas from MapR Technologies will discuss how Apache Drill is driving this audacious goal of delivering SQL capability natively on Hadoop/NoSQL environments, without compromising the flexibility of Hadoop/NoSQL or the low-latency needs of BI/analytics. You’ll also hear about the exciting challenges that the Drill community is working on, as well as a progress report and project roadmap review. Here are some of the topics that will be covered in the discussion:
• Can I reduce the time to value for my business users on Hadoop data?
• How can I do SQL on semi-structured data?
• How do you use runtime compilation when you don’t know the schema at planning time?
• How does columnar execution apply when using complex data types?
• When moving from traditional MPP scale to Hadoop scale, what distributed-system problems do you have to solve?