Apache Apex on MapR Converged Platform

Editor's Note: This guest post was co-authored by Thomas Weise

In today’s world of immense competition and customer churn, Telecom providers are reinventing and transforming their businesses in order to provide their customers with the best possible customer care and satisfaction. The primary goal is to minimize this churn and increase customer lifetime value. The ability to connect structured and unstructured sources of data, combined with the power of a converged data platform, provides unprecedented insight to Telecom providers, who are then able to better serve their consumers.

However, significant technical and operational challenges must be overcome in order to accomplish this. Running such a system in production 24/7, at scale, and in a fully fault-tolerant manner adds further layers of complexity. This is where most companies are brought to their knees. Fortunately, the MapR Converged Data Platform, combined with Apache Apex, can address these challenges.

In this blog post, you’ll learn about three key aspects:

  1. A Telecom use case where Call Data Records (CDR) are received by MapR Streams and sent to the Apache Apex core processing framework for analytics such as de-duping and dimensional compute, and then are sent to the DataTorrent visual console for viewing.
  2. Details on the two technologies, whose technical strengths and architectures complement each other.
  3. An introduction to Apache Apex and its focus on an enterprise-grade, fully fault-tolerant data processing engine, which provides an architecture that is robust and can scale to meet the stringent needs in Telecom and in other mission-critical use cases.

Telecom Use Case

In the Telecom industry, acquiring customers is expensive, but churn is even more expensive. Telecom companies lose customers due to the following reasons:

  • Dropped calls
  • Lack of network coverage, resulting in poor customer experience
  • Bandwidth issues
  • Poor download times
  • Inordinate service wait times
  • Poorly trained customer service reps
  • Inadequate call center staffing

Analyzing these trends in real time, using data from both streaming and static sources, is the key to gaining insight into the operational effectiveness of the network and to reacting in time to improve customer satisfaction.

This use case illustrates a scenario to address the above issues. There is a continuous stream of customer Call Data Records (CDR) and support call center statistics. The provider wants to provide a better customer experience by proactively evaluating performance and taking corrective actions. Analytics include monitoring of dropped calls, bandwidth usage patterns, service wait times across different service centers, tracking of customer satisfaction upon service call completion, etc.

In this scenario, Apex uses MapR Streams to receive the Telecom CDR records and call center statistics, which are then processed and enriched. Various metrics for multiple dimensions are computed and stored. The dashboard visualization works directly off the Apex application without the need to write the results to an external store. The data visualizations update continually in real time as new data is processed. Dashboard widgets also support user-defined queries, such as displaying all dropped calls or service wait times for a selected region.
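
To make "metrics for multiple dimensions" concrete, here is a minimal sketch in plain Python. It is not the Apex API and not how the application described here is implemented. The record fields (`region`, `carrier`, `dropped`) and the function names are assumptions chosen for illustration, since this post does not give a CDR schema. The idea is to keep one running aggregate per combination of dimension values, so that a dashboard query for a selected region becomes a simple lookup.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions; the post does not specify which fields are used.
DIMENSIONS = ("region", "carrier")


def dimension_keys(record):
    """Yield one key per combination of dimensions, including the global (empty) one."""
    for size in range(len(DIMENSIONS) + 1):
        for combo in combinations(DIMENSIONS, size):
            yield tuple((name, record[name]) for name in combo)


def aggregate(records):
    """Count total and dropped calls for every dimension combination."""
    totals = defaultdict(lambda: {"calls": 0, "dropped": 0})
    for record in records:
        for key in dimension_keys(record):
            totals[key]["calls"] += 1
            if record["dropped"]:
                totals[key]["dropped"] += 1
    return dict(totals)


def query(totals, **filters):
    """Look up the aggregate for the selected dimension values, e.g. region="west"."""
    key = tuple((name, filters[name]) for name in DIMENSIONS if name in filters)
    return totals.get(key, {"calls": 0, "dropped": 0})
```

With this layout, `query(totals, region="west")` returns the call and dropped-call counts for that region without rescanning the records. A streaming engine would update the same aggregates incrementally as each record arrives.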

Since the application solely relies on Apex and transitively on components of MapR (Streams, YARN, MapR-FS), it can leverage the scalability, performance, and operability of the underlying infrastructure.

In this specific use case, CDR data coming from MapR Streams is ingested by the Input Operator and then enriched. In the next stage of the pipeline, the data is tagged by Geo and the relevant KPIs are calculated. After this calculation, the CDR data and the Geo tags are stored in MapR-FS. The final output is then displayed in a visualization UI, which shows the key metrics to users so that they can take appropriate business action.
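
The shape of this pipeline can be sketched with chained Python generators. This is an illustration only: a real Apex application is written in Java as a graph of operators. The lookup tables, field names, and the dropped-call-rate KPI below are hypothetical. Each stage consumes records from the previous one, which mirrors the ingest, enrich, Geo-tag, and KPI flow described above.

```python
from collections import Counter

# Hypothetical reference data; the post does not describe the enrichment sources.
SUBSCRIBER_PLANS = {"555-0100": "prepaid", "555-0101": "postpaid"}
CELL_TOWER_GEO = {"tower-1": "north", "tower-2": "south"}


def enrich(records, plans=SUBSCRIBER_PLANS):
    """Attach the subscriber's plan to each raw CDR."""
    for record in records:
        yield {**record, "plan": plans.get(record["caller"], "unknown")}


def tag_geo(records, tower_geo=CELL_TOWER_GEO):
    """Tag each record with the Geo of the cell tower that handled the call."""
    for record in records:
        yield {**record, "geo": tower_geo.get(record["tower"], "unknown")}


def compute_kpis(records):
    """Consume the stream; return the tagged records and the dropped-call rate per Geo."""
    calls, dropped, tagged = Counter(), Counter(), []
    for record in records:
        tagged.append(record)
        calls[record["geo"]] += 1
        if record["dropped"]:
            dropped[record["geo"]] += 1
    rates = {geo: dropped[geo] / calls[geo] for geo in calls}
    return tagged, rates


def run_pipeline(raw_records):
    """Wire the stages together: enrich, then tag by Geo, then compute KPIs."""
    return compute_kpis(tag_geo(enrich(raw_records)))
```

In this sketch, the `tagged` list stands in for what would be written to MapR-FS, and `rates` stands in for the metrics shown in the visualization UI.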

MapR Streams connects data producers and consumers worldwide in real time, with unlimited scale. Publishers (data producers) write data to one or more topics in MapR Streams. Subscribers (data consumers) to the topic can read the data instantaneously. It’s important to note that MapR Streams is the only big data-scale streaming system that’s built into a converged data platform.
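
The publish/subscribe model can be illustrated with a toy in-memory topic. This is not the MapR Streams or Kafka client API. The `Topic` class and its methods are invented for this sketch, which only shows the general idea of an append-only log in which every subscriber keeps its own read position.

```python
class Topic:
    """Toy append-only log; each subscriber tracks its own read offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, subscriber, max_messages=10):
        """Return the next unread messages for this subscriber and advance its offset."""
        start = self._offsets.get(subscriber, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch
```

Because offsets are kept per subscriber, a dashboard and an archiver can read the same topic independently, each at its own pace.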

Apex and MapR Converged Platform

MapR has traditionally focused on high performance and enterprise readiness, with the file system (MapR-FS) as an important foundation. As new components are added around it, the converged platform becomes a complete, integrated stack that covers all the infrastructure needs of big data applications. The most recent example is the addition of MapR Streams, which adds the missing messaging piece to the puzzle. At the very core are unique capabilities such as cross-datacenter, geo-distributed replication and failover. Several key themes resonate with DataTorrent RTS and Apex, making the MapR platform an ideal complement to the application framework layer:

  • Focus on fault tolerance, high availability, high performance and SLA support. Focus on enterprise-grade operability.
  • Direct support for MapR Streams in Apex via Kafka 0.9 API support. Jointly developed, certified, and benchmarked with MapR. An example project can be found here.
  • DataTorrent RTS adds the capability to build real-time visualization for use cases that are MapR-focused and provides the management tool.
  • Apex is compatible with MapR-FS through the Hadoop File System interface, and this support has been certified.

Introduction to Apache Apex

Apache® Apex (http://apex.apache.org/) is a data-in-motion processing platform that helps to unlock the potential of Hadoop by providing a framework for application development that enables more use cases. Apex comes equipped with essential capabilities for low-latency processing of unbounded data, horizontal scalability, high availability, operability, and the ability to integrate with the existing enterprise infrastructure through a comprehensive library of connectors and functional building blocks.

Apex development started in 2012 at DataTorrent, and it was built to run natively on Hadoop. The compute resources are scheduled and processes are managed through YARN, and the Hadoop file system is used to checkpoint its state. With its Hadoop native architecture, Apex can take full advantage of the underlying infrastructure and can pass on the benefits to the user.

Hadoop infrastructure originally supported only MapReduce as an application framework, which limited use cases and adoption. Since the introduction of Hadoop 2.0, new frameworks provide greater flexibility and capabilities to address a broader set of use cases, and an ever-increasing number of alternatives aim to fill this gap in the wider big data ecosystem. Apache Apex has an engine for fast and scalable in-memory stream processing. It caters to real-time and batch processing use cases. Apache Apex aims to deliver superior performance, enterprise readiness, and a low barrier to entry. Some of the key value points are summarized below:

  • Fault tolerance and high availability: Apex guarantees no loss of data and computational state and exactly-once semantics. In the event of failure, automatic recovery will restore state and resume processing.
  • High performance, low-latency stream processing engine: suitable for low millisecond SLA requirements.
  • Scalability: provides advanced partitioning schemes and configuration-driven platform behavior that don’t require a rewrite of the business logic.
  • Java-based: accessible to a widely available application development skillset and ecosystem of third-party software. The API is easy for Java developers to adopt and allows reuse of existing functionality.
  • Separation of functional and operational specification: no paradigm restriction (like MapReduce) from a platform perspective. This significantly improves the success rate of a big data project.  
  • A broad library of ready-to-use building blocks: integrates well with other technologies that solve problems in adjacent spaces (databases, messaging, etc.).
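
The fault-tolerance point above can be made concrete with a small simulation. This is a conceptual sketch only, not Apex's actual recovery mechanism. The operator, the checkpoint interval, and the `fail_at` crash injection are all hypothetical. The sketch snapshots the operator state together with the input offset, and after a simulated failure it restores both and replays the input from that offset. Because state and offset roll back together, each record affects the final counts exactly once.

```python
import copy


class CountingOperator:
    """Counts dropped calls per region; state is a plain dict so it can be snapshotted."""

    def __init__(self):
        self.counts = {}

    def process(self, record):
        if record["dropped"]:
            region = record["region"]
            self.counts[region] = self.counts.get(region, 0) + 1


def run(source, checkpoint_every=2, fail_at=None):
    """Process a replayable list of records, optionally simulating one crash at index fail_at."""
    operator = CountingOperator()
    checkpoint = (0, {})  # (next offset to read, snapshot of operator state)
    offset = 0
    crashed = False
    while offset < len(source):
        if fail_at is not None and offset == fail_at and not crashed:
            crashed = True
            # Simulated failure: in-memory state is lost, so restore the last checkpoint
            # and resume reading from the offset saved with it.
            offset, saved = checkpoint
            operator = CountingOperator()
            operator.counts = copy.deepcopy(saved)
            continue
        operator.process(source[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(operator.counts))
    return operator.counts
```

Running the same input with and without a simulated crash yields identical counts. That is the property the exactly-once guarantee describes for computational state, and it depends on the input being replayable from a saved position.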

DataTorrent RTS provides components on top of Apex to enhance the user experience. For monitoring, management, and debugging of Apex applications, the management console (dtManage) provides a full-featured GUI. It is powered through an open and certified REST API, which is part of the management service (dtGateway). It extends the operability focus of Apex with a frictionless installation process, simplified administration, and comprehensive security support with a variety of authentication mechanisms, plus RBAC.

The other important component in the RTS offering is a data visualization framework (dtDashboard) that is designed to fill another gap for a complete user experience in Hadoop. The user can define dashboards that visualize the data as it is being processed by Apex applications.


To conclude, Apache Apex is an enterprise-grade application framework for stream processing and analytics. It is used in production at large Fortune 100 companies, in mission-critical and business-critical applications. The MapR Converged Data Platform (the infrastructure layer), combined with Apache Apex (the application layer), can provide compelling advantages for use cases where high performance, high availability, and no data loss are must-have requirements.


