Editor's Note: If you're interested in learning more about how streaming data can give you a competitive advantage, be sure to read the free O'Reilly ebook, Streaming Architecture: New Designs Using Apache Kafka and MapR Streams by Ellen Friedman and Ted Dunning.
Actionable insights from real time analytics
That’s a goal for many new projects being designed to make use of streaming data, and it’s no wonder so many organizations are aiming at this prize. If you can develop programs to process streaming data with real time or near real time analytics, you gain the ability to react to life as it happens. In many cases, this ability to react in the moment gives you a great advantage. But it may surprise you to know that getting real time insights is just one of the benefits of adopting a streaming style approach to big data, albeit a significant one.
When do real time analytics give you a competitive edge?
A wide variety of specific use cases demonstrate the time-value of data and the advantage of being able to analyze and react to events in a timely manner. Fraud detection and cyber security, for example, benefit from being able to recognize anomalous behavior as it happens and being ready to respond quickly to an attack. If your system can detect a suspicious pattern of events during login on a banking website, for instance, you may be able to block the intruder. The as-it-happens big data approach improves your chance to shut down these types of theft before losses mount up. Reduced risk can not only yield significant financial savings from averted losses but also improve customer confidence and satisfaction, thus increasing customer retention.
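To make the login example concrete, here is a minimal sketch of one way to spot a suspicious pattern as events arrive: count failed logins per account inside a sliding time window. The class name, the event fields, and the thresholds (three failures in 60 seconds) are assumptions chosen for illustration, not values from any real fraud detection system.

```python
from collections import defaultdict, deque


class FailedLoginDetector:
    """Flags an account when too many failed logins land inside a short window."""

    def __init__(self, max_failures=3, window_seconds=60.0):
        # Both thresholds are illustrative assumptions.
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # account -> timestamps of recent failures

    def observe(self, account, timestamp, success):
        """Process one login event; return True if the account looks suspicious."""
        if success:
            return False
        recent = self._failures[account]
        recent.append(timestamp)
        # Drop failures that have aged out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_failures


detector = FailedLoginDetector(max_failures=3, window_seconds=60.0)
events = [
    ("alice", 0.0, False),
    ("alice", 10.0, False),
    ("bob", 12.0, False),
    ("alice", 20.0, False),   # third failure for alice within 60 seconds
    ("alice", 200.0, False),  # the earlier failures have expired by now
]
flags = [detector.observe(*event) for event in events]
print(flags)  # [False, False, False, True, False]
```

Because each event is evaluated the moment it arrives, the fourth event can trigger a block immediately rather than after a nightly batch job.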
Another simple but powerful illustration of the time-value of data is the mobile navigation tool known as Waze. The user interface is depicted in Figure 1.
Figure 1: The mobile application known as Waze uses crowd-sourced information on current traffic and road conditions to better inform your choices of route. (Image credit: © Ted Dunning 2016)
While it’s interesting to know your own current speed, as determined by your car’s speedometer, there’s great value in the aggregated knowledge of the current speed of many drivers ahead of you on potential routes – the view you get from Waze. The value of this knowledge to you depends largely on getting it in the moment. You want to know the speed of traffic ahead now, not what the traffic conditions were an hour or more earlier.
Being able to gain real time or near real time insights from large-scale data depends in part on designing a system around a streaming architecture. To build these systems, there’s been a lot of interest in fast stream processing systems such as Apache Spark Streaming, with its in-memory near real time capability via micro-batching, or truly real-time streaming tools such as Apache Storm and the newer and increasingly popular Apache Flink. These technologies for low latency data processing are hugely valuable, but they are only part of what is needed.
At the heart of an effective streaming system is a well-chosen messaging technology. For best results you need a tool capable of:
- high throughput for multiple consumers of the data
- re-starting from a specific event
- efficient and reliable geo-distributed replication
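The first two capabilities in this list can be illustrated with a toy, in-memory model of a persistent log. This is only a sketch of the idea, assuming that each event receives a sequential offset and that each consumer tracks its own position. `EventLog` and `Consumer` are hypothetical names and are not the Kafka or MapR Streams API. Geo-distributed replication is not modeled here.

```python
class EventLog:
    """Append-only log in which every event gets a sequential offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read(self, offset, max_events=100):
        # Reading never removes events, so any number of consumers can share the log.
        return self._events[offset:offset + max_events]


class Consumer:
    """Reads from a log while tracking its own position independently."""

    def __init__(self, log, start_offset=0):
        self.log = log
        self.offset = start_offset

    def poll(self, max_events=100):
        batch = self.log.read(self.offset, max_events)
        self.offset += len(batch)
        return batch


log = EventLog()
for i in range(5):
    log.append(f"event-{i}")

analytics = Consumer(log)                 # one consumer reads a few events...
archiver = Consumer(log)                  # ...while another reads everything
first_batch = analytics.poll(max_events=3)
everything = archiver.poll()
replay = Consumer(log, start_offset=3).poll()  # restart from a specific event

print(first_batch)  # ['event-0', 'event-1', 'event-2']
print(replay)       # ['event-3', 'event-4']
```

The design choice that matters is that the log, not the consumer, owns the data: consumers only hold an offset, which is what makes both fan-out and replay cheap.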
Apache Kafka is a great choice for a scalable message-passing tool with a good fit for most of these capabilities. Another option is the new technology known as MapR Streams – it’s a high performance, scalable messaging feature integrated into MapR’s converged data platform. MapR Streams is based on the Kafka API and adds even stronger capabilities for geo-distributed replication and higher performance.
Value beyond real time analytics
With new streaming style architecture that employs an effective messaging tool, it is possible to gain advantages from streaming data that go even beyond what you get from real time analytics. It may be instinctive with streaming data to think in terms of analyzing it as data streams by and then just discarding it, but this “use-it-and-lose-it” approach throws away some potentially great additional value.
A messaging system that can persist event data and efficiently replicate it to geo-distributed locations provides valuable additional options. Recently, at the Strata conference in Singapore, my attention was drawn to several sectors where these additional benefits of properly handled streaming data can really pay off. These include telecommunications, shipping, and finance.
Telecommunication companies have many different needs for streaming data. One is to monitor quickly changing network usage patterns by analyzing streaming event data communicated between cell towers and millions of cell phone and other network users. By handling many thousands or even millions of event messages per second across several towers in different locations, a telecom can reconfigure network support, engaging auxiliary towers to handle short-term surges, such as heavy usage near a stadium during a sporting event or another local, news-heavy situation. The high throughput performance coupled with geo-distributed replication capability of MapR Streams is ideally suited to meet these needs.
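As a rough sketch of the surge monitoring described above, the following counts events per tower in fixed (tumbling) time windows and reports towers that exceed a capacity threshold. The function name, the tower names, the one-second window, and the capacity value are all assumptions for illustration. A production system would process an unbounded stream incrementally rather than a small list.

```python
from collections import Counter


def overloaded_towers(events, window_seconds, capacity):
    """Return {window_start: [tower_id, ...]} for towers exceeding capacity.

    events is an iterable of (timestamp, tower_id) pairs.
    """
    counts = Counter()
    for timestamp, tower_id in events:
        # Assign each event to the fixed window that contains its timestamp.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, tower_id)] += 1

    surges = {}
    for (window_start, tower_id), n in sorted(counts.items()):
        if n > capacity:
            surges.setdefault(window_start, []).append(tower_id)
    return surges


events = [
    (0.2, "stadium"), (0.5, "stadium"), (0.7, "suburb"),
    (0.9, "stadium"), (1.1, "stadium"), (1.4, "suburb"),
]
surges = overloaded_towers(events, window_seconds=1, capacity=2)
print(surges)  # {0: ['stadium']}
```

A result like this is the signal that could prompt engaging an auxiliary tower near the stadium for the duration of the surge.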
Transportation and shipping industries are also on the lookout for more effective ways to carry out low latency analysis of sensor data from equipment in planes, trucks, cars, and ships as well as to provide granular monitoring of logistical data coming from sensors on shipping containers.
Figure 2: Singapore is a high-rise, high-tech city with a variety of large-scale streaming data use cases in its financial, telecommunications, and transportation industries. Up to 30% of the world’s shipping passes through its port, and on any given day, for instance, a forest of hundreds to thousands of containers are loaded and unloaded from ships, providing a vast quantity of data from sensors tracking logistics of the containers. (Image credit: © E. Friedman 2015)
How could streaming data be efficiently handled in this IoT shipping example? It comes into play in several ways. Consider the stakeholders for international container ships: the shipping company that owns the ships, the customers whose goods rest in the containers, the port authorities at different locations, dockworkers and more. They will all be interested in streaming data for different reasons.
Sensors on the containers can send data to track their location at any time. This helps the shipping company plan logistics and changing availability of shipping capacity as they run their business. The same data is of interest to the people who are sending or receiving the contents of the containers – when will their goods arrive? Cargo handlers and dock authorities at different ports need to know exactly how many and which containers have been loaded or unloaded. Sensors on the ship can report environmental conditions such as moisture and temperature, so that the shipper or insurance company knows the conditions to which cargo was exposed.
All this is streaming data. In some cases it may be collected and stored on an on-board cluster; in others it is reported to a data center at a particular port. With MapR Streams, you can not only ingest streaming data from many sources and make it available to multiple consumers, but also efficiently copy it to clusters in other locations, such as ship-to-shore or port-to-port. This is an extraordinary advantage.
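The ship-to-shore idea can be sketched conceptually as incremental replication: a replicator remembers how far it has copied and ships only the events added since the last sync. This is not how MapR Streams is implemented or invoked. Plain Python lists stand in for the on-board and port clusters, and the `Replicator` class, container IDs, and sensor fields are hypothetical.

```python
class Replicator:
    """Copies new events from a source log to a replica, remembering its position."""

    def __init__(self, source, replica):
        self.source = source
        self.replica = replica
        self.next_offset = len(replica)  # resume where the replica left off

    def sync(self):
        """Copy events written since the last sync; return how many were copied."""
        new_events = self.source[self.next_offset:]
        self.replica.extend(new_events)
        self.next_offset += len(new_events)
        return len(new_events)


# Sensor readings collected on board (illustrative values).
ship_log = [
    {"container": "C1", "temp_c": 4.1},
    {"container": "C2", "temp_c": 3.8},
]
shore_log = []

link = Replicator(ship_log, shore_log)
copied_first = link.sync()   # both readings are copied

ship_log.append({"container": "C1", "temp_c": 4.6})
copied_second = link.sync()  # only the new reading is copied

print(copied_first, copied_second)  # 2 1
print(shore_log == ship_log)        # True
```

Because only the tail of the log crosses the link on each sync, an intermittent ship-to-shore connection can catch up after an outage without resending what the port already holds.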
The ability to make use of streaming data provides flexibility and agility in data-driven decisions. Faster response to changing situations can improve time efficiency in many types of transportation, provide more agile response to customer requests for cargo, and reduce fuel costs by adjusting routes in response to live events such as traffic or environmental conditions. In this and many other sectors, data streams have a lot to offer.