Big Data. Everyone from your Uncle Bill to your local mail carrier has heard the term in some form or another. However, the plethora of big data media coverage brings with it a lot of terminology that can be confusing, especially when it comes to data processing.
Here’s some orientation to key data processing terms you will encounter once you start to look deeper into the Big Data processing field:
1. Data stream management systems vs. database management systems
A data stream management system (DSMS) manages a continuous, potentially endless flow of data. In the big data world, streaming data is analyzed and transformed in memory as it arrives, using tools such as Apache Storm. This is similar to a “running” or “conveyor belt” sushi restaurant, where the sushi is placed on an ongoing “stream” of plates that wind through the restaurant. In this example, you have a seemingly endless supply of sushi that passes before you, and you make your choice from a line of sushi plates that appear in sequential order.
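To make the conveyor-belt idea concrete, here is a minimal sketch in Python of in-memory stream processing. The endless sensor stream and the rolling-average transformation are hypothetical stand-ins for what a real DSMS such as Apache Storm would do at scale:

```python
import itertools
import random

def sensor_stream():
    """Simulate an endless stream of readings -- the conveyor belt of plates."""
    while True:
        yield random.uniform(15.0, 25.0)  # e.g. a temperature reading

def rolling_average(stream, window=5):
    """Analyze the stream in memory as it flows past: emit a rolling average."""
    buffer = []
    for reading in stream:
        buffer.append(reading)
        if len(buffer) > window:
            buffer.pop(0)
        yield sum(buffer) / len(buffer)

# The stream never ends, so we only ever consume a finite slice of it.
for avg in itertools.islice(rolling_average(sensor_stream()), 3):
    print(f"rolling average: {avg:.2f}")
```

Note that the consumer never waits for the stream to "finish" — it processes each item as it passes by, which is exactly the sushi-belt property.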
A database management system (DBMS) is a systematically organized repository of indexed information that allows for easy retrieval, updating, analysis, and output of data. This type of processing includes relational databases, NoSQL databases, etc. A DBMS is similar to those organized bins of treats in a candy store – each bin is a repository of a certain kind of candy. Your choices are limited to what’s in the available containers, and you can randomly pick candy from each bin.
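The candy-store access pattern — random retrieval from organized bins — is easy to show with a relational database. This sketch uses Python's built-in sqlite3 module; the candy table and its contents are invented for illustration:

```python
import sqlite3

# An in-memory relational database: each row is a 'bin' of candy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candy (bin TEXT, kind TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO candy VALUES (?, ?, ?)",
    [("A", "gummy bears", 120), ("B", "licorice", 80), ("C", "toffee", 45)],
)

# Random access: pick from any bin, in any order, whenever you like.
row = conn.execute(
    "SELECT kind, count FROM candy WHERE bin = ?", ("B",)
).fetchone()
print(row)  # ('licorice', 80)
```

Unlike the sushi belt, nothing moves past you here: the data sits still, indexed, until you ask for it.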
A summary of the major differences between the running sushi (DSMS) and candy store (DBMS) analogies:
DBMS (candy store): selection is limited by the available containers, and you can pick from any bin at random.
DSMS (running sushi): selection is sequential, as each item passes by.
2. Batch vs. Interactive Mode
Batch processing refers to the execution of jobs without manual intervention: once a job is set up, it runs to completion without any human interaction. For example, a program that reads a large file and automatically generates a report is a batch job.
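A minimal batch job might look like the following sketch: it reads an input file and writes a summary report end to end, with no human in the loop. The orders.csv file and its (order_id, amount) layout are invented for the example:

```python
import csv
import os
import tempfile

def run_batch(input_path, report_path):
    """Read the input file, aggregate it, and write a report -- no prompts."""
    lines, total = 0, 0
    with open(input_path) as f:
        for row in csv.reader(f):
            lines += 1
            total += int(row[1])
    with open(report_path, "w") as out:
        out.write(f"records={lines}\ntotal={total}\n")

# Set up a small sample input (standing in for a large data file).
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "orders.csv")
report_path = os.path.join(workdir, "report.txt")
with open(input_path, "w") as f:
    f.write("1001,25\n1002,40\n1003,15\n")

run_batch(input_path, report_path)
print(open(report_path).read())
```

In a production setting a scheduler (cron, a workflow engine) would kick this off; the defining property is simply that nobody has to sit at a keyboard while it runs.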
Fun fact: in the early days of computer programming, engineers created, edited and stored their programs line by line on punched cards. These card decks were stacked on top of each other in the hopper of a card reader, and were run in batches.
Interactive mode refers to software that receives human input in the form of commands or data. The interaction can be direct or indirect — via a command-line interface, a GUI, or a sensor.
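The contrast with batch is easiest to see in a tiny command loop. In this hypothetical sketch the commands are passed in as a list so the example is self-contained; in a real CLI they would arrive one line at a time from sys.stdin:

```python
def repl(commands, respond=print):
    """A minimal interactive loop: each command is a piece of human input."""
    for cmd in commands:
        if cmd == "quit":
            respond("bye")
            break
        elif cmd.startswith("echo "):
            respond(cmd[5:])  # echo back whatever follows the command word
        else:
            respond(f"unknown command: {cmd}")

# Simulate a short human session.
repl(["echo hello", "quit"])
```

The program cannot proceed until the human supplies the next command — the opposite of a batch job, which must never stop to ask.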
3. Real-Time vs. Latency
In real-time computing, you are guaranteed a response within strict time constraints. Latency refers to the delay between receiving an input and producing the corresponding visible output. In the big data world, low latency means the delay is small enough that the response looks and feels instantaneous.
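Latency is straightforward to measure: time the gap between issuing a request and seeing its result. This sketch uses a local computation as a stand-in for whatever operation you care about:

```python
import time

def measure_latency(operation):
    """Return an operation's result and the delay between request and response."""
    start = time.perf_counter()
    result = operation()
    latency = time.perf_counter() - start
    return result, latency

# Stand-in workload; in practice this would be a query or service call.
result, latency = measure_latency(lambda: sum(range(1_000_000)))
print(f"result={result}, latency={latency * 1000:.2f} ms")
```

Note the distinction: a low-latency system usually responds fast, while a real-time system guarantees a response within a deadline — a slow answer from a real-time system is treated as a failure, not an inconvenience.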
Streaming and Hadoop
In Hadoop-land the term ‘streaming’ is heavily overloaded; it can mean different things, depending on the context:
Stream processing system
A scalable DSMS that processes streaming data and typically comes with some sort of HDFS integration or support.
Streamed (non-blocking) result delivery
Intermediate or final processing results are made available upstream in a non-blocking fashion, constrained by available main memory (otherwise they spill to disk). Instead of writing the intermediate result out to HDFS as a Map task does, it is typically communicated via an RPC call to an aggregation point.
Apache Drill, Impala, Apache Tez and other systems use this technique.
Hadoop Streaming
A utility that lets you write Hadoop MapReduce jobs in any programming language that can read input from stdin and emit output to stdout. This overcomes the Java API dependency and is somewhat analogous to what CGI was for web applications. Note: we’re still talking about batch MapReduce jobs here.
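The stdin/stdout contract just described (Hadoop Streaming) can be sketched as a minimal word-count map step. The mapper reads raw text lines and emits tab-separated key/value pairs, which Hadoop then shuffles to the reducers:

```python
import sys

def mapper(lines, out=sys.stdout):
    """Hadoop Streaming map step: read raw lines from stdin, emit one
    tab-separated (word, 1) pair per word to stdout."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

if __name__ == "__main__":
    # Under Hadoop this script would be wired in via the streaming jar, e.g.:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
    mapper(sys.stdin)
```

Because the only interface is stdin and stdout, the same script works in Python, Ruby, Perl, or any other language — which is the whole point of the streaming utility.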
If you’d like to dig a little deeper and learn more about streaming and data management, have a look at the following resources: