What is Storm?
You may know of Storm if you have an interest in real-time applications. Storm makes it easy to reliably process very large streams of data, so you may find it particularly useful when large volumes of data arrive quickly or when you need to update a dashboard instantly. For example, Storm is a good technology choice for analysis of sensor data or for call detail records (CDR) in telecom. Storm is scalable and fault tolerant, with benchmarks reported at more than a million tuples per second per node. Another aspect that makes Storm appealing is that it integrates with queuing, database and big data technologies you may already use.
How does a project become an Apache project?
Storm was originally started by Nathan Marz when he worked at BackType, before that company was acquired by Twitter. Since the acquisition, Twitter and other companies have used Storm for a variety of internal processes. Given that Storm was already an active and fairly widely adopted open source project, you may be surprised to learn that becoming an Apache project involves more than a single decision. In fact, the move involves lots of people and a fair bit of work.
“Joining Apache is a multi-step process,” says Ted Dunning, MapR’s Chief Application Architect and one of five people nominated as mentors for Apache Storm. The project champion for Storm at Apache is Doug Cutting. Together, the mentors and champion will facilitate Storm’s transition to Apache. “I’m pleased to have the opportunity to contribute to this outstanding project, even though I won’t be able to help much by actually coding. Storm fills an important gap in the Apache ecosystem.”
“Now that Storm’s proposal to join Apache has been approved, several things are happening to establish the incubator project,” Ted explained. These include:
- Setting up mailing lists for subscription, a website and status pages
- Arranging the Apache license agreement
- Checking for trademark issues with the name
- Verifying code licenses are clear for all pre-existing code
- Establishing a source code version control system
- Building the first status report
Ted went on to describe another requirement for an incubator project to graduate to top-level status: more than one significant release. Storm has already had four significant code releases, but these were outside of Apache. “Apache sets a high bar on careful review of licensing, and getting this exactly right is one of the things that new incubator projects often struggle with,” he explained.
What is the advantage of joining Apache?
Reputation is one benefit: companies know that projects in Apache must meet rigorous standards with respect to process, and that provides confidence in the software. Another advantage is that the requirement for a strong community means that Storm is likely to outlast the attention span of any single developer. Both of these factors are crucial for an open source project to gain widespread adoption, especially in commercial settings.
How do you use Storm?
In some cases raw data is not what you want to store, so you might use Storm to aggregate, compress or reformat input data before you actually store anything. Storm can be used to convert input data to space-conserving formats such as Parquet or ORC files. Storm could also be used to create summaries from streaming data, such as unique counts per minute or per-minute totals, with output stored in HBase or MapR M7 tables, directly in the MapR file system. Storm’s focus on real-time is a good match with the real-time nature of the MapR file system (MapR-FS), which works somewhat differently from the HDFS of other Hadoop distributions.
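As a rough illustration of the per-minute summaries mentioned above, here is a minimal sketch in plain Python. It does not use the Storm API; it only shows the kind of windowed logic a processing step might carry out. The function name `minute_summaries`, the `(timestamp, key)` event shape, and the sample data are all assumptions made for this example, and the sketch assumes events arrive in timestamp order.

```python
def minute_summaries(events):
    """Yield (minute_start, total, unique_count) for each one-minute window.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    Assumption: events arrive in non-decreasing timestamp order.
    """
    current = None   # start (in seconds) of the window being accumulated
    total = 0        # number of events seen in the current window
    seen = set()     # distinct keys seen in the current window
    for ts, key in events:
        minute = int(ts // 60) * 60
        if current is not None and minute != current:
            # The window has closed: emit its summary and start a new one.
            yield current, total, len(seen)
            total, seen = 0, set()
        current = minute
        total += 1
        seen.add(key)
    if current is not None:
        # Emit the final, still-open window once the input ends.
        yield current, total, len(seen)


# Hypothetical sample input: (seconds since start, user id)
events = [(0, "a"), (10, "b"), (59, "a"), (60, "a"), (125, "c"), (130, "c")]
summaries = list(minute_summaries(events))
```

Only the compact summaries would be written to storage; the raw events are discarded once each window closes, which is the space saving the paragraph above describes.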
One of the challenges with Storm is using it at scale with Apache Hadoop applications. Hadoop’s heritage is batch processing, and this is reflected in the way that HDFS doesn’t support real-time reading of data as it is being written. This can make it difficult to efficiently marry the real-time streaming properties of Storm with Hadoop.
MapR’s distribution for Hadoop has been re-engineered to work differently, with some special capabilities in the file system layer. The MapR file system is fully read/write and real-time, which makes it possible to use the MapR file system directly as a real-time queue for incoming data. As a result, data can be streamed directly to the cluster to be processed by Storm without an intermediate step on a separate server.
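The idea of treating a file as a queue can be sketched with an ordinary local file: a writer appends records while a reader remembers its byte offset and picks up whatever has been appended since its last read. This is only an analogy written in plain Python. It does not use any MapR or Storm API, and the helper `read_new_lines` and the file name `incoming.log` are hypothetical names chosen for the example.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return (complete lines appended since `offset`, new offset)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only up to the last newline so that a record that is still
    # being written is left in place until it is complete.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()
    return lines, offset + end


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "incoming.log")

    # The writer appends two full records and the start of a third.
    with open(path, "ab") as writer:
        writer.write(b"event-1\nevent-2\nev")
    first, offset = read_new_lines(path, 0)

    # The writer finishes the third record; the reader resumes from its offset.
    with open(path, "ab") as writer:
        writer.write(b"ent-3\n")
    second, offset = read_new_lines(path, offset)
```

Here the first read returns the two completed records and the second read returns the third. The reader never needs a separate queuing server, only a file it can read while the writer is still appending to it.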
For more information about Apache Storm, see the incubator proposal at http://wiki.apache.org/incubator/StormProposal.
Consider the O’Reilly book Getting Started with Storm by Jonathan Leibiusky, Gabriel Eisbruch, and Dario Simonassi.
See slides for a talk on “Real-time Storm + Hadoop” delivered to the Bay Area Storm user group by Ted Dunning in June 2013.