Every business has some kind of database. It’s right up there with word processors and spreadsheets as essential business software. Relational databases are one of the most popular types, but as useful as they are, they’re not necessarily the best fit for every situation. NoSQL databases are becoming popular because they can handle different types of data at scale more efficiently. One area where NoSQL really shines is capturing time series data.
What is Time Series Data?
Time series data is, as the name suggests, data that is captured in a series over time—in other words, individual snapshots of larger, long-term trends. The time relationships between the data points add meaningful value to the entire data set.
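To make the idea concrete, here is a minimal sketch in Python of a time series as a list of timestamped readings. The `Reading` type, the `changes_per_hour` helper, and the temperature values are all made up for illustration. The point is that the rate of change only exists because each value carries a timestamp; the sketch assumes no two readings share the same timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Reading:
    """One snapshot: when it was taken and what was measured (hypothetical type)."""
    timestamp: datetime
    value: float

def changes_per_hour(readings):
    """Rate of change between consecutive readings, in value units per hour."""
    ordered = sorted(readings, key=lambda r: r.timestamp)
    rates = []
    for prev, curr in zip(ordered, ordered[1:]):
        hours = (curr.timestamp - prev.timestamp).total_seconds() / 3600
        rates.append((curr.value - prev.value) / hours)
    return rates

# Illustrative temperature readings, not real measurements.
start = datetime(2015, 1, 1, 6, 0)
temps = [
    Reading(start, 10.0),
    Reading(start + timedelta(hours=1), 12.0),
    Reading(start + timedelta(hours=3), 18.0),
]
rates = changes_per_hour(temps)
```

Without the timestamps, the three values 10, 12, and 18 would say nothing about how quickly the temperature rose.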
Time series databases show up in places that we normally wouldn’t even think of as databases. A movie is a time series database. A succession of still shots, when played back at 24 frames per second, reveals a complete story.
One of the early examples of time series data discussed in our free ebook on time series databases is the topic of weather. You might notice the weather on a given day, but the weather doesn’t happen in isolation. If you continually track the weather—wind speed, cloud cover, air pressure, temperature, and so on—over time, you’ll get a view of the general climate of a given region. Meteorologists at the National Weather Service use time series data to feed into their models to improve their forecasts. As Nate Silver of fivethirtyeight.com fame points out, despite the bad rap of weather forecasters, their forecasts are very good these days.
Why NoSQL for Time Series Data?
If the idea of time series data sounds intriguing, you might be tempted to use the RDBMS you already have. While relational databases are still very useful, they weren’t really designed for the constant streams that time series data throws at you. With vast amounts of data coming from various sources, you’ll find yourself facing scaling issues. With an RDBMS, as your data sets grow, you’ll either have to upgrade to more powerful servers (scale up) or battle through the resource-intensive options for scaling out by adding nodes. Either way, you will end up spending a lot of money as you gather more data.
NoSQL databases are much more efficient for time series data because their architecture differs from that of relational systems. After all, you don’t really need to join tables if you have a continuous stream. NoSQL databases are designed to scale out on commodity hardware, making them far more cost-effective for continually growing time series data.
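One common way to lay out a continuous stream without joins is to group points into rows keyed by series and time bucket. The sketch below is an assumption for illustration, not any particular product’s API: `ToyStore`, `row_key`, the one-hour bucket size, and the station names are all hypothetical, and a plain dictionary stands in for the database.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

BUCKET_SECONDS = 3600  # assumption: one row per series per hour

def row_key(series_id, ts):
    """Build a sortable key: series id plus the start of the hour bucket."""
    epoch = int(ts.timestamp())
    bucket = epoch - (epoch % BUCKET_SECONDS)
    return f"{series_id}#{bucket:012d}"

class ToyStore:
    """In-memory stand-in for a wide-row NoSQL table (hypothetical)."""
    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, series_id, ts, value):
        # Each write lands in the row for its series and hour; no joins needed.
        self.rows[row_key(series_id, ts)][int(ts.timestamp())] = value

    def scan(self, series_id, start, end):
        # A time range maps to a contiguous range of sorted row keys.
        lo, hi = row_key(series_id, start), row_key(series_id, end)
        first, last = int(start.timestamp()), int(end.timestamp())
        points = []
        for key in sorted(k for k in self.rows if lo <= k <= hi):
            for epoch, value in sorted(self.rows[key].items()):
                if first <= epoch <= last:
                    points.append((epoch, value))
        return points

store = ToyStore()
t0 = datetime(2015, 1, 1, 0, 0, tzinfo=timezone.utc)
for minute in (0, 30, 60, 90):
    store.put("station-7/temp", t0 + timedelta(minutes=minute), 10.0 + minute / 30)
store.put("station-8/temp", t0, 99.0)
window = store.scan("station-7/temp", t0, t0 + timedelta(minutes=45))
```

Because the keys sort by series and then by time, rows for different time ranges can live on different nodes, which is one reason this kind of layout lends itself to scaling out.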
So what can you measure using time series data? Lots of things.
One of the most useful functions is capturing financial data. You can track the price of stocks over years, months, days, or even seconds. This allows you to see the financial condition of a public company and decide whether you want to buy stock or not. It also allows for algorithmic trading.
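As a rough illustration of viewing the same prices at different time scales, the following Python sketch collapses individual price ticks into one open/high/low/close summary per day. The `daily_ohlc` function and the tick values are invented for this example.

```python
from datetime import datetime

def daily_ohlc(ticks):
    """Collapse (timestamp, price) ticks into per-day open/high/low/close."""
    bars = {}
    for ts, price in sorted(ticks):  # process ticks in time order
        day = ts.date()
        if day not in bars:
            # First tick of the day sets all four fields.
            bars[day] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar = bars[day]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # latest tick seen so far
    return bars

# Illustrative ticks, not real market data.
ticks = [
    (datetime(2015, 3, 2, 9, 30), 100.0),
    (datetime(2015, 3, 2, 12, 0), 104.5),
    (datetime(2015, 3, 2, 16, 0), 101.0),
    (datetime(2015, 3, 3, 9, 30), 102.0),
]
bars = daily_ohlc(ticks)
```

The same grouping step works for seconds, months, or years by changing how the bucket (`day` here) is derived from the timestamp.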
Time series data is also useful for the emerging Internet of Things, capturing measurements from sensors. Coming back to weather, the popular website Weather Underground has an extensive network of weather stations operated by volunteers around the world. Weather Underground uses the data from these stations to offer highly accurate local forecasts. The Weather Channel acquired the company primarily for this network to improve its own forecasting.
Another major use of sensors is in smart meters. If your power company hasn’t installed a smart meter yet, chances are you’ll have one soon. These devices measure your power usage and send the data back to the power company immediately, where it can be analyzed to produce usage recommendations and optimize billing. This isn’t just applicable to electricity. With the drought in California and subsequent water rationing, it would be great if customers could see how much water they’re using, along with suggestions on how they could reduce their usage.
Network devices already output time series data. Servers and routers have logs that admins can check to diagnose performance or security issues. Over time, operators can see what kind of loads their networks get and when they need to plan for extra capacity. Or they can analyze the complete set of events over time to see what anomalous behavior is indicative of network intrusion attempts. A centralized time series database can make this process much easier.
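A simple way to picture anomaly detection on such logs is to compare each new measurement with a rolling baseline of recent ones. The Python sketch below is one assumed approach, not a recommendation for production intrusion detection: the `find_anomalies` function, the window of 5, the threshold of 3 standard deviations, and the request counts are all chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev

def find_anomalies(counts, window=5, threshold=3.0):
    """Return indexes whose value is far from the mean of the previous `window` values."""
    recent = deque(maxlen=window)  # rolling baseline of the latest values
    anomalies = []
    for i, count in enumerate(counts):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip flat baselines (sigma == 0) to avoid dividing by zero.
            if sigma > 0 and abs(count - mu) / sigma > threshold:
                anomalies.append(i)
        recent.append(count)
    return anomalies

# Illustrative requests-per-minute counts with one obvious spike.
requests_per_minute = [100, 102, 98, 101, 99, 100, 950, 101, 100]
spikes = find_anomalies(requests_per_minute)
```

Note that the spike itself enters the baseline afterward and inflates it for the next few points; a real system would likely need a more robust baseline.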
Add Apache Hadoop for Analysis
You have lots of choices when it comes to databases, but databases alone tend not to be good for large-scale analysis. This is mostly because databases are more about efficient reads and writes of data, not parallelized computations. This is why you should use Apache Hadoop as part of your time series data environment. When analyzing your entire time series data set, using a technology like Hadoop lets you parallelize the massive effort in a divide-and-conquer approach.
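The divide-and-conquer idea can be sketched in a few lines of plain Python. This is not the Hadoop API: a local thread pool stands in for the cluster, and `split`, `map_partial`, `combine`, and `parallel_summary` are hypothetical names. Each chunk is summarized independently, and the partial results are then merged. The sketch assumes a non-empty input.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def split(values, parts):
    """Divide the data set into up to `parts` contiguous chunks."""
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def map_partial(chunk):
    """Summarize one chunk independently: (sum, count, max)."""
    return (sum(chunk), len(chunk), max(chunk))

def combine(a, b):
    """Merge two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]))

def parallel_summary(values, parts=4):
    chunks = split(values, parts)
    # Threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(map_partial, chunks))
    total, count, peak = reduce(combine, partials)
    return {"mean": total / count, "max": peak}

readings = list(range(1, 101))  # illustrative data
summary = parallel_summary(readings)
```

The key design point is that the mean is carried as a (sum, count) pair rather than as per-chunk means, so partial results can be merged correctly in any order.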
Most databases today support some level of Hadoop integration. But an emerging trend is the tighter integration of NoSQL and Hadoop. This type of integration lets you run your database and Hadoop in the same cluster. Doing so spares you much of the data movement and system maintenance that other setups require, and it saves money and time by avoiding separate copies of data. You can also run a wide variety of sophisticated tools on your data, such as Apache Spark, in that same consolidated cluster to get insights from the data you’ve collected.
Look for the time series data sets that exist in your enterprise and determine how you can take advantage of them. Using a NoSQL database is the right starting point, and you’ll see many NoSQL vendors touting their strengths with managing time series data. But be sure to consider all the large-scale analyses you can run on your time series data; that’s where Hadoop will play an important role. By choosing a consolidated NoSQL/Hadoop deployment, you’ll increase operational efficiency by eliminating cross-cluster data movement and the need to maintain two separate clusters of distinct technologies.