Why do this?
There are many use cases for time series data, and most of them require handling a substantial ingest rate. Rates above 10,000 points per second are common, and rates of 1 million points per second, while less common, are by no means outrageous.
Especially at the higher data rates, many of these systems are used to monitor critical infrastructure where failure of the monitoring systems can lead directly to disastrous failures of the actual system being monitored.
When you care about whether your time series database will actually work at full load, it is only reasonable to test it at that load. To test at full load, however, you first have to fill the database with, typically, a year or three of data. This is where you suddenly find yourself with a much higher data rate requirement: it makes no sense to test a system with 3 years of data if it takes three years to fill the database in preparation for the test.
Thus, if you have a production data rate of 100,000 points per second, you will need to ingest test data at a rate of 100 million points per second if you want to load 3 years of data in less than 2 days. Even a production rate of 10,000 points per second will require several hours to load test data.
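To make that arithmetic concrete, here is the back-of-the-envelope calculation behind those numbers, sketched in Python using the rates quoted above:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000

production_rate = 100_000            # points per second in production
years_of_history = 3
total_points = production_rate * years_of_history * SECONDS_PER_YEAR

bulk_rate = 100_000_000              # points per second during bulk load
load_seconds = total_points / bulk_rate

print(load_seconds / 3600)           # roughly 26 hours, just over a day
```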
Similarly, if you are starting a new database, you probably will want to load significant amounts of historical data. That could be just as stressful as loading the database for testing.
How did we do this?
There are three major components that should be considered:
- Loading one point at a time is very slow; batching must be implemented in order to accomplish faster ingestion rates.
- The data store needs to be highly performant in order to consistently handle such massive amounts of data.
- Well-configured hardware is critical to achieving the highest performance.
First, OpenTSDB was used for this test because it is a well documented time series database and it uses the HBase API for its data store. OpenTSDB also uses the non-relational nature of the HBase API to strong advantage by storing data in a hybrid wide data schema. In our tests, we used one year of data. Sampling one sensor every second for one year generates about 31.5 million points. In OpenTSDB’s storage format, this translates to about 120MB of data per sensor year of data.
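The wide-row idea can be sketched as follows. This is a simplified, hypothetical illustration of OpenTSDB-style row keys: the UIDs below are made up (real OpenTSDB assigns them from its UID table), and the exact byte layout may differ between versions.

```python
import struct

# Hypothetical 3-byte UIDs; real OpenTSDB resolves these via its tsdb-uid table.
METRIC_UID = b"\x00\x00\x01"   # e.g. "sensor.temperature"
TAGK_UID   = b"\x00\x00\x01"   # e.g. tag key "sensor_id"
TAGV_UID   = b"\x00\x00\x2a"   # e.g. tag value "42"

def row_key(epoch_seconds: int) -> bytes:
    """Build a row key in the OpenTSDB style: metric UID, then the
    timestamp rounded down to the hour, then tag key/value UIDs.
    All points within one hour share a row and sort together on disk."""
    base = epoch_seconds - (epoch_seconds % 3600)   # align to the hour
    return METRIC_UID + struct.pack(">I", base) + TAGK_UID + TAGV_UID

t0 = 388_888 * 3600                          # an hour-aligned epoch time
assert row_key(t0) == row_key(t0 + 3599)     # same hour -> same row
assert row_key(t0) != row_key(t0 + 3600)     # next hour -> new row
```

Because all keys for one metric and hour are identical, every sample in that hour lands in the same row, which is what makes the hourly blobs described below possible.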
Second, MapR-DB was used for this test instead of HBase itself. MapR-DB offers many benefits over HBase, while maintaining the virtues of the HBase API and the idea of data being sorted according to primary key. MapR-DB provides operational benefits such as no compaction delays and automated region splits that do not impact the performance of the database. The tables in MapR-DB can also be isolated to certain machines in a cluster by utilizing the topology feature of MapR. This allowed the testing to easily be scaled to measure the performance on any number of nodes in the cluster. The final differentiator is that MapR-DB is just plain fast, due primarily to the fact that it is tightly integrated into the MapR file system itself, rather than being layered on top of a distributed file system that is layered on top of a conventional file system.
Third, the hardware that was used was fast. The test cluster was designed by PSSC Labs specifically for use with Hadoop systems. The CloudOOP 12000 platform offers a unique, highly optimized system design with a direct data path for disk I/O for up to 12 SATA/SAS devices in a 1U rackmount chassis. This hardware deliberately omits a RAID controller and backplane, since Hadoop handles data redundancy itself and those components would sit unused. Here are the specifications for this test setup:
- Dual Intel E5-2650 v2 Processor (16 total physical cores, 32 hyper-threads)
- 128GB DDR3 1600
- 1 - 120GB SSD for the operating system
- 11 - Western Digital 1TB 7200 RPM SATA drives (6Gb/s)
- Solarflare SFN5152 10GigE network adapter
- CentOS 6.x
- Estimated power draw: 240 watt idle / 325 watt 100% load
How were these numbers achieved?
Code was created by modifying portions of OpenTSDB to allow bulk importing of data. This code has been published in the MapR App Gallery and is also available on GitHub at https://github.com/mapr-demos/opentsdb. Fundamentally, OpenTSDB stores all points for a given hour in a single row in MapR-DB. During normal operations, OpenTSDB inserts each point into that row as a separate column. Once an hour, the entire row is read, the columns are collected into a blob, and the row is written back to the database, which amounts to a read and a write per point. Our bulk loading code instead builds the blob for an entire hour of data (3,600 points in our test) directly and inserts it into the database with a single put. That single put takes the place of over 7,000 database operations per sensor-hour of data.
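A simplified sketch of that blob construction looks like the following. This assumes integer values and glosses over OpenTSDB's exact flag encoding; it is meant to show the shape of the technique, not the precise on-disk format.

```python
import struct

def pack_hour(points):
    """Pack one hour of (offset_seconds, int_value) samples into a single
    blob, in the spirit of OpenTSDB's compacted columns: concatenated
    2-byte qualifiers followed by the concatenated 8-byte values.
    Simplified sketch -- real OpenTSDB packs value width and float
    flags into the low qualifier bits."""
    quals = b"".join(struct.pack(">H", off << 4 | 0x7) for off, _ in points)
    vals = b"".join(struct.pack(">q", v) for _, v in points)
    return quals + vals

hour = [(s, s * 10) for s in range(3600)]   # 3,600 one-second samples
blob = pack_hour(hour)
# A single put of this blob replaces thousands of per-point
# read/modify/write operations against the row.
```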
We used two edge nodes, both running code to generate and load test data. While these edge nodes were generating data, we observed near 100% CPU utilization on the edge nodes. These edge nodes were not quite hitting their outbound bandwidth limit, as they couldn’t generate data any faster than the CPU bottleneck would allow. To achieve maximum performance, we used eight separate Java virtual machines, each running two threads actively generating the data. Other configurations with more or fewer JVMs and more or fewer threads per JVM gave equal or inferior results.
We limited the output data to only four nodes of a ten-node cluster. Our original plan was to start with four nodes and add nodes until we achieved our goal of more than 100 million points per second. The test generated random integers emulating 128 sensors sampled at one-second intervals for one year. This yielded over four billion points totaling about 15GB of on-disk storage. While this data size is less than the memory available on the test nodes, all writes to the database were fully persisted to disk during the test. The ingest rate was approximately 110 million points per second, and the data load completed in about 36 seconds. The MapR-DB table was left at the default 4GB region size, which meant multiple region splits occurred during the tests. Data replication for the table was set to three, meaning every record was copied from the first server to two other servers.
At peak loading, we observed the two edge nodes together pushing approximately 1.5GB/s of data to the cluster nodes. Inbound data volumes on individual cluster nodes varied, with peaks near 1.3GB/s. Impressively, the 7200 RPM disks kept up with the network and persisted data with no issues. Data generation was limited by CPU capacity on the edge nodes; ingest volume was limited by network bandwidth on the cluster side. It was clear that our load did not drive the disk subsystems to capacity. Adding more cluster nodes would distribute the network load within the cluster, but we expect the edge node count to be the primary limit on ingestion in the current test configuration.
Could it get any faster?
The storage format of OpenTSDB is reasonable, but it could be improved significantly. For instance, time values are stored in the column name for the blob of data. Using a single column name and a compressed binary representation of the data would allow substantial compression. One example is the times themselves: with delta coding, the total storage for each time in a blob could be reduced to less than 1% of its current size. The flags that indicate the time resolution of the timestamps and whether the data is floating point could be stored just once per blob instead of once per point. And since sensor data typically has limited resolution, dictionary encodings would likely yield up to 10x compression.
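The delta-coding idea can be illustrated with a short sketch. This is not OpenTSDB code; varint encoding of the gaps is just one common way to realize it.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7 bits per byte; small numbers
    take a single byte."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def delta_encode(timestamps):
    """Store the first timestamp in full, then only the gaps between
    samples. Regular one-second sampling makes every gap 1, i.e. one
    byte each instead of a full timestamp per point."""
    out = bytearray(encode_varint(timestamps[0]))
    for prev, cur in zip(timestamps, timestamps[1:]):
        out += encode_varint(cur - prev)
    return bytes(out)

times = list(range(1_400_000_000, 1_400_000_000 + 3600))  # one hour at 1 Hz
encoded = delta_encode(times)
# 3,600 absolute 8-byte timestamps would take 28,800 bytes;
# delta coding shrinks this hour to about 3.6KB of mostly 1-byte gaps.
```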
In terms of hardware, the networking bandwidth could be increased, perhaps utilizing dual 10GbE NICs or even higher-end networking gear. The MapR distribution supports transparent application level multiplexing of multiple network interfaces, allowing over 2GB/s of data to be transferred over the network. There is also a possibility that performance would improve with faster disks. These changes would increase the cost of the servers, so it would be worthwhile to understand the cost-benefit of the actual gains.
Performance could also be trivially increased by scaling the system out so that the main data table is spread across more than four nodes.
Trying to reproduce this? Have results you would like to share? Add a comment below.