Big data requires a big vision. This was one of the primary reasons that Warren Sharp was asked to join National Oilwell Varco (NOV) a little over six months ago. NOV is a worldwide leader in the design, manufacture and sale of equipment and components used in oil and gas drilling and production operations and the provision of oilfield services to the upstream oil and gas industry.
Sharp, whose title is Big Data Engineer in NOV’s Corporate Engineering and Technology Group, honed his Big Data analytic skills with a previous employer – a leading waste management company that was collecting information about driver behavior by analyzing GPS data for 15,000 trucks around the country.
The goals are more complicated and challenging at NOV. Says Sharp, “We are creating a data platform for time-series data from sensors and control systems to support the deep analytics and machine learning. This platform will efficiently ingest and store all time-series data from any source within the organization and make it widely available to tools that talk Hadoop or SQL. The first business use case is to support Condition-Based Maintenance efforts by making years of equipment sensor information available to all machine learning applications from a single source.”
For Sharp, using the MapR data platform was a given: he was already familiar with its features and capabilities. Coincidentally, his boss-to-be at NOV had come to the same conclusion six months earlier and made MapR a part of the company's infrastructure. “Learning that MapR was part of the infrastructure was one of the reasons I took the job,” comments Sharp. “I realized we had compatible ideas about how to solve Big Data problems.”
“MapR is relatively easy to install and setup, and the POSIX-compliant NFS-enabled clustered file system makes loading data onto MapR very easy,” Sharp adds. “It is the quickest way to get started with Hadoop and the most flexible in terms of using ecosystem tools. The next step was to figure out which tools in the Hadoop ecosystem to include to create a viable solution.”
The initial goal was to load large volumes of data into OpenTSDB, a time-series database. However, Sharp realized that other SQL-based tools in the Hadoop ecosystem could not easily query the native OpenTSDB data table, so he designed a partitioned Hive table to store all ingested data as well. This hybrid storage approach made it possible to trade off storage size against query time, and has yielded some interesting results. For example, Hive allowed data to be accessed by common tools such as Spark and Drill for analytics with query times in minutes, whereas OpenTSDB offered near-instantaneous visualization of months and years of data. The ultimate solution, says Sharp, was to ingest data into a canonical partitioned Hive table for use by Spark and Drill, and to use Hive to generate files for the OpenTSDB import process.
Storage presented another problem. “Hundreds of billions of data points uses a lot of storage space,” he notes. “Storage space is less expensive now than it’s ever been, but the physical size of the data also affects read times of the data while querying. Understanding the typical read patterns of the data allows us to lay down the data in MapR in a way to maximize the read performance. Moreover, partitioning data by its source and date leads to compact daily files.”
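To make the idea of partitioning by source and date concrete, here is a minimal Python sketch. It is not NOV's implementation; the `source=.../dt=...` key layout follows the common Hive partition naming convention, and the source and metric names (`rig_a`, `pump_pressure`) are hypothetical. It simply groups raw readings so that each source and UTC day ends up in its own compact, time-ordered bucket.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(source, epoch_seconds):
    """Build a Hive-style partition key from a source name and a UTC timestamp."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"source={source}/dt={day}"


def group_by_partition(points):
    """Group (source, epoch_seconds, metric, value) tuples into daily buckets.

    Rows inside each bucket are sorted by timestamp, so that a reader
    scanning one day of one source sees the data in time order.
    """
    partitions = defaultdict(list)
    for source, ts, metric, value in points:
        partitions[partition_key(source, ts)].append((ts, metric, value))
    for rows in partitions.values():
        rows.sort()
    return dict(partitions)


if __name__ == "__main__":
    # Hypothetical readings: two sources, spanning a UTC midnight boundary.
    sample = [
        ("rig_a", 1447632001, "pump_pressure", 101.5),
        ("rig_a", 1447632000, "pump_pressure", 101.2),
        ("rig_a", 1447631999, "pump_pressure", 100.9),
        ("rig_b", 1447632000, "pump_pressure", 98.7),
    ]
    for key, rows in sorted(group_by_partition(sample).items()):
        print(key, len(rows))
```

A query that asks for one source over a date range then only has to touch the buckets whose keys match, which is the read-pattern benefit Sharp describes.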
Sharp found that both the ORC (Optimized Row Columnar) file format and Spark were essential tools for handling time-series data and running analytic queries over larger time ranges.
As a result of his efforts, he has created a very compact, lossless storage mechanism for sensor data. Each terabyte of storage has the capacity to store 750 billion to 5 trillion data points. This is equivalent to 20,000 – 150,000 sensor-years of 1 Hz data and will allow NOV to store all sensor data on a single MapR cluster.
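The conversion from data points to sensor-years can be checked with a few lines of arithmetic. The sketch below assumes a decimal terabyte (10^12 bytes) and a 365-day year, neither of which the article specifies. Under those assumptions, 750 billion and 5 trillion points at 1 Hz work out to roughly 24,000 and 159,000 sensor-years, in line with the rounded range quoted above, and the implied storage cost is roughly 1.3 down to 0.2 bytes per data point.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (assumes a 365-day year)
TERABYTE = 10**12                      # assumption: decimal terabyte, in bytes


def sensor_years(points, hz=1.0):
    """Years of continuous data from one sensor sampling at `hz` samples per second."""
    return points / (hz * SECONDS_PER_YEAR)


def bytes_per_point(points_per_terabyte):
    """Average on-disk bytes per data point implied by a points-per-terabyte figure."""
    return TERABYTE / points_per_terabyte


if __name__ == "__main__":
    # The two ends of the capacity range quoted in the article.
    for points in (750e9, 5e12):
        print(
            f"{points:.3g} points/TB: "
            f"{sensor_years(points):,.0f} sensor-years at 1 Hz, "
            f"{bytes_per_point(points):.2f} bytes/point"
        )
```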
“Our organization now has platform data capabilities to enable Condition-Based Maintenance,” Sharp says. “All sensor data are accessible by any authorized user or application at any time for analytics, machine learning, and visualization with Hive, Spark, OpenTSDB and other vendor software. The Data Science and Product teams have all the tools and data necessary to build, test, and deliver complicated CBM models and applications.”
When asked what advice he might have for other potential Big Data All Stars, Sharp comments, “Have a big vision. Use cases are great to get started, vision is critical to creating a sustainable platform.” He adds, “Learn as much of the ecosystem as you can, what each tool does and how it can be applied. End-to-end solutions won’t come from a single tool or implementation, but rather by assembling the use of a broad range of available Big Data tools to create solutions.”
Originally published in Datanami:
Making Big Data Work for a Major Oil & Gas Equipment Manufacturer
November 16, 2015