Monitoring a MapR Cluster with Elasticsearch + Kibana

The MapR Converged Data Platform offers a unified API for solving the real, mission-critical data problems that enterprises deal with today. In this blog post, I would like to share another, much less talked about advantage that emerges from this strategy: a MapR cluster can naturally take advantage of the well-regarded Elasticsearch and Kibana stack to give cluster admins a near real-time view of their cluster's health and performance.

The solution we present today is a prototype that can collect metrics directly from the MapR REST interface and, using a simple Python driver, extract metrics of interest and forward them to an Elasticsearch server. From this data, it’s easy to create any number of Kibana dashboards to monitor the cluster, from CPU and network bandwidth to tracking volume size. This time series data is ideally suited for the ES + Kibana stack.

Gathering Metrics

Best practice metrics with maprcli

State, performance, and health metrics for MapR clusters can be collected using the maprcli command line tool from any node of the cluster, as root or as mapr (or any Linux user with cluster login permission). A short Python sketch after the metric listings below shows one way to collect and parse this output.

maprcli dashboard info -json # cluster-wide summary of cluster state, such as volumes, utilization (CPU, memory, disk space), services, and YARN. Noteworthy are:

  • utilization — Provides the following utilization information:

    • CPU — utilization, total, and active. CPU utilization % is calculated as (100% - idle%) on each node and then averaged across all nodes where hoststats is running.

    • Memory — total and active in MB.

    • Disk space — total and active in GB.

    • Compression — compressed and uncompressed data size

maprcli node list -json # node-specific metrics (detailed). Noteworthy are:

  • bytesReceived — Bytes received by the node since the last CLDB heartbeat.
  • bytesSent — Bytes sent by the node since the last CLDB heartbeat.
  • davail — Disk space available on the node.
  • dused — Disk space used on the node.
  • dreadK — Disk Kbytes read since the last heartbeat.
  • dwriteK — Disk Kbytes written since the last heartbeat.
  • utilization — CPU use percentage since the last heartbeat.

maprcli volume list -json # volume-specific metrics (detailed). Noteworthy are:

  • totalused — Total space used for the volume and its snapshots, in MB.
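
If you prefer to stay with maprcli rather than the REST interface, the same output can be collected from a script. The sketch below is one minimal way to do it, assuming the script runs on a cluster node where maprcli is on the PATH; the run_maprcli helper is ours, not part of MapR:

import json
import subprocess

def run_maprcli(args):
    # Run a maprcli command with -json output and parse it into a dict.
    output = subprocess.check_output(["maprcli"] + args + ["-json"])
    return json.loads(output)

dashboard = run_maprcli(["dashboard", "info"])
# The interesting metrics live under the "data" array.
print(dashboard["data"][0]["utilization"])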

REST interface

The MCS (MapR Control System) actually gets all of its information from the REST API. All the information available from maprcli is also available from the https://<Webserver host>:8443/rest endpoint. Since the output is JSON, it is very easy to use as an API for collecting metrics into a central data store like Elasticsearch.

The REST API endpoints for collecting the recommended metrics are listed below (a small dictionary sketch for organizing them in the driver follows):

  • maprcli dashboard info -json → https://<MCS host>:8443/rest/dashboard/info
  • maprcli node list -json → https://<MCS host>:8443/rest/node/list
  • maprcli volume list -json → https://<MCS host>:8443/rest/volume/list
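
One convenient way to organize these in the Python driver is a small dictionary that the collection code can loop over; the host name below is a placeholder and the structure is just illustrative:

# Map each metric group to its REST endpoint (host is a placeholder).
MCS_HOST = "https://<MCS host>:8443"
ENDPOINTS = {
    "dashboard": MCS_HOST + "/rest/dashboard/info",
    "node": MCS_HOST + "/rest/node/list",
    "volume": MCS_HOST + "/rest/volume/list",
}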

Querying the REST endpoints in Python 2.7.9 or newer:

import urllib2, base64
import ssl
import json

# Credentials of a user with cluster login permission (basic authentication).
username = 'mapr'
password = 'mapr'

url = "https://<maprcluster_host>:8443/rest/dashboard/info"
request = urllib2.Request(url)
base64string = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
request.add_header("Authorization", "Basic %s" % base64string)

# Skip SSL certificate verification (requires Python 2.7.9+).
context = ssl._create_unverified_context()
response = urllib2.urlopen(request, context=context)
data = json.load(response)

Notice that we are using basic authentication with a user that has the proper login permissions. From there, it's easy to get the desired metrics and values out of the Python dictionary data.
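
For example, pulling the utilization numbers out of the dashboard response is plain dictionary and list access (the field names follow the JSON sample shown further below):

# "data" is a list with one entry; the utilization block is nested inside it.
utilization = data["data"][0]["utilization"]
cpu_util = utilization["cpu"]["util"]            # CPU utilization in percent
memory_active = utilization["memory"]["active"]  # active memory in MB
timestamp = data["timestamp"]                    # epoch millis, handy for time series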

The reason we specifically need Python 2.7.9 or newer is the use of ssl._create_unverified_context(), which was introduced in that version.

An alternative to using the standard Python library would be to use the excellent Requests package, which we love (NOTE: the same code with requests is an easy one-liner left as an exercise to the reader). The only reason we stick to the standard library is to eliminate the need for an additional dependency.
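
For readers who want a head start on that exercise, a rough sketch of the requests version might look like the following (same placeholder host and credentials as above; verify=False mirrors the unverified SSL context):

import requests

response = requests.get("https://<maprcluster_host>:8443/rest/dashboard/info",
                        auth=("mapr", "mapr"), verify=False)
data = response.json()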

Elasticsearch

Elasticsearch is a great choice as a store for the data collected from the MapR REST interface. It is easy to use, very flexible, and ideally suited for such time series data. Adding in the cluster configuration and making the settings searchable is also a powerful idea: since we're reading the settings directly from MapR, the stored information can be trusted to be up-to-date.

From Python, adding documents to an Elasticsearch cluster is as simple as using the Elasticsearch Python package:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=[{"host": "<host>", "port": <port>}])
dashboard_metrics = getMetrics("https://cluster-node1:8443/rest/dashboard/info")  # helper wrapping the REST call shown above
res = es.index(index="metrics", doc_type="dashboard", body=dashboard_metrics)

Ref: https://elasticsearch-py.readthedocs.org/en/master/api.html#elasticsearch

Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html

The JSON output from the dashboard endpoint looks like this (truncated for brevity):

{
"timestamp":1461661363309,
"timeofday":"2016-04-26 06:02:43.309 GMT+0900",
"status":"OK",
"total":1,
"data":[
    {
        "version":"5.1.0.37549.GA",
        "cluster":{
            "name":"my_prod_cluster",
            "secure":false,
            "ip":"192.168.2.1",
            "id":"123769123634185",
            "nodesUsed":3,
            "totalNodesAllowed":6
        },
        "volumes":{
            "mounted":{
                "total":32,
                "size":57865
            },
            "unmounted":{
                "total":2,
                "size":2
            }
        },
        "utilization":{
            "cpu":{
                "util":97,
                "total":12,
                "active":11
            },
            "memory":{
                "total":47364,
                "active":23072
            },
            "disk_space":{
                "total":148,
                "active":105
            },
            "compression":{
                "compressed":56,
                "uncompressed":89
            }
        },

There are two pieces we care about in this JSON response: the timestamp and the content of the data array.

We need to index our monitoring data into ES as time series data. With such data, it's easy to build some pretty cool and useful customized monitoring dashboards in Kibana with zero configuration effort.

The monitoring data, like the "utilization" field, can be selected programmatically in Python, since the JSON response is just a set of nested dictionaries and lists. Each piece is indexed into ES as its own document, with the timestamp information added in.
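
As a sketch of that idea, the loop below walks the utilization block from the dashboard response, attaches the response timestamp to each piece, and indexes each piece as a separate document (the index and doc_type names are just the ones used in this prototype):

# Index each utilization sub-block (cpu, memory, disk_space, ...) as a time-stamped document.
timestamp = dashboard_metrics["timestamp"]
utilization = dashboard_metrics["data"][0]["utilization"]

for metric_name, values in utilization.items():
    doc = dict(values)            # e.g. {"util": 97, "total": 12, "active": 11}
    doc["timestamp"] = timestamp  # epoch millis from the REST response
    es.index(index="metrics", doc_type=metric_name, body=doc)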

We recommend creating the index first with all the necessary mappings, making sure that the timestamp field is set to the type "date" with the format "epoch_millis".
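
A minimal sketch of doing so with the Python client is shown below; we only map the timestamp field for the cpu documents and let ES infer the rest (the index and doc_type names match the ones used in this post):

# Create the "metrics" index with timestamp mapped as a date in epoch_millis.
mapping = {
    "mappings": {
        "cpu": {
            "properties": {
                "timestamp": {"type": "date", "format": "epoch_millis"}
            }
        }
    }
}
es.indices.create(index="metrics", body=mapping)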

So for CPU load for example, we could add the following document to ES this way:

cpu_doc = {
    "util": 97,
    "total": 12,
    "active": 11,
    "timestamp": 1461661363309
}
res = es.index(index="metrics", doc_type="cpu", body=cpu_doc)

Please refer to Elasticsearch's excellent documentation site for additional information about how to create an index with suitable mappings. Note that for a POC, it's fine not to create a mapping and to let ES handle it for us, as inserting new documents without a mapping creates one automatically. Our own tests show that the data is correctly mapped for everything but the timestamp field, which can be remapped specifically after the fact.

Keeping Elasticsearch Updated

There are several ways to keep the data updated: a cron job, a Linux daemon running as a service, or a streaming tool such as StreamSets.

The easiest way might be to run the task as a cron job with an interval of one to thirty seconds, depending on monitoring needs. This may be suitable for a proof of concept, a small test cluster, or even a production cluster. The main drawback of using cron is that control over execution is limited to running the script, and resources aren't shared: each invocation opens and closes its own connection to Elasticsearch and does the work of calling the REST endpoint from scratch.

One solution to this issue is to use a Python library such as python-daemon, which is the reference implementation of PEP 3143, "Standard daemon process library." Using a daemon is the best way to run a long-lived process such as this one, and it allows all calls to share the connection to ES efficiently and to group logs following standard Linux practices. The documentation for this library is sparse; the best information comes from this blog post.
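
A bare-bones sketch of such a daemon is shown below, assuming a collect_and_index() helper that wraps the REST calls and es.index() calls from earlier (the helper name and poll interval are ours, not part of python-daemon):

import time
import daemon

def collect_and_index():
    # Placeholder: call the REST endpoints and index the results into ES,
    # reusing the Elasticsearch connection created outside this loop.
    pass

def run():
    while True:
        collect_and_index()
        time.sleep(10)  # poll every 10 seconds; tune to your monitoring needs

with daemon.DaemonContext():
    run()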

In either case, we recommend creating a configuration file to hold information about the endpoints to query, the ES connection, and the destination index and doc types. Make sure to output plenty of information to a proper log, as this monitoring driver needs to run all the time. But as the code is dead simple at its core, the overhead is really minimal.
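
The configuration can be as simple as a small JSON file loaded at startup; the fields below are just the ones a prototype driver like ours would need (the file name and field names are illustrative):

# monitor.conf (JSON):
# {
#   "endpoints": ["https://<MCS host>:8443/rest/dashboard/info"],
#   "es_host": "localhost",
#   "es_port": 9200,
#   "index": "metrics",
#   "poll_interval_seconds": 10
# }
import json

with open("monitor.conf") as f:
    config = json.load(f)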

Finally, a relatively new open-source tool called StreamSets can be another way to poll the MapR REST API and update an ES index. It has a very clean GUI that lets you easily connect a data source to a sink such as an ES index, poll an endpoint at a given frequency, and even run a little bit of JavaScript or Python code in the middle for simple data extraction, filtering, or enrichment. In this case, all we need is to run the StreamSets Data Collector (SDC) service instead of a Python driver.

Visualizing with Kibana

Kibana requires no configuration—just a running instance of ES and an index to work with. Our testing showed immediate results after running the data collection a few times. The only thing we needed to do was change the mapping of the timestamp field, as the default mapping wasn’t setting it to the date type with “epoch_millis” format. As soon as we changed the mapping, we could create some nice time series graphs.

Mario Talavera’s post shows this example of a monitoring dashboard using Kibana:

[Image: monitoring dashboard using Kibana]

Abronner's GitHub account shows another example of cluster monitoring with ES/Kibana. The code doesn't include any scripts to help with data collection, though.

Conclusion

Monitoring a production cluster with Elasticsearch and Kibana makes a lot of sense. Admins can create dashboards easily with zero code. Elasticsearch is a fast and scalable solution for keeping time series data, with an easy and well-documented API.

What MapR brings to the table here is a single, centralized REST API with all of the necessary metrics, without needing to install and configure any additional tools whatsoever.

Although we suggest using Python for connecting the MapR metrics source to the ES server, it would be equally possible to use JavaScript, Ruby, or Java, or even a separate tool like StreamSets. Developing the entire prototype code took less than a day.

This combination of tools offers a compelling list of advantages, and we recommend that you take a look at it for your production monitoring needs.
