How to Index MapR-DB Data into Elasticsearch on AWS

As a follow-up to my previous post on indexing MapR-DB data into Elasticsearch, I want to describe how to index MapR-DB table data in near real-time into Elasticsearch on Amazon Web Services (AWS) Elastic Compute Cloud (EC2).

One of the common challenges of deploying a search engine is keeping the search indexes synchronized with the source data. In some cases, a batch process using custom code to periodically index new documents is satisfactory, but in many enterprise environments today, real-time (or near real-time) synchronization is required.

In the 5.0 release of MapR, you can create external search indexes in Elasticsearch on columns, column families, or entire MapR-DB tables and keep those indexes updated in near real-time. That is, when a MapR-DB table gets updated, the new data is replicated asynchronously to Elasticsearch. As shown in this post, this capability requires only a few configuration steps to set up.

Example environment:

  • 1 node EC2 Enterprise Database Edition (with MapR-DB) cluster (version 5.0.0)
  • 1 Elasticsearch node (version 1.4.4)

Granted, the configuration above is not recommended for production use, but it is sufficient to demonstrate the integration of MapR-DB and Elasticsearch.

This post assumes you have an AWS MapR-DB cluster up and running with the HBase client package installed as below.

[root@ip-10-128-228-7 ~]# rpm -qa | grep hbase
mapr-hbase-0.98.9.201503251553-1.noarch

Below is a list of services I have configured in my test MapR-DB cluster:

[root@ip-10-128-228-7 ~]# maprcli node list -columns svc
service                                      hostname                                    ip            
gateway,webserver,cldb,fileserver,hoststats  ip-10-128-228-7.us-west-1.compute.internal  10.128.228.7  
[root@ip-10-128-228-7 ~]#

Elasticsearch installation
There are a few ways to install Elasticsearch. In this post, I’m going to use a tarball to install Elasticsearch.

  • ES_HOME = /opt/elasticsearch-1.4.4
  • ES_NAME = AbizerElasticCluster
  • ES_NODE = ec2-54-219-214-156.us-west-1.compute.amazonaws.com

Note: It’s assumed you have a node with an OS (RH 6.5) installed ready to be dedicated as an Elasticsearch node.

  1. Use the link below to download the Elasticsearch tarball (elasticsearch-1.4.4.tar.gz) and copy it over to the Elasticsearch node under /opt:
    https://www.elastic.co/downloads/past-releases/elasticsearch-1-4-4
  2. Gunzip and untar the tar file you downloaded and copied to the host:
    cd /opt
    gunzip elasticsearch-1.4.4.tar.gz
    tar -xvf elasticsearch-1.4.4.tar
    
  3. Edit /opt/elasticsearch-1.4.4/config/elasticsearch.yml:
    • Find the commented-out line for “cluster.name”
    • Uncomment the line and create your own cluster with the ES_NAME (in my example, “AbizerElasticCluster”).
    • Because the nodes run in AWS and we want to restrict cluster communication, disable multicast discovery and configure an initial Elasticsearch master node that other Elasticsearch nodes (master or data) use for discovery when they are added to the cluster.

    To do so, you will need to modify the settings below in the elasticsearch.yml config file.

    # Unicast discovery allows to explicitly control which nodes will be used
    # to discover the cluster. It can be used when multicast is not present,
    # or to restrict the cluster communication-wise.
    #
    # 1. Disable multicast discovery (enabled by default):
    #
    discovery.zen.ping.multicast.enabled: false
    #
    # 2. Configure an initial list of master nodes in the cluster
    #    to perform discovery when new nodes (master or data) are started:
    #
    discovery.zen.ping.unicast.hosts: ["ec2-54-151-49-244.us-west-1.compute.amazonaws.com"]
    

    Note: Unicast discovery allows you to explicitly control which nodes will be used to discover the cluster. This is mainly used when the MapR cluster is in a different subnet than the Elasticsearch cluster, or when multicast is disabled. You only need to list one node, which acts as a transport node. MapR gateways pass replicated updates from the source MapR cluster to the transport nodes, and the transport nodes are responsible for distributing the updates to the correct nodes in the Elasticsearch cluster.

  4. Run the command below to start Elasticsearch in the background (preferably under a screen session):
    /opt/elasticsearch-1.4.4/bin/elasticsearch -d &

    [root@ip-10-128-160-140 ~]#  /opt/elasticsearch-1.4.4/bin/elasticsearch -d &
    [1] 11621
    [2015-07-22 13:04:14,451][INFO ][node                        ] [Elf With A Gun] version[1.4.4], pid[11621], build[c88f77f/2015-02-19T13:05:36Z]
    [2015-07-22 13:04:14,451][INFO ][node                        ] [Elf With A Gun] initializing ...
    [2015-07-22 13:04:14,456][INFO ][plugins                    ] [Elf With A Gun] loaded [], sites []
    [2015-07-22 13:04:17,167][INFO ][node                       ] [Elf With A Gun] initialized
    [2015-07-22 13:04:17,167][INFO ][node                       ] [Elf With A Gun] starting ...
    [2015-07-22 13:04:17,330][INFO ][transport                 ] [Elf With A Gun] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.128.160.140:9301]}
    [2015-07-22 13:04:17,340][INFO ][discovery                ] [Elf With A Gun] AbizerElasticCluster/6DzITULZSGemaQsYnZfekA
    [2015-07-22 13:04:20,364][INFO ][cluster.service        ] [Elf With A Gun] new_master [Elf With A Gun][6DzITULZSGemaQsYnZfekA][ip-10-128-160-140][inet[/10.128.160.140:9301]], reason: zen-disco-join (elected_as_master)
    [2015-07-22 13:04:20,392][INFO ][http                         ] [Elf With A Gun] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.128.160.140:9201]}
    [2015-07-22 13:04:20,392][INFO ][node                       ] [Elf With A Gun] started
    [2015-07-22 13:04:20,403][INFO ][gateway                  ] [Elf With A Gun] recovered [0] indices into cluster_state
    
  5. You can verify that Elasticsearch is running and has the right cluster name by requesting the root HTTP endpoint of the node, which returns a response like the following:
{
  "status" : 200,
  "name" : "Liz Allan",
  "cluster_name" : "AbizerElasticCluster",
  "version" : {
    "number" : "1.4.4",
    "build_hash" : "c88f77ffc81301dfa9dfd81ca2232f09588bd512",
    "build_timestamp" : "2015-02-19T13:05:36Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.3"
  },
  "tagline" : "You Know, for Search"
}
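
The request that produced the response above is not shown in this post. As a minimal sketch, the check could be scripted as follows, assuming the HTTP port is 9200 (the port used by the curl command later in this post; adjust it if your node binds a different one). The helper names `es_cluster_name` and `check_es_cluster` are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: read the JSON returned by the Elasticsearch root
# endpoint on stdin and print the value of "cluster_name".
es_cluster_name() {
  sed -n 's/.*"cluster_name"[ ]*:[ ]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: succeed only if the node at base URL $1 reports
# the cluster name $2.
check_es_cluster() {
  actual=$(curl -s "$1" | es_cluster_name)
  [ "$actual" = "$2" ]
}

# Example (assumes the HTTP port is 9200):
# check_es_cluster 'http://localhost:9200/' AbizerElasticCluster && echo "cluster name OK"
```

Parsing with sed keeps the sketch free of extra dependencies; it relies on the one-key-per-line layout shown in the response above.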

Register your Elasticsearch cluster with MapR
The next step is to make the MapR cluster aware of the Elasticsearch cluster. This is done with the “register-elasticsearch” script. Run the command below on a MapR cluster node:

/opt/mapr/bin/register-elasticsearch -r ec2-54-219-214-156.us-west-1.compute.amazonaws.com -e  /opt/elasticsearch-1.4.4 -u root -y -f -t
-r    the hostname or IP address of the Elasticsearch node that needs to be registered
-e    the home directory for Elasticsearch
-u    the user who can log in to ES_NODE and read all the files under the ES_HOME directory (the default is the user who runs the register command)
-y    omit interactive prompts
-t    notifies the cluster to use the nodes listed in -r as transport nodes
-f    forces the registration of the Elasticsearch cluster
[root@ip-10-128-228-7 ~]# /opt/mapr/bin/register-elasticsearch -r ec2-54-219-214-156.us-west-1.compute.amazonaws.com -e  /opt/elasticsearch-1.4.4 -u root -y -f -t
Copying ES files from ec2-54-219-214-156.us-west-1.compute.amazonaws.com to /tmp/es_register_root...
Registering ES cluster AbizerElasticCluster on local MapR cluster.
Your ES cluster AbizerElasticCluster has been successfully registered on the local MapR cluster.
[root@ip-10-128-228-7 ~]#

Wait until the command finishes. Once it completes, the Elasticsearch target cluster is registered in your MapR cluster, with messages like those shown above. The register script copies the Elasticsearch cluster’s configuration file (elasticsearch.yml), JAR files, and plugin JAR files into MapR-FS.

We can list the Elasticsearch clusters registered with our MapR cluster as shown below.

[root@ip-10-128-228-7 ~]# /opt/mapr/bin/register-elasticsearch -l
Found 1 items
drwxr-xr-x   - root root          3 2015-07-13 21:47 /opt/external/elasticsearch/clusters/AbizerElasticCluster
[root@ip-10-128-228-7 ~]#
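
If you want to script this check, the listing can be reduced to cluster names. The sketch below assumes the listing format shown above, where the last column is a path ending in the cluster name; the helper names `registered_es_clusters` and `is_es_cluster_registered` are hypothetical.

```shell
# Hypothetical helper: read the output of `register-elasticsearch -l` on
# stdin and print the name of each registered cluster, one per line.
# Assumes the last column is a path of the form .../clusters/<cluster name>.
registered_es_clusters() {
  awk '$NF ~ /\/clusters\// { n = split($NF, parts, "/"); print parts[n] }'
}

# Hypothetical helper: succeed if cluster name $1 appears in the listing on stdin.
is_es_cluster_registered() {
  registered_es_clusters | grep -qxF -- "$1"
}

# Example, run on a MapR cluster node:
# /opt/mapr/bin/register-elasticsearch -l | is_es_cluster_registered AbizerElasticCluster && echo "registered"
```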

Because we chose to use a transport node, the script also created a transport.yml file, which lists the transport node that the gateway connects to in order to replicate updates.

[root@ip-10-128-228-7 ~]# hadoop fs -ls /opt/external/elasticsearch/clusters/AbizerElasticCluster/config
Found 2 items
-rwxr-xr-x   3 root root      13511 2015-07-13 21:47 /opt/external/elasticsearch/clusters/AbizerElasticCluster/config/elasticsearch.yml
-rwxr-xr-x   3 root root         89 2015-07-13 21:47 /opt/external/elasticsearch/clusters/AbizerElasticCluster/config/transport.yml
[root@ip-10-128-228-7 ~]# hadoop fs -cat /opt/external/elasticsearch/clusters/AbizerElasticCluster/config/transport.yml
transport.client.initial_nodes: [ "ec2-54-219-214-156.us-west-1.compute.amazonaws.com" ]
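
To confirm that the registered transport node matches your ES_NODE, the host list can be pulled out of transport.yml. This is a minimal sketch that assumes the single-line list format shown above; the helper name `transport_nodes` is hypothetical.

```shell
# Hypothetical helper: read transport.yml on stdin and print each host in
# transport.client.initial_nodes, one per line.
transport_nodes() {
  sed -n 's/^transport\.client\.initial_nodes:[ ]*//p' |
    tr ',' '\n' |
    sed 's/[^A-Za-z0-9.:-]//g'
}

# Example, run on a MapR cluster node:
# hadoop fs -cat /opt/external/elasticsearch/clusters/AbizerElasticCluster/config/transport.yml | transport_nodes
```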

Note: My previous post describes steps to follow when setting up replication from MapR-DB to Elasticsearch using a node client instead of a transport client.

Create a source table
We can use the “loadtest” tool to load sample data into our source table:

[root@ip-10-128-228-7 ~]#  /opt/mapr/server/tools/loadtest  -table /srctable  -numrows  20
Setting continous mode
2015-07-13  21:57:34, 5012 Program:  loadtest on Host:  NULL IP: 0.0.0.0, Port: 0, PID: 0
21:57:35  0 secs      20  rows     20 rows/s  1ms latency    1ms maxlatency
Overall Rate 6666.67  rows/s Latency  1ms
[root@ip-10-128-228-7  ~]#

The command above creates table “/srctable” and inserts 20 rows into the table for our demonstration.

Source Table Verification
Verify the table indeed has 20 rows (see totalrows) with the following command:
maprcli table info -path /srctable -json

[root@ip-10-128-228-7 ~]# maprcli table info -path /srctable -json
{
	"timestamp":1437585394398,
	"timeofday":"2015-07-22 01:16:34.398 GMT-0400",
	"status":"OK",
	"total":1,
	"data":[
		{
			"path":"/srctable",
			"numregions":1,
			"totallogicalsize":90112,
			"totalphysicalsize":65536,
			"totalcopypendingsize":0,
			"totalrows":20,
			"totalnumberofspills":1,
			"totalnumberofsegments":1,
			"autosplit":true,
			"bulkload":false,
			"regionsizemb":4096,
			"audit":false,
			"maxvalueszinmemindex":100,
			"deletettl":86400,
			"adminaccessperm":"u:root",
			"createrenamefamilyperm":"u:root",
			"bulkloadperm":"u:root",
			"packperm":"u:root",
			"deletefamilyperm":"u:root",
			"replperm":"u:root",
			"splitmergeperm":"u:root",
			"defaultappendperm":"u:root",
			"defaultcompressionperm":"u:root",
			"defaultmemoryperm":"u:root",
			"defaultreadperm":"u:root",
			"defaultversionperm":"u:root",
			"defaultwriteperm":"u:root",
			"uuid":"c49ec08e-87ba-f7c2-95d5-078e6ca45500"
		}
	]
}
[root@ip-10-128-228-7 ~]#
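
Rather than reading the JSON by eye, the row count can be extracted directly. The sketch below assumes the one-key-per-line JSON layout shown above; the helper name `table_totalrows` is hypothetical.

```shell
# Hypothetical helper: read `maprcli table info -json` output on stdin and
# print the value of "totalrows".
table_totalrows() {
  sed -n 's/.*"totalrows"[ ]*:[ ]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example, run on a MapR cluster node; 20 is the row count loaded by loadtest:
# rows=$(maprcli table info -path /srctable -json | table_totalrows)
# [ "$rows" = "20" ] && echo "row count OK"
```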

Set up replication from the MapR table to Elasticsearch
To map a MapR-DB source table to an Elasticsearch type (a type is a class of similar documents in Elasticsearch), we run the following command:

maprcli table replica elasticsearch autosetup -path /srctable -target AbizerElasticCluster -index srcdocument -type json

-path      the source table path
-target    the target Elasticsearch cluster name
-index     the name of the index you want to use in Elasticsearch. In the RDBMS world, this can be thought of as a database.
-type      the name of the type you want to use within the Elasticsearch index. In the RDBMS world, this can be thought of as a table.

This command registers the destination Elasticsearch type as a replica of the source table, copies the content of the source table into the Elasticsearch cluster by running CopyTable in the background, and then starts the replication stream to keep the Elasticsearch indexes up to date. Updates to the source table are replicated in near real-time by the replication stream. Replication of data to Elasticsearch indexes is asynchronous.

Once the command above finishes successfully (it might take a while for huge tables), the Elasticsearch replicas for the source table can be listed from the MapR cluster as shown below.

[root@ip-10-128-228-7 ~]# maprcli table replica elasticsearch list -path /srctable -json
{
	"timestamp":1437585476929,
	"timeofday":"2015-07-22 01:17:56.929 GMT-0400",
	"status":"OK",
	"total":2,
	"data":[
		{
			"cluster":"Elastic1",
			"target":"AbizerElasticCluster",
			"index":"srcdocument",
			"type":"json",
			"paused":false,
			"throttle":false,
			"idx":3,
			"networkencryption":false,
			"networkcompression":"lz4",
			"isUptodate":true,
			"minPendingTS":0,
			"maxPendingTS":0,
			"bytesPending":0,
			"putsPending":0,
			"bucketsPending":0,
			"uuid":"77249b4a-701d-ae8e-4097-04bdc1a45500"
		}
	]
}
[root@ip-10-128-228-7 ~]#
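
Because replication is asynchronous, a script may want to wait until the replica has caught up before querying Elasticsearch. The sketch below polls the replica listing and assumes that `isUptodate` means what its name suggests; the helper names `replica_field` and `wait_for_replica`, the retry count, and the 5-second pause are all hypothetical choices for illustration.

```shell
# Hypothetical helper: read `maprcli table replica elasticsearch list -json`
# output on stdin and print the value of field $1 (first match only).
# Assumes the one-key-per-line JSON layout shown above.
replica_field() {
  sed -n "s/.*\"$1\"[ ]*:[ ]*//p" | tr -d '",' | head -n 1
}

# Hypothetical helper: poll the replica list for table $1 until isUptodate
# is true, giving up after $2 attempts.
wait_for_replica() {
  tries=0
  while [ "$tries" -lt "$2" ]; do
    state=$(maprcli table replica elasticsearch list -path "$1" -json | replica_field isUptodate)
    [ "$state" = "true" ] && return 0
    tries=$((tries + 1))
    sleep 5
  done
  return 1
}

# Example, run on a MapR cluster node:
# wait_for_replica /srctable 12 && echo "replica is up to date"
```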

Verify Data in Elasticsearch
There are a few ways to verify that the data made it to the Elasticsearch cluster. The easiest is to use curl commands, as shown below:

[root@ip-10-128-160-140 ~]# curl -XGET 'http://localhost:9200/_cat/indices'
yellow open srcdocument 5 1 20 0 80.7kb 80.7kb 
[root@ip-10-128-160-140 ~]# 

These values refer to:

  • Health of my index (green/yellow/red) - yellow (expected here, because with a single Elasticsearch node the replica shards cannot be assigned)
  • Status (open/close) - open
  • Name of the index - srcdocument
  • Shard count - 5
  • Number of Replicas - 1
  • Document count - 20
  • Deleted document count - 0
  • Store size - 80.7kb
  • Primary store size - 80.7kb

Since the document count in the Elasticsearch cluster is 20, which matches the 20 rows in the MapR-DB table, we have verified that the data made it into the Elasticsearch cluster.
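
The same comparison can be scripted. The sketch below assumes the `_cat/indices` column order shown above (health, status, index, shards, replicas, document count, and so on), which may differ in other Elasticsearch versions; the helper names `index_doc_count` and `index_has_docs` are hypothetical.

```shell
# Hypothetical helper: read `_cat/indices` output on stdin and print the
# document count (6th column) for the index named $1 (3rd column).
index_doc_count() {
  awk -v idx="$1" '$3 == idx { print $6 }'
}

# Hypothetical helper: succeed if index $2 at base URL $1 holds exactly
# $3 documents.
index_has_docs() {
  count=$(curl -s "$1/_cat/indices" | index_doc_count "$2")
  [ "$count" = "$3" ]
}

# Example, using the URL from the curl command above:
# index_has_docs 'http://localhost:9200' srcdocument 20 && echo "document count matches"
```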

In this blog post, you’ve learned how to index MapR-DB data into Elasticsearch on AWS. If you have any further questions, please ask them in the comments section below.
