Getting started with Druid and MapR Streams

Druid is a high-performance, column-oriented, distributed data store. Druid supports streaming data ingestion and offers insights on events immediately after they occur. Druid can ingest data from multiple data sources, including Apache Kafka.

This article guides you through the steps to configure Druid to ingest data from MapR Streams. MapR Streams is a distributed messaging system for streaming event data at scale; it is integrated into the MapR Converged Data Platform and is based on the Apache Kafka API (0.9.0).

Prerequisites

In this article, we will deploy the Imply Analytics Platform on one of the MapR cluster nodes. If you want to deploy Imply on another node (an edge or application node), you must install and configure the MapR Client.

Create the MapR Streams and Topic

A stream is a collection of topics that you can manage as a group by:

  1. Setting security policies that apply to all topics in that stream
  2. Setting a default number of partitions for each new topic that is created in the stream
  3. Setting a time-to-live for messages in every topic in the stream

You can find more information about MapR Streams concepts in the documentation.
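
For illustration, these three settings map directly to options of the maprcli stream create command. The following is a minimal sketch; the /apps/example path and the values are hypothetical, so adapt them to your environment:

	# Hypothetical stream illustrating the three settings above:
	# default permissions, default partition count, and message TTL (in seconds).
	maprcli stream create -path /apps/example \
	  -produceperm p -consumeperm p -topicperm p \
	  -defaultpartitions 3 \
	  -ttl 86400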

On your MapR cluster or Sandbox, run the following commands:

	maprcli stream create -path /apps/druid -produceperm p -consumeperm p -topicperm p
	maprcli stream topic create -path /apps/druid -topic wikiticker -partitions 3

See the documentation at http://maprdocs.mapr.com/51/MapR_Streams/security.html for more information about the security settings.

Adapt the number of partitions in the topic to your deployment.
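
To confirm that the stream and topic were created as expected, you can inspect them with the standard maprcli subcommands (the exact output columns vary by MapR version):

	# Show the stream's settings (permissions, default partitions, TTL, ...).
	maprcli stream info -path /apps/druid
	# List the topics in the stream; "wikiticker" should appear.
	maprcli stream topic list -path /apps/druid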

Install and use MapR Kafka utilities

Install the mapr-kafka package on your cluster:

	yum install mapr-kafka

Open another terminal window and run the Kafka console consumer utility using the following command (the matching producer utility will be used later, when we send sample data):

Consumer

	/opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-consumer.sh --topic /apps/druid:wikiticker --new-consumer --bootstrap-server this.will.be.ignored:9092 

The consumer window will be used to follow the messages sent to the /apps/druid:wikiticker topic. Note that with MapR Streams, the topic name given to the kafka-console-consumer.sh program includes the full path of the stream as well as the name of the topic within that stream.
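
Before moving on, you can optionally check the plumbing end to end by producing a single test message from a third terminal. As with the consumer, the broker list is ignored by MapR Streams; the JSON payload below is just an arbitrary example:

	# Send one test message; it should appear in the consumer window.
	echo '{"hello": "druid"}' | /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-producer.sh \
	  --broker-list this.will.be.ignored:9092 --topic /apps/druid:wikiticker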

Install Imply

First, in your favorite terminal, download and unpack the release archive.

	curl -O https://static.imply.io/release/imply-1.3.0.tar.gz
	tar -xzf imply-1.3.0.tar.gz

Add MapR Streams Dependencies

Druid, and the Imply Analytics Platform, are packaged with the Apache Kafka client library. To be able to use MapR Streams, you need to replace the Kafka library with the MapR Streams ones:

	cd ./imply-1.3.0/dist/druid/extensions/druid-kafka-indexing-service/
	mkdir backup_kafka_lib
	mv kafka-clients-0.9.0.1.jar ./backup_kafka_lib/

	cp /opt/mapr/kafka/kafka-0.9.0/libs/kafka-clients-0.9.0.0-mapr-1607.jar .
	cp /opt/mapr/lib/hadoop-common-2.7.0.jar .
	cp /opt/mapr/lib/mapr-streams-5.2.0-mapr.jar .
	cp /opt/mapr/lib/maprdb-5.2.0-mapr.jar .
	cp /opt/mapr/lib/maprfs-5.2.0-mapr.jar .
	cp /opt/mapr/lib/ojai-1.1.jar .

Adapt the version numbers to your MapR version. Note that you can also use symlinks instead of copies.
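
If you would rather keep a single copy of the MapR jars on disk, here is a sketch of the symlink variant. It assumes the same file names and versions as the copies above; adjust both to your installation:

	# Symlink alternative to copying the jars.
	cd ./imply-1.3.0/dist/druid/extensions/druid-kafka-indexing-service/
	ln -s /opt/mapr/kafka/kafka-0.9.0/libs/kafka-clients-0.9.0.0-mapr-1607.jar .
	for jar in hadoop-common-2.7.0 mapr-streams-5.2.0-mapr maprdb-5.2.0-mapr maprfs-5.2.0-mapr ojai-1.1; do
	  ln -s "/opt/mapr/lib/${jar}.jar" .
	done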

Your druid-kafka-indexing-service folder now looks like this:

.
|-- backup_kafka_lib
|   `-- kafka-clients-0.9.0.1.jar
|-- druid-kafka-indexing-service-0.9.1.1.jar
|-- hadoop-common-2.7.0.jar
|-- kafka-clients-0.9.0.0-mapr-1607.jar
|-- lz4-1.3.0.jar -> ../.././lib/lz4-1.3.0.jar
|-- maprdb-5.2.0-mapr.jar
|-- maprfs-5.2.0-mapr.jar
|-- mapr-streams-5.2.0-mapr.jar
|-- ojai-1.1.jar
|-- slf4j-api-1.7.6.jar
`-- snappy-java-1.1.1.7.jar

Start Imply

Next, you'll need to start up Imply, which includes Druid, Pivot, and ZooKeeper. You can use the included supervise program to start everything with a single command:

	cd ./imply-1.3.0/
	bin/supervise -c conf/supervise/quickstart.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
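
For example, to follow the logs as the services come up (the current file name is an assumption based on the typical supervise layout; check what ls actually shows on your system):

	ls var/sv/                  # one directory per service
	tail -f var/sv/*/current    # follow all service logs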

Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, simply remove the var/ directory.

Enable Imply Kafka ingestion

We will use Druid's Kafka indexing service to ingest messages from our newly created wikiticker topic.

Edit the Supervisor Specification

Open the supervisor specification file with your favorite editor, from the directory where you installed Imply:

	vi quickstart/wikiticker-kafka-supervisor.json

and change the ioConfig element, located at the end of the file, to match the following entry:

  "ioConfig": {
    "topic": “/apps/druid:wikiticker",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092”,
      "streams.consumer.default.stream" : "/apps/druid"
    }

  • The topic element contains the fully qualified name of the topic that will be used internally by the Druid supervisor to start new indexing tasks. This includes the full path of the stream containing the topic as well as the topic name.
  • The bootstrap.servers element is ignored by MapR Streams; it can even be removed from the file.
  • The streams.consumer.default.stream element sets the default stream used by the consumer, the indexing service in this case. It allows the consumer, and therefore Druid, to call high-level methods such as listTopics().
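
After editing, it is worth checking that the file is still valid JSON before submitting it. One quick way, assuming Python is available on the node:

	# Prints the parsed JSON, or an error message with the offending line.
	python -m json.tool quickstart/wikiticker-kafka-supervisor.json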

Start the Indexing Service

To start the service, we need to submit a supervisor spec to the Druid overlord by running the following command in a new terminal window:

	cd ./imply-1.3.0/
	curl -XPOST -H'Content-Type: application/json' -d @quickstart/wikiticker-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor

If the supervisor was successfully created, you will get a response containing the ID of the supervisor; in our case we should see {"id":"wikiticker-kafka"}.

For more details about what's going on here, check out the Druid Kafka indexing service documentation.
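
You can also verify that the supervisor is running by querying the overlord API directly (these endpoints are documented for the Druid Kafka indexing service; the port assumes the quickstart configuration):

	# List the IDs of running supervisors; "wikiticker-kafka" should appear.
	curl http://localhost:8090/druid/indexer/v1/supervisor
	# Retrieve the spec of our supervisor as the overlord sees it.
	curl http://localhost:8090/druid/indexer/v1/supervisor/wikiticker-kafka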

Send example data

Let's launch a console producer for our topic and send some data!

Run the following command, where {PATH_TO_IMPLY} is replaced by the path to the Imply directory:

	export KAFKA_OPTS="-Dfile.encoding=UTF-8"
	/opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic /apps/druid:wikiticker < {PATH_TO_IMPLY}/quickstart/wikiticker-2016-06-27-sampled.json

You should see the messages appear in the consumer terminal window that you started at the beginning of this how-to.

Querying your data

After sending data, you can immediately query it using any of the supported query methods. The simplest starting point is to check it out in Pivot at http://localhost:9095. The name of this dataset in Pivot is 'Wikiticker Kafka Tutorial'.

The easiest way is to run the following plyql command from the Imply root directory:

	bin/plyql -h 'localhost:8082' -q 'select count(*) from wikiticker-kafka'

with the following result:

┌──────────┐
│ count(*) │
├──────────┤
│ 24433    │
└──────────┘
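
If you prefer to query Druid directly over HTTP, the same count can be expressed as a native timeseries query against the broker. This sketch assumes the quickstart broker port (8082) and the 2016-06-27 interval of the sample data:

	curl -XPOST -H'Content-Type: application/json' 'http://localhost:8082/druid/v2/?pretty' -d '{
	  "queryType": "timeseries",
	  "dataSource": "wikiticker-kafka",
	  "granularity": "all",
	  "intervals": ["2016-06-27/2016-06-28"],
	  "aggregations": [{ "type": "count", "name": "count" }]
	}'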

Conclusion

In this article, you have learned how to use the Imply Analytics Platform with Druid and MapR Streams. The key steps are installing the MapR Streams libraries in place of the Kafka ones and updating the Kafka topic name in the supervisor configuration.

This article was inspired by the Druid tutorial "Kafka Indexing Service".
