Twitter is a powerful social media platform that can yield useful information about products, brands and customer experience. This blog will explain how to:
- Quickly configure an environment to stream Twitter data (filtered on keywords and languages) using Apache Flume
- Analyze the data in native JSON format with SQL using Apache Drill
- Run interactive reports and analysis using MicroStrategy
To configure the environment (on AWS), we will go through the following steps:
- Create a Twitter Dev account and register a Twitter application
- Provision a preconfigured AWS MapR node with Flume and Drill
- Provision a MicroStrategy AWS instance
- Configure MicroStrategy to run reports and analyses using Apache Drill
Create a Twitter Dev account and register an application
In this section you will create a Twitter development account and register a Twitter application that lets you establish a Twitter feed. You will also obtain the credentials that Flume requires to use Twitter as a source.
- Go to dev.twitter.com and sign in with your Twitter account details.
- Click on “Manage Your Apps” which is under Tools in the page footer.
- Click on the “Create New App” button and fill in the form, then create the application.
- Now create your access token by going to the Keys and Access Tokens tab, then click on the “Create my Access Token” button. (Note: Read-only access is all that is needed).
- Copy the following credentials for the Twitter App, as they will be used to configure Flume: Consumer Key, Consumer Secret, Access Token and Access Token Secret.
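One way to keep track of the four values until Flume is configured is to export them as environment variables and check that none is missing. The Python sketch below assumes hypothetical variable names (TWITTER_CONSUMER_KEY and so on); neither Twitter nor Flume requires these names.

```python
import os

# Hypothetical environment variable names chosen for this sketch.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All four Twitter credentials are set.")
```

Passing a plain dictionary instead of the real environment makes the check easy to try out before any values are exported.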
Provision preconfigured MapR node on AWS
This section describes how to provision a preconfigured MapR node on AWS that is already configured with Flume and Drill, as well as the specific elements to support data streaming from Twitter and Drill query views.
- In AWS, launch an instance. The AMI is preconfigured to use an m2.2xlarge instance type with 4 vCPUs and 32GB of memory.
- Select the AMI id ami-4dedc47d. This AMI is publicly available under Community AMIs.
The AMI will have a 6GB root drive and 100GB data drive. Please note that it is a small node, and very large volumes of data will slow the response time significantly for Twitter data queries.
Make sure that the instance has an external IP assigned; an Elastic IP is preferred, but not essential. Also verify that a security group is used with open TCP and UDP ports on the node. At this time, all ports are left open on the node.
Once the instance has been provisioned and booted up, you have to reboot the node in the AWS EC2 management interface to finalize the configuration.
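If you prefer the AWS command line to the console, the launch and reboot steps might look like the commands assembled below. This is only a sketch: "aws ec2 run-instances" and "aws ec2 reboot-instances" are standard AWS CLI subcommands, but the key pair name, security group ID and instance ID are placeholders that you would replace with your own. The code builds and prints the commands; it does not run them.

```python
import shlex

AMI_ID = "ami-4dedc47d"        # AMI named in this post
INSTANCE_TYPE = "m2.2xlarge"   # instance type named in this post


def launch_command(key_name, security_group_id):
    """Build an `aws ec2 run-instances` command for the MapR AMI."""
    return [
        "aws", "ec2", "run-instances",
        "--image-id", AMI_ID,
        "--instance-type", INSTANCE_TYPE,
        "--count", "1",
        "--key-name", key_name,
        "--security-group-ids", security_group_id,
    ]


def reboot_command(instance_id):
    """Build the `aws ec2 reboot-instances` command for one instance."""
    return ["aws", "ec2", "reboot-instances", "--instance-ids", instance_id]


if __name__ == "__main__":
    # Placeholder values, not real resources.
    print(shlex.join(launch_command("my-key-pair", "sg-0123456789abcdef0")))
    print(shlex.join(reboot_command("i-0123456789abcdef0")))
```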
The node should now be configured with the required Flume and Drill installation, and all that is required is to update the Flume configuration files with the required credentials and keywords.
- Log in as the ec2-user using the AWS credentials.
- Then switch to the mapr user on the node using su - mapr.
- Update the Flume configuration files flume-env.sh and flume.conf in the <FLUME HOME>/conf directory.
See the sample files located at: https://github.com/mapr/mapr-demos/tree/master/drill-twitter-MSTR/flume
For the flume.conf file, enter the Twitter app credentials from the first section, and also the desired keywords, separated by a comma. Keywords can include multiple words separated by a space. Additionally, Tweets can be filtered for specific languages by entering the ISO 639-1 language codes separated by a comma. If no language filtering is required, simply leave the parameter blank. For language codes, see: http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
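The formatting rules above (comma-separated keywords that may contain spaces, comma-separated language codes, blank for no language filter) can be sketched in Python as follows. The property names used here (consumerKey, keywords, language and so on) and the source name "Twitter" are assumptions modeled on commonly used Flume Twitter sources; the sample flume.conf linked above is the authoritative reference for the names this AMI expects.

```python
import re

TWO_LETTER_CODE = re.compile(r"^[a-z]{2}$")


def join_keywords(keywords):
    """Join keywords with commas; a keyword may itself contain spaces."""
    return ",".join(k.strip() for k in keywords if k.strip())


def join_languages(codes):
    """Join language codes with commas after checking their shape.

    Only the two-letter shape is checked, not whether the code is
    actually assigned in ISO 639-1. An empty result means no filtering.
    """
    cleaned = [c.strip().lower() for c in codes if c.strip()]
    bad = [c for c in cleaned if not TWO_LETTER_CODE.match(c)]
    if bad:
        raise ValueError(f"not two-letter codes: {bad}")
    return ",".join(cleaned)


def twitter_source_lines(creds, keywords, languages=(),
                         agent="TwitterAgent", source="Twitter"):
    """Render flume.conf lines; the property names are assumptions."""
    prefix = f"{agent}.sources.{source}"
    return [
        f"{prefix}.consumerKey = {creds['consumer_key']}",
        f"{prefix}.consumerSecret = {creds['consumer_secret']}",
        f"{prefix}.accessToken = {creds['access_token']}",
        f"{prefix}.accessTokenSecret = {creds['access_token_secret']}",
        f"{prefix}.keywords = {join_keywords(keywords)}",
        f"{prefix}.language = {join_languages(languages)}",
    ]
```

For example, the keywords "mapr" and "apache drill" with the language "en" would yield a keywords value of "mapr,apache drill" and a language value of "en".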
To start Flume and the data stream, go to the <FLUME HOME> directory and run the following steps as user mapr.
First go into a Linux screen terminal by simply typing “screen” in the command line.
- Then start Flume by typing
./bin/flume-ng agent --conf ./conf/ -f ./conf/flume.conf -Dflume.root.logger=INFO,console -n TwitterAgent
You can detach from screen by pressing Ctrl+a and then d. To go back to the screen terminal, enter screen -r to reattach.
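If you would rather script the launch than type it interactively, the Python sketch below assembles the same flume-ng command line and, optionally, wraps it so that screen starts it already detached. The session name "flume" is a hypothetical choice, and the code only prints the commands rather than running them.

```python
import shlex


def flume_command(agent="TwitterAgent", conf_dir="./conf/",
                  conf_file="./conf/flume.conf"):
    """Return the flume-ng invocation from this post as an argument list."""
    return [
        "./bin/flume-ng", "agent",
        "--conf", conf_dir,
        "-f", conf_file,
        "-Dflume.root.logger=INFO,console",
        "-n", agent,
    ]


def detached_screen_command(session, command):
    """Wrap a command so screen starts it in a detached, named session."""
    return ["screen", "-dmS", session] + list(command)


if __name__ == "__main__":
    # Print the commands; run them from the <FLUME HOME> directory as mapr.
    print(shlex.join(flume_command()))
    print(shlex.join(detached_screen_command("flume", flume_command())))
```

With a named session, screen -r flume reattaches to that session specifically.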
Twitter data will now be streaming into the system. You can verify volumes by executing du -h /mapr/drill_demo/twitter/feed.
Please note that it takes a while to build up a volume of data in the feed directory. You should allow at least 20-30 minutes to start noticing data in the feed directory.
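While waiting, a small script can report how much data has arrived. The Python sketch below walks the feed directory and returns the number of files and their combined size; note that it sums file sizes, so the figure can differ somewhat from what du reports as disk usage. The default path is the feed directory named above.

```python
import os

FEED_DIR = "/mapr/drill_demo/twitter/feed"  # feed directory from this post


def feed_stats(feed_dir=FEED_DIR):
    """Return (file_count, total_bytes) for all files under feed_dir."""
    count = 0
    total = 0
    for root, _dirs, files in os.walk(feed_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                count += 1
                total += os.path.getsize(path)
    return count, total


if __name__ == "__main__":
    files, size = feed_stats()
    print(f"{files} files, {size / (1024 * 1024):.1f} MB in {FEED_DIR}")
```

A result of zero files simply means that Flume has not yet written any data, or that the directory does not exist on this machine.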
Drill is already configured and ready, but the data needs to be present in the feed directory before any of the queries will function.
Provision MicroStrategy AWS Instance
MicroStrategy provides AWS instances in various sizes. Each comes with a free 30-day trial of MicroStrategy, but note that AWS charges still apply for the platform and OS.
This section covers the steps to provision the MicroStrategy node in AWS.
To start, go to the MicroStrategy website: http://www.microstrategy.com/us/analytics/analytics-on-aws
- Click on “Get started.”
- Then select the number of users as appropriate (25 users is a good starting point for most cases).
- Select the AWS region. It is highly recommended that the MapR node and the MicroStrategy instance are located in the same AWS region.
- Click on “Continue.”
- Select the “Manual Launch” tab.
- An r3.large EC2 instance is sufficient for the 25-user version.
- Click on “Launch with EC2 Console” next to the appropriate region.
- Select the r3.large instance and click “Configure Instance Details.”
- Select the appropriate network settings and zones. Ideally, place the instance in the same zone and network as the MapR node.
It is very important to make sure that the MicroStrategy instance has a public IP; an Elastic IP is preferred but not essential.
- Keep default storage.
- Assign a tag to identify the instance.
- Select a security group that allows sufficient access to external IPs. For a test, it is best to open all ports.
- Next, launch the instance.
- Once the instance is fully provisioned, select it in the AWS console and click on “Connect.”
- You can now click on “Get Password” to get the OS Administrator password.
The instance is now accessible with RDP and is using the relevant AWS credentials and security.
For more information, see: http://www.microstrategy.com/Strategy/media/downloads/products/cloud/cloud_aws-user-guide.pdf
Configure MicroStrategy to run reports and analyses using Apache Drill
In this section, we will go through the steps to configure MicroStrategy to integrate with Drill using the ODBC driver. In addition, we’ll cover how to install a MicroStrategy package with a number of useful prebuilt reports for working with Twitter data. These reports can be modified as needed, or used as a template to create new and more interesting reports and analysis models.
- Configure the ODBC driver for Drill on MicroStrategy Analytics as described here: http://drill.apache.org/docs/using-microstrategy-analytics-with-apache-drill/
NOTE: For Quick Start, the v0.08.1.0618 version of the ODBC driver can be used, which is located here: http://package.mapr.com/tools/MapR-ODBC/MapR_Drill/MapRDrill_odbc_v0.08.1.0618/MapRDrillODBC32.msi
The Quick Start package requires that a System DSN named ‘Twitter’ is configured with the ODBC administrator.
The Drill object is part of the package and doesn’t need to be configured.
Make sure that you use the AWS Private IP if both the MapR node and the MicroStrategy instance are located in the same region (which is recommended).
Download the configuration package for MicroStrategy on the Windows system here: https://github.com/mapr/mapr-demos/blob/master/drill-twitter-MSTR/MSTR/DrillTwitterProjectPackage.mmp
(You can either use Git for Windows or the full GitHub for Windows).
First, create a new project with MicroStrategy Developer:
Click on “Create Project” and type a name for the new project.
No further steps are required after the initial project creation step. Simply click OK.
The Project should now be visible in MicroStrategy Developer.
Open MicroStrategy Object Manager.
Connect to the required Project Source and login as Administrator.
Select the project that the package should be loaded into.
Then, go to the Tools menu and select Import Configuration Package.
Open the configuration package file and click “proceed.”
The package with the reports will now be available in MicroStrategy.
The reports can be tested and modified in MicroStrategy Developer, and permissions can be configured as needed.
First, update the schema by clicking on the Schema menu and selecting “Update Schema.”
Select all check boxes and click “Update.”
To create a user and set the Administrator password, expand Administration, then User Manager and click on “Everyone.”
Right click to create a new user, or click on Administrator to edit the password.
The package contains reports in three main categories:
- Volumes - with a number of reports that show the total volume of Tweets by different date and time designations.
- Top List - where the top Tweets, Retweets, hashtags and users are displayed.
- Specific Terms - where Tweets and Retweets can be measured or listed, based on specific terms in the text of the Tweet itself.
These reports can be copied and modified as needed, and serve as templates for querying the Twitter data using Drill. There are 18 reports in the package, and most include prompts that let the user specify date ranges, set output limits where relevant, and enter specific terms as needed.
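To make the Volumes and Specific Terms categories more concrete, here is a rough Python sketch of the kind of aggregation such reports perform. It is not the Drill SQL that the package uses. It assumes that each Tweet is a dictionary with the "created_at" and "text" fields of the standard Twitter JSON format, and the function names are made up for this illustration.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout of "created_at" in standard Twitter JSON,
# e.g. "Mon Jun 01 10:15:00 +0000 2015".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_hour(tweets):
    """Count Tweets per hour, keyed by 'YYYY-MM-DD HH:00'."""
    counts = Counter()
    for tweet in tweets:
        stamp = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return dict(sorted(counts.items()))


def tweets_matching(tweets, term):
    """Return Tweets whose text contains the term, ignoring case."""
    needle = term.lower()
    return [t for t in tweets if needle in t.get("text", "").lower()]
```

Three Tweets posted at 10:15, 10:45 and 11:05 on the same day would give a count of two for the 10:00 hour and one for the 11:00 hour.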
The reports can be accessed through MicroStrategy Developer or the web interface. The web interface provides easy access to work with the reports and make them available to other users. MicroStrategy Developer provides a more powerful interface to modify reports or add new reports, but requires RDP access to the node.
Using a web browser, enter the URL for the web interface:
http://<MSTR node name or IP address>/MicroStrategy/asp/Main.aspx
Log in with the user created previously, or as Administrator.
NOTE: This requires the credentials created initially with Developer.
Once logged in, choose the project that was used to load the analysis package.
Then select “Shared Reports” and the folders with the three main categories of the reports will be visible.
Some reports will require prompts before executing.
Enter the parameters and click on “Run Report” to execute.
Report formatting and various other functions are available in the web interface.
To refresh the data or re-enter prompt values, click on the Data Menu and then select Refresh or Re-prompt.
The reports will be located in the Public Objects folder of the project into which the package was installed.
Many of the reports will require user input in the form of prompts to select the desired data. In this example we will select the Top Hashtags report in the right-hand column.
This report requires a Start Date and End Date to specify the date range for data of interest; the default values of the prompts are to select data for the last two months, ending with the current date.
In addition you can specify the limit for the number of Top Hashtags to be returned; the default is to return the top 10 hashtags.
The final result is then displayed as a bar chart showing each hashtag and the number of times it appeared in the specified date range.
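The logic behind the Top Hashtags report can be approximated in a few lines of Python. This is a sketch of the idea, not the report's actual Drill query. It assumes Tweets in the standard Twitter JSON layout, where hashtags sit under entities.hashtags[].text, and it makes two choices of its own: "last two months" is read as the same day of the month two calendar months earlier, and hashtags are counted case-insensitively.

```python
import calendar
from collections import Counter
from datetime import date, datetime

CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def default_date_range(today=None):
    """Return (start, end): two calendar months back through today."""
    end = today or date.today()
    months = end.year * 12 + (end.month - 1) - 2
    year, month = divmod(months, 12)
    month += 1
    # Clamp the day for shorter months (e.g. April 30 -> February 28).
    day = min(end.day, calendar.monthrange(year, month)[1])
    return date(year, month, day), end


def top_hashtags(tweets, start, end, limit=10):
    """Return the `limit` most frequent hashtags dated start..end inclusive."""
    counts = Counter()
    for tweet in tweets:
        stamp = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        if not (start <= stamp.date() <= end):
            continue
        hashtags = (tweet.get("entities") or {}).get("hashtags") or []
        for tag in hashtags:
            counts[tag["text"].lower()] += 1  # case-folding is an assumption
    return counts.most_common(limit)
```

Calling top_hashtags(tweets, *default_date_range()) mirrors running the report with its default prompt values.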
Below are a couple of samples of other reports available in the bundle:
- Total volume of Tweets by hour
- Top Retweets for a date range, with the original Tweet date and the count in the date range
In this tutorial, you learned how to configure an environment to stream Twitter data using Apache Flume. You then learned how to analyze the data in native JSON format with SQL using Apache Drill, and how to run interactive reports and analysis using MicroStrategy. Let us know if you have any feedback on the tutorial, or if you are running into any issues.
Here are some links you will find useful for getting started with Apache Drill: