What Kind of Hive Table is Best for Your Data?

Over the last few releases, the options for how you store data in Hive have advanced in many ways. In this post, let’s take a look at how to go about determining which Hive table storage format would be best for the data you are using. Starting with a basic table, we’ll create duplicate tables for each of the storage format options, and then compare queries and data compression. Just keep in mind that the goal of this post is to talk about ways of comparing table formats and compression options, not to define the fastest Hive setup for all things data. After all, the fun is in figuring out the Hive table storage format for your own Hive project, and not just reading about mine.

The Hadoop Cluster Layout

For our discussion today, I used a MapR Hadoop cluster consisting of 5 nodes in an Amazon EC2 environment. The MapR Version is 4.0.1 (Hadoop 2.4.1) running Hive 0.13. I have MRv1 and YARN running on all nodes, so I can run comparisons between legacy MRv1 jobs and YARN-controlled jobs.

Cluster Diagram

In this cluster, I spun up 5 nodes in an Amazon EC2 multi-tenant environment. The systems are running CentOS 6.5. I’ll focus on the Hive aspects for the purpose of this blog post, and save the specifics of supporting separate projects with separate SLAs for a later post.

1. Our Starting Data

We have two tables that we will start with. One table consists of information that details bike rental stations. The second table contains information on trips where bikes were rented and where they were turned in. The starting dataset is stored as a standard text table with delimited columns.

     1.1      Table 1: Bike Stations
This table has information on the rental stations. Here is an example of some of the fields and data:

     1.2      Table 2: Bike Trips
This table has information on bike rental activity. Here is an example of some of the fields and data:

2. Our Table Storage Options

Now that we have an idea of what is in our original data set, we can take a look at the storage options available in Hive. You can add to or subtract from the list of storage formats, but for this example, we will look at our starting text tables, RC, and ORC. This section covers what some of these formats mean and “why” tables are laid out this way in my MapR cluster as opposed to the “how.” Don’t worry, we’ll cover the “how” later.
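
If you inherit a table and aren’t sure how it is stored today, Hive will tell you. Here is a quick, hedged example; the table name is just a placeholder for whatever table you already have:

-- Shows the SerDe, InputFormat/OutputFormat, location, and table properties.
DESCRIBE FORMATTED bikes.stations;

-- Or dump the full CREATE statement, storage clauses included.
SHOW CREATE TABLE bikes.stations;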

     2.1      Different Storage Formats

        2.1.1      Text File
Text is where Hive started, and it has evolved to handle just about any text file with any delimiter you may be looking for. It’s one of the things that gives Hive the ability to get your data from plain files into SQL-fed tools.

This is our text file setup:

ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

        2.1.2      RC: Record Columnar File
The RC format was designed for clusters with MapReduce in mind. It is a huge step up over standard text files. It’s a mature format with ways to ingest into the cluster without ETL. It is supported in several Hadoop system components. For our comparison, we will ETL the data from text into the RC table using Hive.

The full table creation and load process is covered later, but this is what our table format looks like:

ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'

        2.1.3      ORC: Optimized Row Columnar File
The ORC format showed up in Hive 0.11. As the name implies, it is more optimized than the RC format. If you want to hold onto speed and compress the data as much as possible, then ORC is for you. We won’t be digging into how or why the “O” in ORC works—we’re just taking it for granted that it does and will be using it in our comparison.

Our ORC settings in our ORC table:

ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

     2.2      Table Locations and Partitions
Do I use an external table? Do I use partitions? Do file system settings make a difference? All good questions. In this section we’ll lay out some of those options to give you a frame of reference for deciding what to include in the comparison for your project.

        2.2.1      External or Internal
The main difference between internal and external tables is a matter of perspective. What tool do I expect will control the data—Hive, or something else? To oversimplify this, external tables are normally used when alterations to the data could happen with some other tool, or the data already exists and you want to keep it in its original form and use it in Hive.

For our example, you could use either, or do comparisons around both. In the table creation process later we will do both, but specify the storage location for the tables on the file system.
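
To make the difference concrete, here is a minimal, hedged sketch; the table names and path below are placeholders and are not part of the comparison itself:

-- External: Hive tracks the metadata, but the files belong to you.
CREATE EXTERNAL TABLE stations_ext (id int, name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'maprfs:/mapr/my_cluster/data/hive/text/stations_ext';

-- Managed (internal): Hive owns the files, in its warehouse or at the location you give it.
CREATE TABLE stations_managed (id int, name string);

DROP TABLE stations_ext;      -- removes only the metadata; the files stay on disk
DROP TABLE stations_managed;  -- removes the metadata and the data files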

        2.2.2      File System Volumes
One of the great things about using MapR is the power of logical file system volumes in my Hadoop cluster. On a small cluster with one use case this is not a big deal, but when you get to 300 nodes and 20 projects, all with specific SLAs, then it makes a huge difference. It’s one of the reasons you can support all those SLAs and projects in one cluster. But in this case, we are using separate volumes to help get a solid assessment on the statistics of how the data is stored for each table in our comparison.

3. Data Compression Options

For those of you who can remember Schoolhouse Rock: “Compression, Compression…what’s my compression?” There are several options for compression. For those of you who don’t want to worry about compression, you can just pick an option for the MapR Hadoop file system compression and not worry about it. For those of you who have a drive to tweak all things, you can run comparisons on the file system, Hive, and mixes of both till smoke comes out your ears. In this post, we are sticking to one set of values for each table format. Different compression settings can affect data in projects differently, but the combinations picked, while not the final say in all things data, will hopefully result in some interesting comparisons.

     3.1      Hive Compression Options
Here’s a list of some of the Hive compression options looked at in our example:

     3.2      MapR File System Options
The file system itself can also be set for specific compression formats. The tradeoff is always compression vs. speed. Below is a list of the file system compression options in MapR:

     3.3      Our Comparison Matrix

4. Setting Up Hadoop File System Locations in MapR

The ability to set up logical file system volumes inside Hadoop is a powerful capability not found anywhere else in the Hadoop world. It allows you to isolate the access and locality of data, which is handy when your Hadoop system graduates to production or moves past one use case. Here we are using it to help isolate the statistics of our Hive tables for the purposes of comparison, but if you’re interested in this capability, you can read more about it in our "Get Real with Hadoop: True Multi-tenancy" blog post.

This step is optional, but this capability opens up a world of possibilities when you gain such granular control of your data.

Let’s quickly step through setting up our volumes in our Hadoop cluster where I created our tables.

     4.1      Using MapR Control System UI to Establish File System Volumes
The most common way to create volumes is using the MapR Control System (MCS). All aspects of block replication, snapshot scheduling, data mirroring, access, data locality and data quotas can be set through the MCS.

     4.2      Using MapR REST API to Establish File System Volumes
Anything that is done through the MCS can be done through the REST API. The full documentation for this can be found in the MapR volume creation documentation.

To create the same volume pictured above, the following curl command to the REST API would get the job done:

curl -X GET -k -H "Authorization: Basic bWFwcjpyb290NG1hcHI=" \
  -H "Cache-Control: no-cache" \
  'https://jbates1:8443/rest/volume/create?name=hive_txt&path=/data/hive/text&quota=500M&replication=3&schedule=2&topology=/data&type=0&advisoryquota=100M'

     4.3      Using MapR CLI to Establish File System Volumes
The last method to create a volume is using the MapR CLI. This is the method I used for my volume creation process.

maprcli volume create -name hive_txt -path /data/hive/text -advisoryquota 100M \
  -quota 500M -replication 3 -schedule 2 -topology "/data" -type 0

     4.4      Volumes Used for This Comparison
Here are all the volumes created from the methods above:

     4.5      Setting the File System Compression Options
As mentioned above, you can use Hive or the file system to set the compression. Compression settings are managed at the directory level. Since MapR is a Hadoop system, I can use a Hadoop command to set the compression settings for my directories.

Set compression with something like this: hadoop mfs -setcompression zlib /data/hive/orc1

Verifying the compression setting can be done with this command: hadoop mfs -ls /data/hive

That’s all there is to adjusting file system compression. All new data will be compressed with the provisioned setting. More details can be found in the MapR compression documentation.

5. Creating the Tables

The text data, in CSV format, was loaded into the file system at /data/hive/text. We have external tables created in Hive, partitioned by year and month.

     5.1      Original Text Tables
Here are the commands used to create our original Hive tables:

        5.1.1      Bike stations table

CREATE EXTERNAL TABLE `STATIONS`(
  `id` int,
  `installdate` string,
  `name` string,
  `longitude` float,
  `latitude` float)
PARTITIONED BY (
  `year` int,
  `month` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/mapr/demo.jbates.mapr/data/hive/text/bikestations';

        5.1.2      Bike trips table

CREATE EXTERNAL TABLE `TRIPS`(
  `bike_nr` string,
  `duration` int,
  `start_date` string,
  `start_station` string,
  `end_station` string)
PARTITIONED BY (
  `year` int,
  `month` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/mapr/my_cluster/data/hive/text/trips';

     5.2      Hive RC Tables
The RC tables in Hive should show a significant performance increase over our original text files. Table creation is almost identical. In this case, the table location was specified, but the table was not built as an external table. Dropping an external table will not drop the data, but dropping one of these tables discards the dataset.

        5.2.1      stations_rc table

CREATE TABLE `STATIONS_RC`(
  `id` int,
  `installdate` string,
  `name` string,
  `longitude` float,
  `latitude` float)
PARTITIONED BY (
  `year` int,
  `month` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION
  'maprfs:/mapr/demo.jbates.mapr/data/hive/rc/bikestations';

        5.2.2      trips_rc table

CREATE TABLE `TRIPS_RC`(
  `bike_nr` string,
  `duration` int,
  `start_date` string,
  `start_station` string,
  `end_station` string)
PARTITIONED BY (
  `year` int,
  `month` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION
  'maprfs:/mapr/demo.jbates.mapr/data/hive/rc/trips';
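
If you later decide one of these managed RC tables should behave like an external table (so a DROP leaves the files in place), Hive can usually flip it with a table property. This is a hedged sketch; the property value is case-sensitive, and it is worth verifying the behavior on your Hive version first:

-- Mark the managed table as external so DROP TABLE no longer deletes the files.
ALTER TABLE trips_rc SET TBLPROPERTIES ('EXTERNAL'='TRUE');

-- Flip it back if Hive should own the data lifecycle after all.
ALTER TABLE trips_rc SET TBLPROPERTIES ('EXTERNAL'='FALSE');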

     5.3      Hive ORC Tables
With the ORC tables here, we added the wrinkle of setting a table property in the table creation process that will set the compression settings for our table.

        5.3.1      stations_orc1 table

CREATE EXTERNAL TABLE `STATIONS_ORC1`(
  `id` int,
  `installdate` string,
  `name` string,
  `longitude` float,
  `latitude` float)
PARTITIONED BY (
  `year` int,
  `month` string)
STORED AS ORC
LOCATION
  'maprfs:/mapr/demo.jbates.mapr/data/hive/orc1/bikestations'
TBLPROPERTIES ( "orc.compress"="NONE" );
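
To confirm the compression property actually landed on the table, Hive can echo the table properties back. A quick, hedged check:

-- Lists the table properties, including orc.compress, for the new table.
SHOW TBLPROPERTIES stations_orc1;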

        5.3.2      trips_orc1 table

CREATE EXTERNAL TABLE `TRIPS_ORC1`(
  `bike_nr` string,
  `duration` int,
  `start_date` string,
  `start_station` string,
  `end_station` string)
PARTITIONED BY (
  `year` int,
  `month` string)
STORED AS ORC
LOCATION
  'maprfs:/mapr/demo.jbates.mapr/data/hive/orc1/trips'
TBLPROPERTIES ( "orc.compress"="NONE" );

        5.3.3      stations_orc2 table

CREATE EXTERNAL TABLE `STATIONS_ORC2`(
  `id` int,
  `installdate` string,
  `name` string,
  `longitude` float,
  `latitude` float)
PARTITIONED BY (
  `year` int,
  `month` string)
STORED AS ORC
LOCATION
  'maprfs:/mapr/demo.jbates.mapr/data/hive/orc2/bikestations'
TBLPROPERTIES ( "orc.compress"="SNAPPY" );

        5.3.4      trips_orc2 table

CREATE EXTERNAL TABLE `TRIPS_ORC2`(
  `bike_nr` string,
  `duration` int,
  `start_date` string,
  `start_station` string,
  `end_station` string)
PARTITIONED BY (
  `year` int,
  `month` string)
STORED AS ORC
LOCATION
  'maprfs:/mapr/demo.jbates.mapr/data/hive/orc2/trips'
TBLPROPERTIES ( "orc.compress"="SNAPPY" );

6. Loading Our Data into the Tables

Now that our tables are all created, we can load the RC and ORC tables from the original text dataset. There are all kinds of examples on creating a new table from an old table, but those are for simple tables. When you add in some partitions, things get more complicated. All the tables we have created are partitioned. In general, partitions improve performance on larger datasets. When writing data into your table, the partitions must be named. Below are examples of copying data into the new tables.

     6.1      Writing to the RC Tables
The RC tables in Hive should show a significant performance increase over our original tables. In order to get the correct compression, we need to set it before we load the data. The commands here will write the data into the RC tables with the correct compression values.

        6.1.1      Loading stations_rc

set mapred.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.default.rcfile.serde=org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
set zlib.compress.level=BEST_SPEED;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
INSERT INTO TABLE bikes.stations_rc partition(year=2014,month="nov") 
SELECT `id`, `installdate`, `name`, `longitude`, `latitude`
FROM bikes.stations WHERE year=2014 AND month="nov";

        6.1.2      Loading trips_rc

set mapred.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.default.rcfile.serde=org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
set zlib.compress.level=BEST_SPEED;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
INSERT INTO TABLE bikes.trips_rc partition(year=2014,month="dec") 
SELECT `bike_nr`, `duration`, `start_date`, `start_station`, `end_station`
FROM bikes.trips WHERE year=2014 AND month="dec";

     6.2      Writing to the ORC Tables
The process of putting data into the ORC tables is a little simpler, since the compression is set in the table properties. Since we are using partitioned tables, that part remains unchanged.

        6.2.1      Loading stations_orc1

INSERT INTO TABLE bikes.stations_orc1 partition(year=2014,month="nov") 
SELECT `id`, `installdate`, `name`, `longitude`, `latitude`
FROM bikes.stations WHERE year=2014 AND month="nov";

        6.2.2      Loading trips_orc1

INSERT INTO TABLE bikes.trips_orc1 partition(year=2014,month="dec") 
SELECT `bike_nr`, `duration`, `start_date`, `start_station`, `end_station`
FROM bikes.trips WHERE year=2014 AND month="dec";

        6.2.3      Loading stations_orc2

INSERT INTO TABLE bikes.stations_orc2 partition(year=2014,month="nov") 
SELECT `id`, `installdate`, `name`, `longitude`, `latitude`
FROM bikes.stations WHERE year=2014 AND month="nov";

        6.2.4      Loading trips_orc2

INSERT INTO TABLE bikes.trips_orc2 partition(year=2014,month="dec") 
SELECT `bike_nr`, `duration`, `start_date`, `start_station`, `end_station`
FROM bikes.trips WHERE year=2014 AND month="dec";
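
All of the loads above name the target partition explicitly. If you have a lot of year/month combinations to move, Hive’s dynamic partitioning can pull the partition values from the data instead. Here is a hedged sketch using the standard Hive session settings (defaults vary by version, so verify them on yours); it loads every partition of the source table in one shot, which would duplicate the rows already copied above:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition columns go last in the SELECT and are not hard-coded.
INSERT INTO TABLE bikes.trips_orc1 PARTITION (year, month)
SELECT bike_nr, duration, start_date, start_station, end_station, year, month
FROM bikes.trips;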

     6.3      Hive Table Partitions
The examples above all use partitions. Partitions are transparent for the most part, except in two areas: you need to specify the partition when writing data into your tables, and you need to make sure the partition is registered in Hive before reading from the tables. Some quick scripting can get you past this without much effort; a sketch of the reading side follows.
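
Here is a hedged sketch of what that quick scripting usually boils down to on the reading side; the partition values match the ones used in this post:

USE bikes;

-- Register a partition whose files landed on the file system outside of Hive.
ALTER TABLE trips ADD IF NOT EXISTS PARTITION (year=2014, month='dec');

-- Or let Hive discover any partition directories it does not know about yet.
MSCK REPAIR TABLE trips;

-- Confirm which partitions the metastore can actually see.
SHOW PARTITIONS trips;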

7. Comparing Data Compression

OK…the tables are made… compression values set… table format specified…the data is loaded…what does it look like on disk?

Now that we have data in our tables, we can look at the effect of the compression settings on disk. Since I’m using a MapR Hadoop cluster, I have a read/write POSIX file system with native NFS on every node. I can use standard Linux commands to see how large my data set is on disk.

I also put each table in its own logical file system volume, so I can pull stats for each one of those over the REST API, the CLI, or from the MapR Control System UI.

     7.1      Pulling Data Size from the File System
Nothing complicated here. I ran a du command on each of the table directories.

du -h /mapr/my_cluster/data/hive/text

From that I pulled the values below:

From the data above, it looks like the table using ORC SNAPPY compression is the winner, but that may not be the case. If you remember, earlier we set up different file system compression settings as well. The command above will not reflect that.

     7.2      Pulling Data Size from the MapR Volumes
It looks like the SNAPPY compression in the ORC table worked best, but let’s look at the information from the MapR file system volumes. I used a simple CLI command to pull my values. I’m pulling the logical space used, which should be close to the numbers above, as well as the actual space used on disk.

maprcli volume info -name hive_txt -columns logicalUsed,used

The results are…

It looks like the MapR file system zlib setting compressed the first ORC table group well, with the snappy compression in the second ORC table group landing right next to it, but there is always the question of performance.

8. Executing Query Comparisons

So far, we have created and loaded our Hive tables, and we have looked at the storage and compression options. Now we need to execute some queries and see what we see. This post is not the end all for everything Hive; the goal is just to get you started looking at your own data in your own cluster. With that in mind, I quickly picked a few queries that I thought were interesting, and used them to evaluate the settings I had configured in my tables.

     8.1      The Queries
My query goals: I wanted some counting, some joins, some conversions, and some MapReduce jobs to run. With that in mind, this is what I ended up with:

        8.1.1      Query 1: Joining My Trip and Station Info
This query joins my trip data with my station data, so I can get full information on where the trip started and where it ended.

SELECT a.start_date, a.bike_nr, a.duration, b.start_station_id, b.start_latitude, b.start_longitude, b.stat_station, c.end_station_id, c.end_latitude, c.end_longitude, c.end_station
FROM (SELECT duration, bike_nr, start_date, cast(split(trips_rc.start_station, '/')[4] as int) as start_station, cast(split(trips_rc.end_station, '/')[4] as int) as end_station FROM trips_rc) a 
JOIN (SELECT id as start_station_id, longitude as start_longitude, latitude as start_latitude, name as stat_station FROM stations_rc) b ON a.start_station = b.start_station_id
JOIN (SELECT id as end_station_id,longitude as end_longitude, latitude as end_latitude, name as end_station FROM stations_rc) c ON a.end_station = c.end_station_id;
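
If you want to see what Hive will actually do with a query like this before timing it, prefixing it with EXPLAIN prints the plan, including how many stages it breaks into. A hedged, abbreviated example against the RC tables:

-- Print the query plan (stage graph and operators) without running the job.
EXPLAIN
SELECT a.start_date, a.bike_nr, a.duration, b.start_station_id, b.stat_station
FROM (SELECT duration, bike_nr, start_date, cast(split(trips_rc.start_station, '/')[4] as int) as start_station FROM trips_rc) a
JOIN (SELECT id as start_station_id, name as stat_station FROM stations_rc) b ON a.start_station = b.start_station_id;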

        8.1.2      Query 2: Bike Utilization and Movement
This query just added up the time the bikes were in service, and added in the joins to get more info on the starting and ending stations.

SELECT a.bike_nr, sum(a.duration)
FROM (SELECT duration, bike_nr, start_date, cast(split(trips_rc.start_station, '/')[4] as int) as start_station, cast(split(trips_rc.end_station, '/')[4] as int) as end_station FROM trips_rc) a 
JOIN (SELECT id as start_station_id, longitude as start_longitude, latitude as start_latitude, name as stat_station FROM stations_rc) b ON a.start_station = b.start_station_id
JOIN (SELECT id as end_station_id,longitude as end_longitude, latitude as end_latitude, name as end_station FROM stations_rc) c ON a.end_station = c.end_station_id
WHERE a.duration > 0
GROUP BY a.bike_nr;

        8.1.3      Query 3: Station Departures and Returns
This query takes a look at the number of times a bike leaves a station compared to the number of times one is returned to it.

SELECT s.station_id as station_id, e.end_count as end_count, s.start_count as start_count FROM
(SELECT a.end_station as station_id, count(a.end_station) as end_count FROM (SELECT cast(split(trips_rc.end_station, '/')[4] as int) as end_station FROM trips_rc) a group by a.end_station) e
FULL JOIN
(SELECT b.start_station as station_id, count(b.start_station) as start_count FROM (SELECT cast(split(trips_rc.start_station, '/')[4] as int) as start_station FROM trips_rc ) b group by b.start_station) s
WHERE e.station_id = s.station_id;
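
One wrinkle worth noting: because the join condition sits in the WHERE clause, rows where either side is NULL get filtered out, so this behaves like an inner join in practice. If you want stations that only ever show up on one side, a hedged variation moves the condition into the ON clause and coalesces the key:

SELECT coalesce(e.station_id, s.station_id) as station_id, e.end_count as end_count, s.start_count as start_count FROM
(SELECT a.end_station as station_id, count(a.end_station) as end_count FROM (SELECT cast(split(trips_rc.end_station, '/')[4] as int) as end_station FROM trips_rc) a group by a.end_station) e
FULL OUTER JOIN
(SELECT b.start_station as station_id, count(b.start_station) as start_count FROM (SELECT cast(split(trips_rc.start_station, '/')[4] as int) as start_station FROM trips_rc ) b group by b.start_station) s
ON e.station_id = s.station_id;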

     8.2      Comparison Scripts
Now that I have my queries, it’s time to run them in a script. I wanted to run each query for each table multiple times and record the results. Using some of the methods in my earlier posts, I put together the script below to execute each query a number of times against each table and log the results.

#!/bin/bash
# run_compare.sh


# this will print the usage statements and exit
usage() {
	case $1 in
		"")
			echo ""
			echo "Usage: run_compare.sh [-l /my/log/file.txt] [-c run_count] [-h|--help]"
			echo ""
			echo "  This is a quick example of comparing some hive queries to different tables  with bash"
			echo "     The queries and tables are hard coded in the script"
			echo ""
			echo "Params:"
			echo "      -c|--count run_count: default is 10"
			echo "      -h|--help: print this help info and exit"
			echo "      -l|--logfile: default is run_compare.csv in execution dir"
			echo "Examples:"
			echo ""
			echo "		./run_compare.sh -c 100 -l myfile.csv"
			echo ""
			;;
		
	esac
	exit
}

# this will process command line arguments enough 
# to get you to a specific function
args() {
	run_compare $@
}

run_compare() {
	# init params
	database="bikes"
	table_extensions="null rc orc1 orc2"
	count=10
	logfile="run_compare.csv"
	row_count=""
	start_time=""
	end_time=""
	my_value=""
	my_query=""
	name=""

	# process args for this block
	while test $# -gt 0
	do
    	case $1 in
    		-l|--logfile)
            	shift
            	logfile=$1
            	;;
            -c|--count)
            	shift
            	count=$1
            	;;
        	-h|--help)
            	usage ""
            	;;
        	*)
            	echo >&2 "Invalid argument: $1"
            	usage ""
        	    ;;
    	esac
    	shift
	done
	
	# clean out data from existing log file
	cat /dev/null > $logfile
	
	# execute comparison for specified count
	c=0
	while test $c -lt $count
	do
		let c++
		echo "running round $c"
		for ext in $table_extensions; do
			if [ "$ext" = "null" ]; then
				ext=""
				name="text"
			else
				name=$ext
				ext="_$ext"
			fi
			
			echo "round $c: table group $name" 
			
			# execute Query1
			my_query_name="Query1"
			my_query="set mapred.reduce.tasks=30;"
			my_query="$my_query use $database;"
			my_query="$my_query SELECT a.start_date, a.bike_nr, a.duration, b.start_station_id, b.start_latitude, b.start_longitude, b.stat_station, c.end_station_id, c.end_latitude, c.end_longitude, c.end_station"
			my_query="$my_query FROM (SELECT duration, bike_nr, start_date, cast(split(trips$ext.start_station, '/')[4] as int) as start_station, cast(split(trips$ext.end_station, '/')[4] as int) as end_station FROM trips$ext) a"
			my_query="$my_query JOIN (SELECT id as start_station_id, longitude as start_longitude, latitude as start_latitude, name as stat_station FROM stations$ext) b ON a.start_station = b.start_station_id"
			my_query="$my_query JOIN (SELECT id as end_station_id,longitude as end_longitude, latitude as end_latitude, name as end_station FROM stations$ext) c ON a.end_station = c.end_station_id"
			
			start_time=`date "+%s"`
			my_value=$(hive -S -e "$my_query")
			end_time=`date "+%s"`
			r_count=`echo "$my_value"| wc -l`
			log_it $logfile $name $start_time $end_time $my_query_name "$r_count" "$my_query" 
			
			
			# execute Query 2
			my_query_name="Query2"
			my_query="set mapred.reduce.tasks=30;"
			my_query="$my_query use $database;"
			my_query="$my_query SELECT a.bike_nr, sum(a.duration)"
			my_query="$my_query FROM (SELECT duration, gender, subscription_type, status, bike_nr, start_date, cast(split(trips$ext.start_station, '/')[4] as int) as start_station, cast(split(trips$ext.end_station, '/')[4] as int) as end_station FROM trips$ext) a"
			my_query="$my_query JOIN (SELECT id as start_station_id, longitude as start_longitude, latitude as start_latitude, name as stat_station FROM stations$ext) b ON a.start_station = b.start_station_id"
			my_query="$my_query JOIN (SELECT id as end_station_id,longitude as end_longitude, latitude as end_latitude, name as end_station FROM stations$ext) c ON a.end_station = c.end_station_id"
			my_query="$my_query WHERE a.duration > 0"
			my_query="$my_query GROUP BY a.bike_nr"
			
			start_time=`date "+%s"`
			my_value=$(hive -S -e "$my_query")
			#my_value="MyValue"
			end_time=`date "+%s"`
			r_count=`echo "$my_value"| wc -l`
			log_it $logfile $name $start_time $end_time $my_query_name "$r_count" "$my_query"
			
			# execute Query 3
			my_query_name="Query3"
			my_query="set mapred.reduce.tasks=30;"
			my_query="$my_query use $database;"
			my_query="$my_query SELECT s.station_id as station_id, e.end_count as end_count, s.start_count as start_count FROM"
			my_query="$my_query (SELECT a.end_station as station_id, count(a.end_station) as end_count FROM (SELECT cast(split(trips$ext.end_station, '/')[4] as int) as end_station FROM trips$ext) a group by a.end_station) e"
			my_query="$my_query FULL JOIN"
			my_query="$my_query (SELECT b.start_station as station_id, count(b.start_station) as start_count FROM (SELECT cast(split(trips$ext.start_station, '/')[4] as int) as start_station FROM trips$ext ) b group by b.start_station) s"
			my_query="$my_query WHERE e.station_id = s.station_id"
			
			start_time=`date "+%s"`
			my_value=$(hive -S -e "$my_query")
			end_time=`date "+%s"`
			r_count=`echo "$my_value"| wc -l`
			log_it $logfile $name $start_time $end_time $my_query_name "$r_count" "$my_query" 
			
		done
	done
	exit	
}

# pass in logfile, start, end, query_name, result_count, query
# count result set, and log the data to csv file
log_it() {
	log_file=$1
	n=$2
	start_t=$3
	end_t=$4
	q_name=$5
	result_count=$6
	q=$7
		
	let duration=$end_t-$start_t
	time_run=`date -d @$start_t`
	echo "$n,$q_name,\"$time_run\",$duration,$result_count,\"$q\"" >> $log_file
}

# -------------------------------------------------------------------------------------
# Beginning of script execution
#

args $@

9. What Performance Did We Get?

My goal here is to help anyone new to the game of SQL on Hadoop as they step through what can be done with their data. I even tweaked some of the queries and ran them against Impala and Apache Drill, since they were all running in my cluster. I ran iterations in YARN as well as MRv1. Below are the average performance times for the little setup I ran. Hopefully this will help kick-start your thoughts on running this with your own data.

After running 20 iterations on this data set, the results are…

In the compression results, you can see that using the ORC format saved space just by the nature of its architecture. Adding in the MapR file system compression took it the rest of the way, but the snappy compression available in ORC also works well. When you look at the RC tables compressed with gzip, you can see a performance increase and significant space savings. For this data set, the best-performing combination was the ORC table where the MapR file system handled the compression, but the difference was close enough that you could go with either option.

When comparing the response times of Hive to those of Impala or Apache Drill, you can start to see where things are going with SQL on Hadoop. Even the query that took the longest in Hive improved significantly when moved to services that query in memory without needing MapReduce. Still, these gains are not free, and for queries where you don’t mind the wait, Hive works well.

10. Closing Thoughts

The journey into Hive and Hadoop has been an eye opener for me, and has helped me re-think how I can deal with data. Using MapR has helped make the process enjoyable. From a data science perspective, I like having the options of running the Hadoop tools and even pulling in legacy scripts that have no knowledge of Hadoop. From a data engineering perspective, I like the read/write file system with standard Linux capabilities. Just being able to edit the data where it sits has saved me a large amount of time. From an IT perspective, I like the ability to run all my Hadoop projects in one cluster. As things grow, the level of effort needed to monitor and maintain the gear stays fairly constant.

As SQL on Hadoop and Hive continues to evolve, it is beneficial to re-look at what you are doing or what you may have done in the past. Being able to look at all the capabilities of Hive as well as all the other SQL possibilities on Hadoop in one cluster just makes life so much simpler as you go from project to production. The point here is not to go with the results that I found with this data set, but to jump into your own data and see where the tables take you.

Good luck,
Jimmy Bates
