Quick Tips for Using the Hive Shell Inside Scripts

There are many great examples out there for using the Hive shell, as well as ways to automate many of the animals in our Hadoop zoo. If you're just getting started, though, or need something quick that won't stay around long, all you need to do is throw a few lines of code together with some existing programs and avoid re-inventing the workflow. In this blog post, I'll share a few quick tips on using the Hive shell inside scripts. We'll take a look at a simple script that needs to pull an item or a count, and then look at two ways to use the Hive shell to get the answer.

All of the steps in this example were executed on Hive 0.13 on a MapR cluster using CentOS.

A Look at the Data Setup

In this setup, we have a data set that consists of travel times for specific traffic routes. We’ve pulled together travel times for days of the year, so when someone wants to know what to expect, we can quickly pull the info. In the past this was done on a traditional RDBMS, but we moved it off that in order to save on processing time and money, and are now using the MapR Distribution including Hadoop. The table is an external table and is tab-delimited. There are more efficient ways to store the data, but we’ll cover those comparisons in a later post. The table is also partitioned by collection day. If you’re not familiar with a partition, it just means that when we enter data into our table, it is grouped into folders of records on the file system. The data in each folder in this case represents a day.

CREATE EXTERNAL TABLE traffic_speeds (
  `id` STRING,
  `collectiontime` STRING,
  `speed` STRING,
  `traveltime` STRING,
  `linkpoints` STRING)
PARTITIONED BY (
  `collectionday` STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION
 'maprfs:/mapr/my_cluster/data/traffic/trafficspeeds';
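
Because the table is partitioned by collectionday, each day's records are stored in their own subdirectory under the table location. Roughly sketched (the day values here are placeholders; the actual partition values depend on how the data is loaded):

/mapr/my_cluster/data/traffic/trafficspeeds/collectionday=<day_1>/
/mapr/my_cluster/data/traffic/trafficspeeds/collectionday=<day_2>/
/mapr/my_cluster/data/traffic/trafficspeeds/collectionday=<day_n>/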

A typical record contains an id, a collection time, a speed, a travel time, and a set of link points, one row per observation.

Ways to Use Hive

In this example, we’re going to use a simple Bash script to extract some values out of a Hive table using the Hive shell. The same methods can be used for just about any scripting language. In this case, we are going to take a few parameters as arguments and then execute them on Hive using the Hive shell. The results are captured as a variable and then echoed to standard out. There are two basic ways to pass things off to the Hive shell to execute. This script will walk through both of those options. The first option will consist of the Hive command running an external HQL script and returning a value. The second option will consist of passing all the needed commands to Hive at one time and returning a value.

Option 1: Using Hive in Conjunction with an HQL Script

One way to use Hive inside a script is to have the Hive shell execute an HQL file. This is accomplished with the -f (file) option:

hive -f my_script.hql

An HQL script is just a series of Hive query language commands. They are the same ones you would use in the Hive shell.

example.hql:

show databases;
show tables;
set dfs.block.size=1073741824;
select * from my_database.my_table limit 10;

To make this process more generic, Hive allows you to pass in variables using the -hiveconf option:

hive -hiveconf MY_VAR1=value1 -hiveconf MY_VAR2=value2 -f my_script.hql

Inside the HQL file, the variables are referenced like this:

show databases;
show tables;
set dfs.block.size=${hiveconf:MY_VAR1};
select * from ${hiveconf:MY_VAR2} limit 10;

Operational messages are not typically passed back to the parent script, and they can be suppressed with the silent option, -S. This will make things far less chatty:

hive -S -hiveconf MY_VAR1=value1 -hiveconf MY_VAR2=value2 -f my_script.hql

How you pull a value from the command will vary depending on your scripting language of choice, but one thing that holds true for all of them is making sure that only one Hive command returns rows or values. In the example above, we have three. If the data from the select statement is what I want to capture cleanly, then I would remove the show commands from the HQL file:

set dfs.block.size=${hiveconf:MY_VAR1};
select * from ${hiveconf:MY_VAR2} limit 10;

An example of pulling a value using Bash looks something like this:

my_value=`hive -S -hiveconf MY_VAR1=value1 -hiveconf MY_VAR2=value2 -f my_script.hql`
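
If the query fails, the variable simply ends up empty, so in a real script it is worth checking the exit status before trusting the value. A minimal sketch, building on the same command:

my_value=`hive -S -hiveconf MY_VAR1=value1 -hiveconf MY_VAR2=value2 -f my_script.hql`
# the exit status of the command substitution is the hive shell's exit status
if [ $? -ne 0 ] || [ -z "$my_value" ]; then
    echo >&2 "hive query failed or returned nothing"
    exit 1
fi
echo "returned value = $my_value"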

Option 2: Passing Commands Directly to Hive

Another way to use Hive inside a script is to pass the HQL commands as a query string for the Hive shell to execute. This is accomplished with the -e option:

hive -e "select * from my_database.my_table limit 10;"

You can actually put several HQL commands in that string, which comes in handy when you have to specify a database first because the next command only takes a table name. One example is loading or adding a partition with the ALTER command:

hive -e "USE my_database; alter table my_table add if not exists partition(my_partition=my_value);"

As with the previous example, the only values returned are values pulled from the dataset. Hive operational messages may show up on the screen, but are not included in the response. You can remove that chatter by specifying silent mode again with the -S option.

hive -S -e "USE my_database; alter table my_table add if not exists partition(my_partition=my_value); select * from my_table  limit 10;"

As I mentioned earlier, pulling a value from the command will vary depending on your scripting language of choice. The one thing that holds true for all of them is making sure that only one Hive command returns rows or values. Another example of pulling a value using Bash looks something like this:

my_value=$( hive -S -e "USE my_database; alter table my_table add if not exists partition(my_partition=my_value); select * from my_table limit 10;")

An Example Script That Uses Hive

Let’s take a look at a Bash script that runs either of the options mentioned above in a more generic fashion in order to pull average traffic transit times for a given station.

Script Overview

The script used for the rest of this post is called "hive_scripts.sh". It allows for two operations: "pull_with_hql_script" or "pull_with_one_liner". The basic flow is:

./hive_scripts.sh command [options]
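
Running the script with an unrecognized (or no) command prints the general usage summary, and each command accepts -h or --help for its own usage:

./hive_scripts.sh
./hive_scripts.sh pull_with_hql_script --help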

Example Using External Hive Files with an HQL Script

Let’s take a look at using Hive HQL files in conjunction with a Bash script in order to ascertain a given station’s average traffic speed.

Command arguments

The arguments for "./hive_scripts.sh pull_with_hql_script" are:

-c|--comparison comparison_statement: column=value
-d|--database database_name
-h|--help: print the help info and exit
-r|--requested_column: column to get the averaged value from
-t|--table table_name

Example execution

In this example, the command line arguments are parsed and passed to Hive to execute in our HQL script.

Command: ./hive_scripts.sh pull_with_hql_script -c id=A2F36 -d default -t traffic_speeds -r speed

The select statement is built to average the traffic times over the specified range. In this case, a single station id was selected.

select="AVG($requested_column)"

The Hive command is executed and the value is echoed to standard out. The variable hive_script was hard-coded, but could be changed to a passed argument like all the other variables.

hive_script="hive_script.hql"
my_value=`hive -S -hiveconf MY_SELECT=$select -hiveconf MY_COMPARISON=$comparison -hiveconf MY_DATABASE=$database -hiveconf MY_TABLE=$table -f $hive_script`
echo "returned value = $my_value"

Example Passing All Commands to Hive Directly

Let’s take a look at passing Hive a set of HQL commands as a string with a Bash script to find a given station’s average traffic speed:

Command arguments

The arguments for "./hive_scripts.sh pull_with_one_liner" are:

-b|--begin_date yyyymmdd
-c|--comparison comparison_statement: column=value
-d|--database database_name
-e|--end_date yyyymmdd
-h|--help: print the help info and exit
-r|--requested_column: column to get the averaged value from
-t|--table table_name

Example execution

In this example, the command line arguments are parsed and passed to Hive to execute as a string. Additionally, there is an example of checking whether partitions exist and are loaded.

Command: ./hive_scripts.sh pull_with_one_liner -c id=A2F36 -d default -t traffic_speeds -r speed -b 20130621 -e 20130629

The select statement is built to average the traffic times over the specified range. In this case, a single station id was selected. A date range was also passed; it could be used to refine the data selection and filtering, but in this case it is used to load partitions for the given range if they are not found, as sketched below.
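
The core of that partition check, simplified from the full script at the end of this post (the database, table, and day values stand in for the script's variables):

# list the partitions Hive already knows about for this table
partitions=$(hive -S -e "USE my_database; SHOW PARTITIONS my_table")

# add the partition only if it does not already appear in that list
if ! echo "$partitions" | grep -q "collectionday=$my_day"; then
    hive -S -e "USE my_database; alter table my_table add if not exists partition(collectionday=$my_day);"
fi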

As in the example above, the select statement is built for use in the command:

select="AVG($requested_column)"

The values are passed and used to build the query. The query is passed to Hive and executed. The value is stored as my_value and is echoed to standard out.

# lets build the query we will execute in the hive shell
my_query="set mapred.reduce.tasks=30;"
my_query="$my_query SELECT $select"
my_query="$my_query FROM ("
my_query="$my_query SELECT DISTINCT collectiontime, id, collectionday, speed"
my_query="$my_query FROM $database.$table"
my_query="$my_query WHERE $comparison"
my_query="$my_query ) t"
my_query="$my_query GROUP by id, collectionday;"

# echo the query passed to the hive shell just because
echo "hive -S -e \"$my_query\"" 
my_value=$(hive -S -e "$my_query")
echo "returned value = $my_value"

hive_scripts.sh

#!/bin/bash
# hive_scripts.sh

# this will print the usage statements and exit
usage() {
	case $1 in
		"")
			echo ""
			echo "Usage: hive_scripts.sh command [options]"
			echo "      for info on each command: --> hive_scripts.sh command -h|--help"
			echo "Commands:"
			echo "      hive_scripts.sh pull_with_hql_script [options]"
			echo "      hive_scripts.sh pull_with_one_liner [options]"
			echo ""
			echo ""
			;;
		pull_with_hql_script)
			echo ""
			echo "Usage: hive_scripts.sh pull_with_hql_script [-h|--help]"
			echo ""
			echo "  This is a quick example of using a hql script with a bash script to pull a value:"
			echo ""
			echo "Params:"
			echo "      -c|--comparison comparison_statement: column=value"
			echo "      -d|--database database_name"
			echo "      -h|--help: print this help info and exit"
			echo "      -r|--requested_column: column to get averaged value from"
			echo "      -t|--table table_name"
			echo "Examples:"
			echo ""
			echo "		./hive_scripts.sh pull_with_hql_script -c a_column_name=a_value -d a_database_name -t a_table_name -r a_column_name"
			echo ""
			;;
		pull_with_one_liner)
			echo ""
			echo "Usage: hive_scripts.sh pull_with_one_liner [-h|--help]"
			echo ""
			echo "  This is a quick example of passing query strings directly to a hive shell with a bash script to pull a value:"
			echo ""
			echo "Params:"
			echo "      -b|--begin_date yyyymmdd"
			echo "      -c|--comparison comparison_statement: column=value"
			echo "      -d|--database database_name"
			echo "      -e|--end_date yyyymmdd"
			echo "      -h|--help: print this help info and exit"
			echo "      -r|--requested_column: column to get averaged value from"
			echo "      -t|--table table_name"
			echo "Examples:"
			echo ""
			echo "		./hive_scripts.sh pull_with_one_liner -c a_column_name=a_value -d a_database_name -t a_table_name -r a_column_name -b 20120122 -e 20130122"
			echo ""
			;;		
	esac
	exit
}

# this will process command line arguments enough to get you to a specific function
args() {
	echo "processing command request"
    case $1 in
        pull_with_hql_script)
            shift
            pull_with_hql_script "$@"
            ;;
        pull_with_one_liner)
            shift
            pull_with_one_liner "$@"
            ;;

        *)
            echo >&2 "Invalid comand: $1"
            usage
        	;;
    esac
}


pull_with_hql_script() {
	# init params
	requested_column=""
	database=""
	table=""
	comparison=""
	hive_script="hive_script.hql"
	select=""
	begin_date=""
	end_date=""

	# process args for this block
	while test $# -gt 0
	do
    case $1 in
            -c|--comparison)
            	shift
            	comparison=$1
            	;;
            -d|--database)
            	shift
            	database=$1
            	;;
        	-h|--help)
            	usage pull_with_hql_script
            	;;
            -r|--requested_column)
            	shift
            	requested_column=$1
            	;;
            -t|--table)
            	shift
            	table=$1
            	;;
        	*)
            	echo >&2 "Invalid argument: $1"
            	usage "pull_with_hql_script"
        	    ;;
    	esac
    	shift
	done
	
	# determine if any option is missing	
	if [ x"$requested_column" == "x" ]; then
		echo "missing requested column: -r|--requested_column column_name_to_count"
		usage "pull_with_hql_script"
	fi
	
	if [ x"$database" == "x" ]; then
		echo "missing database name: -d|--database database_name"
		usage "pull_with_hql_script"
	fi
	
	if [ x"$table" == "x" ]; then
		echo "missing table name: -t|--table table_name"
		usage "pull_with_hql_script"
	fi
	
	if [ x"$comparison" == "x" ]; then
		echo "missing comparison clause WHERE : -c|--comparison comparison"
		usage "pull_with_hql_script"
	fi

	# set select statement
	select="AVG($requested_column)"
	
	# echo the command used to run the hive hql script just because
	echo "hive -S -hiveconf MY_SELECT=$select -hiveconf MY_COMPARISON=$comparison -hiveconf MY_DATABASE=$database -hiveconf MY_TABLE=$table -f $hive_script" 
	my_value=`hive -S -hiveconf MY_SELECT=$select -hiveconf MY_COMPARISON=$comparison -hiveconf MY_DATABASE=$database -hiveconf MY_TABLE=$table -f $hive_script`
	echo "returned value = $my_value"
	exit
	
}

pull_with_one_liner() {
	# init params
	requested_column=""
	distinct=""
	database=""
	table=""
	comparison=""
	hive_script="hive_script.hql"
	select=""
	begin_date=""
	end_date=""
	load=true

	# process args for this block
	while test $# -gt 0
	do
    case $1 in
    		-b|--begin_date)
            	shift
            	begin_date=$1
            	;;
            -c|--comparison)
            	shift
            	comparison=$1
            	;;
            -d|--database)
            	shift
            	database=$1
            	;;
            -e|--end_date)
            	shift
            	end_date=$1
            	;;
        	-h|--help)
            	usage pull_with_one_liner
            	;;
            -r|--requested_column)
            	shift
            	requested_column=$1
            	;;
            -t|--table)
            	shift
            	table=$1
            	;;
        	*)
            	echo >&2 "Invalid argument: $1"
            	usage "pull_with_one_liner"
        	    ;;
    	esac
    	shift
	done
	
	# determine if any option is missing	
	if [ x"$requested_column" == "x" ]; then
		echo "missing requested column: -r|--requested_column column_name_to_count"
		usage "pull_with_one_liner"
	fi
	
	if [ x"$database" == "x" ]; then
		echo "missing database name: -d|--database database_name"
		usage "pull_with_one_liner"
	fi
	
	if [ x"$table" == "x" ]; then
		echo "missing table name: -t|--table table_name"
		usage "pull_with_one_liner"
	fi
	
	if [ x"$comparison" == "x" ]; then
		echo "missing comparison clause WHERE <comparison>: -c|--comparison comparison"
		usage "pull_with_one_liner"
	fi
	
	if [ x"$begin_date" == "x" ]; then
		echo "missing start date <2012-01-28>: -b|--begin_date yyyy-mm-dd"
		usage "pull_with_one_liner"
	fi
	
	if [ x"$end_date" == "x" ]; then
		echo "missing ending date <2014-02-17>: -b|--begin_date yyyy-mm-dd"
		usage "pull_with_one_liner"
	fi
	
	# get start date and end date in correct format
	begin_date=`date --date="$begin_date" '+%Y%m%d'`
	end_date=`date --date="$end_date" '+%Y%m%d'`

	# set select statement, in effect overriding any set by argument
	select="avg($requested_column)"
	
	# before we query for information we may want to make sure we have all the partitions loaded?
	# in this case the table was external so there may be partitions that were not loaded 
	# when the table was created
	# Here is an example of executing two commands with hive in a bash shell
	partitions=$(hive -e "USE $database; SHOW PARTITIONS $table")
	
	# loop through all the days in the range and load the partition if not present 
	# in most cases this is not needed.
	my_date=$begin_date
	while test  $my_date -le $end_date
	do
		# Load partitions
		if [ "$load" = true ]; then
			# check if partition is present
			add_partition=true
			my_timestamp=`date -d "$my_date" "+%s"`
			for partition in $partitions
			do
				if [ "$partition" == "collectionday=$my_timestamp" ]; then
					add_partition=false;
				fi
			done
			if [ "$add_partition" = true ]; then
				# Here is an example of executing two statements in the hive shell
				echo "hive -e \"USE $database; alter table $table add if not exists partition(collectionday=$my_timestamp);\""
				hive -e "USE $database; alter table $table add if not exists partition(collectionday=$my_timestamp);"
			fi
		fi
		my_date=`date --date="$my_date + 1 day" '+%Y%m%d'`
	# end of while loop
	done
	
	# lets build the query we will execute in the hive shell
	my_query="set mapred.reduce.tasks=30;"
	my_query="$my_query SELECT $select"
	my_query="$my_query FROM ("
	my_query="$my_query SELECT DISTINCT collectiontime, id, collectionday, speed"
	my_query="$my_query FROM $database.$table"
	my_query="$my_query WHERE $comparison"
	my_query="$my_query ) t"
	my_query="$my_query GROUP by id, collectionday;"

	# echo the query passed to the hive shell just because
	echo "hive -S -e \"$my_query\"" 
	my_value=$(hive -S -e "$my_query")
	echo "returned value = $my_value"
	exit
	
}

# -------------------------------------------------------------------------------------
# Beginning of script execution
#

args "$@"

hive_script.hql

set mapred.reduce.tasks=30;
use ${hiveconf:MY_DATABASE};
SELECT ${hiveconf:MY_SELECT}
FROM ( SELECT DISTINCT collectiontime, id, collectionday, speed FROM ${hiveconf:MY_TABLE} WHERE ${hiveconf:MY_COMPARISON}) t
GROUP by id, collectionday;

Using the Hive shell inside scripts is a great way to avoid re-inventing the workflow. In this blog post, I've shown you how to use Hive in conjunction with an HQL script, as well as how to pass commands directly to Hive.

If you have any questions, please ask them in the comments section below.
