Teradata Connector for Hadoop with YARN on MapR

Teradata Connector for Hadoop (TDCH) is a key component of Teradata’s Unified Data Architecture for moving data between Teradata and Hadoop. TDCH invokes a mapreduce job on the Hadoop cluster to push/pull data to/from Teradata databases, with each mapper moving a portion of the data in parallel across all nodes for very fast transfers. Use the Apache Hadoop 2.6 version of TDCH with YARN on MapR 4.1 and later.

This post provides detailed examples for:

  • Bi-directional transfers between Hadoop files and Teradata tables
  • Bi-directional transfers between Hive tables and Teradata tables

Test Environment

I used a TeradataExpress 15.0 virtual machine, a 3-node MapR 4.1 cluster, and a MapR Linux client for the examples in this blog.  If you don’t already have a Teradata environment, you can download a TeradataExpress VM from the Teradata Developer’s Exchange. For simplicity, I have the MapR cluster NFS mounted on the client so I don’t need to copy files between Hadoop and the local Linux file system; I just keep everything on the NFS-mounted Hadoop file system.

Teradata Database: TeradataExpress 15.0 VM

TDCH: Command Line Edition 1.3.4 for Apache Hadoop 2.6

MapR cluster: Enterprise Edition 4.1 with Hive 0.13, Hiveserver2

MapR client: MapR 4.1 Linux client with the cluster NFS mounted at /mapr

Installation

TDCH must be installed on a system that can submit mapreduce jobs. I installed it on a MapR 4.1 Linux client node, but you could also install it on a cluster node. You'll need to create a free Teradata Developer Exchange account to download TDCH. Once you're logged in to Developer Exchange, go to the Teradata Connector for Hadoop (Command Line Edition) download page and select Teradata Connector for Hadoop Apache 2.6 to download the TDCH 1.3.4 rpm.

After copying the rpm to the MapR client node, install it as root:

# rpm -ivh teradata-connector-1.3.4-hadoop2.6.noarch.rpm

Preparing...                ########################################### [100%]

  1:teradata-connector     ########################################### [100%]

Installation of TDCH 1.3.4 into directory /usr/lib/tdch/1.3 complete

Now that you have TDCH installed, you can view the extensive documentation in /usr/lib/tdch/1.3/README. Much more than a typical README, this document provides complete documentation on the many TDCH options as well as several examples.  

Scripts

With some minor modifications, I’ve put the hive and hdfs examples from the TDCH README into easily configurable scripts so you can start using TDCH in your environment.  The scripts are available on github at https://github.com/lerner/tdch-examples.  Log in as a non-privileged user (not mapr or root), and download the scripts.

The scripts can be used with your MapR and Teradata setup by exporting the following shell environment variables. Alternatively, the default values for these variables can be modified in tdch_env.sh.  None of the other scripts need to be changed.

TD_SERVER: The resolvable hostname or IP address of the Teradata system or virtual machine.

TD_DBC_PWD: The Teradata dbc password. The default “dbc” will work with the TeradataExpress virtual machine. This password is only used to create TD_USER below.  If you don’t want to put your dbc password in a script, you can create TD_USER manually and ignore errors from tdch_setup.sh.

TD_USER: The Teradata database/user. The tdch_setup.sh script will create this user. The default value will work fine assuming you don’t already have a “sampleuser” defined.

TD_PWD: The password that will be set for TD_USER when it is created.  The default value will work fine but can be changed if desired.

TD_TABLE: The name of the Teradata table that will be created. If not modified, sales_transaction will be used as the table name.

TD_ADMIN: The Linux administrative user on the Teradata server or virtual machine.

HADOOP_USER: The Linux user that will be submitting the TDCH job. This must be a standard user who can run mapreduce jobs, not root or mapr.

HADOOP_PWD: The password for HADOOP_USER.

CLUSTERNAME: The MapR cluster name, a directory under /mapr on the MapR client where the cluster can be accessed via NFS.

HIVE_HOME: The MapR Hive directory, /opt/mapr/hive/hive-<version>.

HIVESERVER2: The MapR node running hiveserver2.

tdch_env.sh: Defines environment variables to specify the Teradata database server, as well as user names and passwords. It also contains a function for creating the sample sales transaction data file for the examples.  All of the other scripts source this script to set their environment. You must export the environment variables defined in this script (or modify the default values) to correspond to your setup.

tdch_setup.sh: Creates a user and sales transaction table on the Teradata database. Also creates the sales transaction data file.

tdch_hdfs.sh: TDCH hdfs examples from the README file. Each function hdfs_N corresponds to example N in section 7.1.1.1 of the TDCH README file.  tdch_setup.sh must be run before any functions in tdch_hdfs.sh are executed.

tdch_hive.sh: TDCH hive examples from the README file. Each function hive_N corresponds to example N in section 7.1.1.2 of the TDCH README file. tdch_setup.sh must be run before any functions in tdch_hive.sh are executed.

tdch_cleanup.sh: Drops the Teradata table and user created by tdch_setup.sh.

Setting environment variables

The tdch_env.sh script is called by the other scripts to set environment variables. It will use existing values if they are already set in your environment; if not, it will fall back to the default values defined in tdch_env.sh.

For example, the Teradata server is defined with the TD_SERVER environment variable. If your Teradata server is tera1.mycompany.com, you can export TD_SERVER with this value before running any of the scripts.

[mruser ~]$ export TD_SERVER=tera1.mycompany.com

Alternatively, you can edit tdch_env.sh to set the default value for this variable.

[mruser ~]$ vi tdch_env.sh

[mruser ~]$ grep TD_SERVER tdch_env.sh

TD_SERVER=${TD_SERVER:-tera1.mycompany.com}    # Teradata server (ip address or hostname)

Modify other variables listed in the table above to correspond to your environment.
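
For instance, using the values that appear in this post (tera1.mycompany.com for the Teradata server and scale61 for the cluster name; the hiveserver2 hostname below is a placeholder):

export TD_SERVER=tera1.mycompany.com        # Teradata server
export TD_USER=sampleuser                   # database/user created by tdch_setup.sh
export TD_PWD=samplepwd                     # password set for TD_USER
export CLUSTERNAME=scale61                  # cluster directory under /mapr
export HIVESERVER2=hs2node.mycompany.com    # placeholder: node running hiveserver2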

Setting up the Teradata database

After setting up your environment variables, log in to the MapR client as $HADOOP_USER (mruser in this example), and run tdch_setup.sh to create the Teradata user and table that TDCH will use for data transfers. You should see successful creation of the Teradata user and table.

[mruser ~]$ ./tdch_setup.sh

TDCH: Logging in to teradata@10.10.99.101

...

create user sampleuser as password=samplepwd  perm=524288000 spool=52428800

0;

*** User has been created.

...

CREATE MULTISET TABLE sales_transaction (

   tran_id INTEGER,

   tran_date DATE,

   customer VARCHAR(100),

   amount DECIMAL(18,2)

);

*** Table has been created.

...

The tdch_setup.sh script assumes the following:

  1. Teradata database is running on $TD_SERVER.

  2. Teradata database user $TD_USER has been created OR user dbc has password $TD_DBC_PWD.

  3. Passwordless ssh is configured from $HADOOP_USER to the $TD_ADMIN user on $TD_SERVER (a setup sketch follows this list).

  4. The MapR cluster is NFS mounted at /mapr/$CLUSTERNAME.

  5. Linux user $HADOOP_USER is running this script and has write permission to /mapr/$CLUSTERNAME/user/$HADOOP_USER.
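
For assumption 3, passwordless ssh can be set up with standard OpenSSH tooling, along these lines (run as $HADOOP_USER on the MapR client; adapt to your site’s key policy):

ssh-keygen -t rsa                  # accept the defaults
ssh-copy-id $TD_ADMIN@$TD_SERVER   # copies the public key to the Teradata server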

HDFS

The tdch_hdfs.sh script contains bash functions to import and export data directly to/from the Hadoop file system.  Make sure that tdch_setup.sh was run successfully before running the hdfs examples.

Export to Teradata

Export data from the sales transaction file (stfile) in Hadoop to the Teradata sales transaction table using hdfs example 2.  The hadoop jar command specifies the TDCH ConnectorExportTool along with parameters including the Hadoop source and Teradata target.  TDCH will invoke a mapreduce job to export the data, and the ConnectorExportTool will have an exit code of 0 upon successful completion.  Invoke this example with the command line “./tdch_hdfs.sh hdfs_2” in the directory where you’ve downloaded the scripts from github.
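
Internally, hdfs_2 runs a hadoop jar command along these lines (a minimal sketch modeled on the README’s hdfs export example; treat the flag values as assumptions and check the script and your README for the exact form):

hadoop jar /usr/lib/tdch/1.3/lib/teradata-connector-1.3.4.jar \
    com.teradata.connector.common.tool.ConnectorExportTool \
    -url jdbc:teradata://$TD_SERVER/database=$TD_USER \
    -username $TD_USER -password $TD_PWD \
    -jobtype hdfs \
    -sourcepaths /user/$HADOOP_USER/data/stfile \
    -targettable $TD_TABLE \
    -nummappers 2 -separator ','

The -nummappers parameter controls how many mappers (and hence parallel Teradata sessions) move the data.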

[mruser ~]$ ./tdch_hdfs.sh hdfs_2

TDCH: Load Hadoop file /mapr/scale61/user/mruser/data/stfile into Teradata table sales_transaction

15/06/29 17:41:49 INFO tool.ConnectorExportTool: ConnectorExportTool starts at 1435624909061

...

15/06/29 17:41:52 INFO mapreduce.Job: Running job: job_1431996639188_0187

15/06/29 17:41:58 INFO mapreduce.Job: Job job_1431996639188_0187 running in uber mode : false

15/06/29 17:41:58 INFO mapreduce.Job:  map 0% reduce 0%

15/06/29 17:42:04 INFO mapreduce.Job:  map 50% reduce 0%

15/06/29 17:42:05 INFO mapreduce.Job:  map 100% reduce 0%

15/06/29 17:42:06 INFO mapreduce.Job: Job job_1431996639188_0187 completed successfully

...

15/06/29 17:42:07 INFO tool.ConnectorExportTool: job completed with exit code 0

Import to Hadoop

Import data from the Teradata sales transaction table to a Hadoop directory using hdfs example 1.  The hadoop jar command specifies the TDCH ConnectorImportTool along with parameters including the Teradata source table and Hadoop target directory.  TDCH will invoke a mapreduce job to import the data and the ConnectorImportTool will have an exit code of 0 upon successful completion.  Invoke this example with the command line “./tdch_hdfs.sh hdfs_1” in the directory where you’ve downloaded the scripts from github.
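
hdfs_1 wraps the corresponding import command with the ConnectorImportTool (again a sketch under the same assumptions; the script deletes any existing target directory first, as the output below shows):

hadoop jar /usr/lib/tdch/1.3/lib/teradata-connector-1.3.4.jar \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -url jdbc:teradata://$TD_SERVER/database=$TD_USER \
    -username $TD_USER -password $TD_PWD \
    -jobtype hdfs \
    -sourcetable $TD_TABLE \
    -targetpaths /user/$HADOOP_USER/data/sales_transaction \
    -nummappers 2 -separator ','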

[mruser ~]$ ./tdch_hdfs.sh hdfs_1

TDCH: Copy Teradata table sales_transaction to Hadoop directory /user/mruser/data/sales_transaction

...

Deleted /user/mruser/data/sales_transaction

15/06/29 18:15:59 INFO tool.ConnectorImportTool: ConnectorImportTool starts at 1435626959583

...

15/06/29 18:16:03 INFO mapreduce.Job: Running job: job_1431996639188_0188

15/06/29 18:16:10 INFO mapreduce.Job: Job job_1431996639188_0188 running in uber mode : false

15/06/29 18:16:10 INFO mapreduce.Job:  map 0% reduce 0%

15/06/29 18:16:16 INFO mapreduce.Job:  map 50% reduce 0%

15/06/29 18:16:17 INFO mapreduce.Job:  map 100% reduce 0%

15/06/29 18:16:17 INFO mapreduce.Job: Job job_1431996639188_0188 completed successfully

...

15/06/29 18:16:17 INFO tool.ConnectorImportTool: job completed with exit code 0

Hive

The Hive examples in tdch_hive.sh assume that hiveserver2 is running with impersonation enabled on your cluster and that Hive is installed on your client so you can run beeline against the Hive database (a quick beeline connectivity check is sketched after the list below).  Before running tdch_hive.sh, ensure the following:

  1. The tdch_setup.sh script has completed successfully.

  2. Hiveserver2 is running on server $HIVESERVER2 with impersonation enabled.

  3. Hive is installed on the local server in $HIVE_HOME.

  4. $HADOOP_USER has write permission to the Hive warehouse directory /user/hive/warehouse.
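
A quick way to confirm that beeline can reach hiveserver2 before running the examples (a minimal sketch using the standard HiveServer2 JDBC URL; port 10000 is the usual hiveserver2 default and an assumption here):

$HIVE_HOME/bin/beeline -u jdbc:hive2://$HIVESERVER2:10000/default \
    -n $HADOOP_USER -p $HADOOP_PWD -e "show tables;"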

Export to Teradata

Export data from the Hive sales transaction table to the Teradata sales transaction table using hive example 4.  This example will use beeline to drop, create, and load the Hive sales_transaction table with sample data.  The hadoop jar command specifies the TDCH ConnectorExportTool along with parameters including the Hive source and Teradata target.  TDCH will invoke a mapreduce job to export the data, and the ConnectorExportTool will have an exit code of 0 upon successful completion.  Invoke this example with the command line “./tdch_hive.sh hive_4” in the directory where you’ve downloaded the scripts from github.
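
The underlying command switches the job type to hive (a sketch only; hive jobs also need the Hive libraries on the classpath, typically via HADOOP_CLASSPATH and -libjars, and the flag values here are assumptions):

hadoop jar /usr/lib/tdch/1.3/lib/teradata-connector-1.3.4.jar \
    com.teradata.connector.common.tool.ConnectorExportTool \
    -url jdbc:teradata://$TD_SERVER/database=$TD_USER \
    -username $TD_USER -password $TD_PWD \
    -jobtype hive -fileformat textfile \
    -sourcedatabase default -sourcetable sales_transaction \
    -targettable $TD_TABLE \
    -nummappers 2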

[mruser ~]$ ./tdch_hive.sh hive_4

drwxrwxrwt   - mapr mapr          3 2015-06-30 15:07 /user/hive/warehouse

TDCH: Drop and load Hive table sales_transaction with data from /mapr/scale61/user/mruser/data/stfile

...

TDCH: Export 3 columns from hive table sales_transaction to Teradata table sales_transaction

15/06/30 15:09:35 INFO tool.ConnectorExportTool: ConnectorExportTool starts at 1435702175240

...

15/06/30 15:09:59 INFO mapreduce.Job:  map 100% reduce 0%

15/06/30 15:09:59 INFO mapreduce.Job: Job job_1431996639188_0190 completed successfully

...

15/06/30 15:10:00 INFO tool.ConnectorExportTool: job completed with exit code 0

Import to Hive

Import data from the Teradata sales transaction table to a Hive table using hive example 2.  This example will use beeline to drop an existing Hive sales_transaction table.  The hadoop jar command specifies the TDCH ConnectorImportTool along with parameters including the Teradata source table and the Hive target table and schema.  TDCH will invoke a mapreduce job to import the data and create a Hive sales_transaction table. The ConnectorImportTool will have an exit code of 0 upon successful completion.  Invoke this example with the command line “./tdch_hive.sh hive_2” in the directory where you’ve downloaded the scripts from github.
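
A sketch of the corresponding import command; the -targettableschema column list below is illustrative only (per the script output the example moves three columns, and the script’s actual schema may differ):

hadoop jar /usr/lib/tdch/1.3/lib/teradata-connector-1.3.4.jar \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -url jdbc:teradata://$TD_SERVER/database=$TD_USER \
    -username $TD_USER -password $TD_PWD \
    -jobtype hive -fileformat textfile \
    -sourcetable $TD_TABLE \
    -targetdatabase default -targettable sales_transaction \
    -targettableschema "tran_id int, tran_date string, customer string" \
    -nummappers 2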

[mruser ~]$ ./tdch_hive.sh hive_2

drwxrwxrwt   - mapr mapr          3 2015-06-30 15:20 /user/hive/warehouse

TDCH: Drop hive sales_transaction table

...

TDCH: Import 3 columns from Teradata table sales_transaction to hive sales_transaction table

15/06/30 15:27:00 INFO tool.ConnectorImportTool: ConnectorImportTool starts at 1435703220163

...

15/06/30 15:27:13 INFO mapreduce.Job:  map 0% reduce 0%

15/06/30 15:27:19 INFO mapreduce.Job:  map 50% reduce 0%

15/06/30 15:27:22 INFO mapreduce.Job:  map 100% reduce 0%

15/06/30 15:27:22 INFO mapreduce.Job: Job job_1431996639188_0192 completed successfully

...

15/06/30 15:27:24 INFO tool.ConnectorImportTool: job completed with exit code 0

The tdch_hive.sh script contains additional examples from the TDCH README file that you can use to better understand TDCH features.

Cleaning up

The tdch_cleanup.sh script removes the sales transaction data files and drops the Teradata table and user created by tdch_setup.sh.
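
Run it the same way as the other scripts, as $HADOOP_USER:

[mruser ~]$ ./tdch_cleanup.sh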

Summary

In this blog post, you’ve implemented examples to transfer data between Hadoop and Teradata using the Teradata Connector for Hadoop. For details on TDCH parameters, see the TDCH README file. If you have any further questions, or want to share how you are using Teradata with MapR, please add your comments in the section below.
