Drilling into Healthy Choices

Part one of the two-part series: Rethinking Database Design with Apache Drill

Drill is a SQL engine for (almost) everything: from simple tabular data, to semi-structured data, to even the most complex nested JSON. In this two-part series we will explore what Apache Drill can do and how it lets us rethink database design to make everyone's life easier. In this first installment, we will use the National Nutrient Database (NNDB) to illustrate how to explore a new database with Drill. The NNDB is a convenient, openly available database, and, when coupled with Drill, you can use it without having to be a software engineer. There is no need to figure out how to transform and load the data into a relational database just to query it. You can even use Drill as a data source within any application (e.g. Tableau, Microsoft Excel) that can connect to a database through a standard interface. Let’s get started.

USDA National Nutrient Database
The National Nutrient Database (NNDB) is provided by the USDA to give the public standardized information on foods. The database and all supporting information are documented on the USDA website. Each release of the database and documentation receives a version number; at the time this article was published, the current release is referred to as Standard Reference, Release 27, or SR27. The table structures / file definitions do not appear to change dramatically over time, so it is quite likely that the queries provided here will continue to work with future releases with little or no modification.

The entire database can be downloaded as a zipped package of ASCII files. To follow along hands-on with the rest of this article / tutorial, download the zip file and extract the files to /tmp (or the equivalent on your system). Drill ships with a workspace named tmp which points at the system's temporary folder. When unzipping the files, make sure they are extracted into a folder named sr27asc; all of the commands below reference the tmp workspace and that folder name (e.g. /tmp/sr27asc).

Starting and Configuring Drill to Handle the NNDB
To start your Drill instance you can use either of these two command lines:

  1. sqlline -u jdbc:drill:zk=local
  2. drill-embedded # a shortcut for the previous line

First and foremost, it is important to know that you can configure Drill via its web interface, which is available at http://localhost:8047/ once your Drill instance is running.

The NNDB uses characters that most people don’t use for delimited file formats. The fields are separated by carets (^) and text fields are surrounded by tildes (~). Because of this, we need to add a little bit of configuration so that Drill can properly read / parse these files. Paste this file format definition into the formats section of the dfs storage definition, which can be found in Drill’s web interface under the Storage menu option:

   "nndb": {
   "type": "text",
   "extensions": [ "txt" ],
   "quote": "~",
   "escape": "~",
   "delimiter": "^"
   },
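Before converting anything, it is worth a quick sanity check that Drill now parses these files correctly (this assumes the archive was extracted to /tmp/sr27asc as described above). A raw delimited file shows up in Drill as a single array named columns:

   SELECT columns[0], columns[1] FROM dfs.tmp.`sr27asc/FD_GROUP.txt` LIMIT 5;

If the format definition is working, you should see food group codes and descriptions with no stray carets or tildes.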

Next, let’s convert these tables into Parquet format to simplify the queries. As we do that, we can also add a workspace definition that uses parquet as its default format and allows tables to be created. Workspaces are shorthand references to the locations where files or tables live. Paste the definition below into the workspaces section of your dfs storage definition, and make sure you change the location parameter to the directory where you want to store your tables.

   "nndb": {
   "location": "/opt/drill/nndb",
   "writable": true,
   "storageformat": "parquet"
   },
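As an aside, the output format used by CREATE TABLE AS statements can also be set explicitly for a session via Drill's store.format option (parquet is its default):

   ALTER SESSION SET `store.format` = 'parquet';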

Creating Tables from Delimited Files
For Drill, delimited files are really just tables: they contain rows of data split into columns. Delimited files are very simple, but they are definitely not the most efficient format if you are going to be querying large volumes of data regularly. They can work fine, but if size or speed ever become a concern you will quickly find yourself looking for an alternative storage format. For this use case, we are going to convert all of the delimited files to Parquet format with Drill.

The queries below do this conversion for each of the data files in the NNDB. Note how the source data columns do not have names and are referenced as columns[N], but are given names as part of the create table query. Having named columns makes queries cleaner and easier to read and understand. The table and column names used here are taken from the original documentation for this dataset.

-- Referenced from Page 30 of the sr27_doc.pdf
create table dfs.nndb.food_des(NDB_No, FdGrp_Cd, Long_Desc, Shrt_Desc, ComName, ManufacName, Survey, Ref_desc, Refuse, SciName, N_Factor, Pro_Factor, Fat_Factor, CHO_Factor) as select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6], columns[7], columns[8], columns[9], columns[10], columns[11], columns[12], columns[13] from dfs.tmp.`sr27asc/FOOD_DES.txt`;

-- Referenced from Page 31 of the sr27_doc.pdf 
create table dfs.nndb.fd_group(FdGrp_Cd, FdGrp_Desc) as select columns[0], columns[1] from dfs.tmp.`sr27asc/FD_GROUP.txt`;

-- Referenced from Page 31 of the sr27_doc.pdf
create table dfs.nndb.langual(NDB_No, Factor_Code) as select columns[0], columns[1] from dfs.tmp.`sr27asc/LANGUAL.txt`;

-- Referenced from Page 31-32 of the sr27_doc.pdf
create table dfs.nndb.langdesc(Factor_Code, Description) as select columns[0], columns[1] from dfs.tmp.`sr27asc/LANGDESC.txt`;

-- Referenced from Page 32-33 of the sr27_doc.pdf 
create table dfs.nndb.nut_data(NDB_No, Nutr_No, Nutr_Val, Num_Data_Pts, Std_Error, Src_Cd, Deriv_Cd, Ref_NDB_No, Add_Nutr_Mark, Num_Studies, `Min`, `Max`, DF, Low_EB, Up_EB, Stat_cmt, AddMod_Date, CC) as select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6], columns[7], columns[8], columns[9], columns[10], columns[11], columns[12], columns[13], columns[14], columns[15], columns[16], columns[17] from dfs.tmp.`sr27asc/NUT_DATA.txt`;

-- Referenced from Page 34 of the sr27_doc.pdf 
create table dfs.nndb.nutr_def(Nutr_No, Units, Tagname, NutrDesc, Num_Dec, SR_Order) as select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5] from dfs.tmp.`sr27asc/NUTR_DEF.txt`;

-- Referenced from Page 34-35 of the sr27_doc.pdf 
create table dfs.nndb.src_cd(Src_Cd, SrcCd_Desc) as select columns[0], columns[1] from dfs.tmp.`sr27asc/SRC_CD.txt`;

-- Referenced from Page 35 of the sr27_doc.pdf
create table dfs.nndb.deriv_cd(Deriv_Cd, Deriv_Desc) as select columns[0], columns[1] from dfs.tmp.`sr27asc/DERIV_CD.txt`;

-- Referenced from Page 36 of the sr27_doc.pdf 
create table dfs.nndb.weight(NDB_No, Seq, Amount, Msre_Desc, Gm_Wgt, Num_Data_Pts, Std_Dev) as select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6] from dfs.tmp.`sr27asc/WEIGHT.txt`;

-- Referenced from Page 36-37 of the sr27_doc.pdf 
create table dfs.nndb.footnote(NDB_No, Footnt_No, Footnt_Typ, Nutr_No, Footnt_Txt) as select columns[0], columns[1], columns[2], columns[3], columns[4] from dfs.tmp.`sr27asc/FOOTNOTE.txt`;

-- Referenced from Page 37 of the sr27_doc.pdf
create table dfs.nndb.datsrcln(NDB_No, Nutr_No, DataSrc_ID) as select columns[0], columns[1], columns[2] from dfs.tmp.`sr27asc/DATSRCLN.txt`;

-- Referenced from Page 37-38 of the sr27_doc.pdf 
create table dfs.nndb.data_src(DataSrc_ID, Authors, Title, `Year`, Journal, Vol_City, Issue_State, Start_Page, End_Page) as select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6], columns[7], columns[8] from dfs.tmp.`sr27asc/DATA_SRC.txt`;

If you put all of these queries in a file called nndb.sql, you can have Drill run them all at once:

   drill-embedded --run=nndb.sql

Otherwise, just paste them all into the command line and they will run in order. This should take about 15 seconds.

Working with the Data
Now that we have the data in a convenient form, we are free to do anything our hearts desire with this dataset. From inside Drill, you can refer to these tables as dfs.nndb.TABLE_NAME. The new Parquet files containing the data from the original files now live in the directory you pointed the nndb workspace at when you configured it.
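To save yourself from typing the dfs.nndb prefix on every query, you can also make it your default schema for the session:

   USE dfs.nndb;
   -- tables can now be referenced by name alone:
   SELECT * FROM food_des LIMIT 5;

The examples below spell out the full dfs.nndb prefix so they work regardless of your current schema.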

Here is an entity relationship diagram which should hopefully make querying this data a little easier.

When starting to work with a new dataset that I do not fully understand, I tend to explore the data and do what some people call introspection. This is really helpful for getting a feel for the data. When doing this, I commonly limit my queries to return only about 20 rows; it is much easier to understand the data that way than to watch thousands of lines scroll by.
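For example, a first look at the food descriptions might be as simple as:

   SELECT NDB_No, FdGrp_Cd, Long_Desc FROM dfs.nndb.food_des LIMIT 20;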

In this dataset, the first thing to point out is that the Food Description (food_des) table and the Nutritional Data (nut_data) table are the hubs for all of the data in this database. The complete definitions of all the fields in these tables can be found in the SR27 document.
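Since every row in food_des carries a food group code, one quick way to get oriented is to count how many foods fall into each group (a small sketch using the tables created above):

   SELECT fg.FdGrp_Desc, COUNT(*) AS foods
   FROM dfs.nndb.food_des fd
   JOIN dfs.nndb.fd_group fg ON (fd.FdGrp_Cd = fg.FdGrp_Cd)
   GROUP BY fg.FdGrp_Desc
   ORDER BY foods DESC;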

The langual table and the langdesc table contain the generalized definitions of the items listed in the Food Description table. Run this query to join the two tables together to get an idea of the data in those tables:

   SELECT lf.NDB_No, ld.Description FROM dfs.nndb.langual lf, dfs.nndb.langdesc ld 
   where lf.FACTOR_CODE=ld.FACTOR_CODE limit 20;

This query could have been written with the JOIN USING syntax:

   SELECT lf.NDB_No, ld.Description FROM dfs.nndb.langual lf LEFT JOIN 
   dfs.nndb.langdesc ld USING (FACTOR_CODE) limit 20;

There are footnotes related to food descriptions and there are footnotes related to nutrition data. These tables can be joined using this query:

   SELECT * FROM dfs.nndb.nut_data nd, dfs.nndb.footnote fn where
   nd.ndb_no=fn.ndb_no and nd.nutr_no=fn.nutr_no and nd.Nutr_No is not null limit 10;

The final query joins the core tables to give a better view of the data:

   SELECT * FROM
   dfs.nndb.food_des fd
   LEFT JOIN dfs.nndb.fd_group fg ON (fg.FdGrp_Cd=fd.FdGrp_Cd)
   LEFT JOIN dfs.nndb.weight w ON (fd.NDB_No=w.NDB_No)
   LEFT JOIN dfs.nndb.nut_data nd ON (fd.NDB_No=nd.NDB_No)
   LEFT JOIN dfs.nndb.src_cd sc ON (nd.Src_Cd=sc.Src_Cd)
   LEFT JOIN dfs.nndb.nutr_def ndd ON (nd.Nutr_No=ndd.Nutr_No)
   LEFT JOIN dfs.nndb.deriv_cd dc ON (nd.Deriv_Cd=dc.Deriv_Cd)
   ORDER BY fd.NDB_No ASC limit 20;

With this query, the data really starts to take shape. It becomes a little easier to see what is in this database when everything is joined together.
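In practice you will rarely want SELECT *; you will project the handful of columns you care about and filter on a food of interest. For example, to list the nutrients recorded for cheddar-style cheeses (the LIKE pattern here is just an illustrative guess at the description text):

   SELECT fd.Long_Desc, ndd.NutrDesc, nd.Nutr_Val, ndd.Units
   FROM dfs.nndb.food_des fd
   JOIN dfs.nndb.nut_data nd ON (fd.NDB_No = nd.NDB_No)
   JOIN dfs.nndb.nutr_def ndd ON (nd.Nutr_No = ndd.Nutr_No)
   WHERE fd.Long_Desc LIKE '%heddar%'
   LIMIT 20;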

What’s Next?
Now that you have an understanding of how to query this database, you can use it to explore the nutritional information of products you consume or have considered trying. You could even calculate the complete nutritional information for a recipe card.
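Here is a sketch of that recipe idea. NNDB nutrient values are recorded per 100 grams of food, and our CREATE TABLE statements loaded every column as text, so the values need a CAST before doing arithmetic. The NDB numbers and gram amounts below are hypothetical placeholders for a two-ingredient recipe, supplied inline via the VALUES clause:

   SELECT ndd.NutrDesc, ndd.Units,
          SUM(CAST(nd.Nutr_Val AS DOUBLE) * r.grams / 100.0) AS recipe_total
   FROM (VALUES ('01009', 50.0), ('11124', 100.0)) AS r(ndb_no, grams)
   JOIN dfs.nndb.nut_data nd ON (nd.NDB_No = r.ndb_no)
   JOIN dfs.nndb.nutr_def ndd ON (nd.Nutr_No = ndd.Nutr_No)
   GROUP BY ndd.NutrDesc, ndd.Units;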

As we have seen, the NNDB is a convenient, openly available database, and, coupled with Drill, you can query it without being a software engineer and without loading it into a relational database first. You can even use Drill as a data source within any application (e.g. Tableau, Microsoft Excel) that can connect to a database through a standard interface using the JDBC or ODBC connectors.

You don’t have to transform your data in order to query it with Drill. If a dataset outgrows your laptop, you can go from running Drill queries locally to running them on a cluster. Drill can even join across data sources in a single query (e.g. join CSV with Parquet with JSON with HBase).

In part two of this series we will dig deeper into this dataset and look at more complex functionality that is available for use from within Drill. We will also explore how Drill enables us to evolve to a SQL + NoSQL approach.

Do you have a great new idea for how to leverage Drill or even this dataset? Please share it below for everyone to read. I would love to hear about it.
