Drilling CSV files - A simple example

Apache Drill provides the capability to perform SQL queries on standard files in their native formats (such as comma-separated files or tab-separated files) without having to prebuild schemas around the data. Here is a quick example of how a .csv file can be queried using Drill without creating any metadata in Hive.

Here is some sample data we picked up from the web. It comes from a government census website that reports total retail revenues and e-commerce revenues.

1st quarter 2014,1147388,71188,6.2,0.2,2.8,2.4,15.0
4th quarter 2013,1144544,69244,6.0,0.6,3.0,3.8,15.7
3rd quarter 2013,1137327,67250,5.9,1.0,3.5,4.7,17.6
2nd quarter 2013,1125990,64962,5.8,0.5,4.9,4.8,17.9
1st quarter 2013,1120085,61911,5.5,1.5,3.4,4.1,15.1
4th quarter 2012,1103156,59862,5.4,1.5,4.7,4.4,13.7
3rd quarter 2012,1086483,57196,5.3,1.2,3.8,4.7,15.8
2nd quarter 2012,1073961,55095,5.1,-0.2,2.4,4.5,14.2
1st quarter 2012,1075649,53788,5.0,1.8,2.2,6.3,15.5
4th quarter 2011,1056364,52628,5.0,1.8,6.6,6.9,16.9
3rd quarter 2011,1037304,4….

For reference, the comma-separated fields are, in order: Quarter; Retail sales, total; Retail sales, e-commerce; e-commerce as a % of total; % change from prior quarter, total; % change from prior quarter, e-commerce; % change from same quarter prior year, total; % change from same quarter prior year, e-commerce.
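Because Drill exposes untyped CSV fields positionally, it can help to alias them to readable names in the SELECT list. A minimal sketch, using the field order above (the alias names are our own labels, not part of the file):

```sql
-- Alias the positional columns[] entries to readable names.
-- The alias names are illustrative; the file itself has no header row.
SELECT columns[0] AS quarter,
       columns[1] AS retail_total,
       columns[2] AS retail_ecommerce,
       columns[3] AS ecommerce_pct_of_total
FROM dfs.`/tmp/mydata/census.csv`;
```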

Once we drop the CSV file into Hadoop, the first query we run is the following. Note that dfs refers to the storage plugin that tells Drillbits which file system to read the data from. The same approach works for tab-separated ("\t") or pipe-separated ("|") data, and storage plugins can also be configured for other data sources, including HBase tables.
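The mapping from file extension to delimiter lives in the dfs storage plugin configuration. A sketch of what the relevant "formats" section can look like (the exact entries in your installation may differ):

```json
"formats": {
  "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
  "tsv": { "type": "text", "extensions": ["tsv"], "delimiter": "\t" },
  "psv": { "type": "text", "extensions": ["tbl"], "delimiter": "|" }
}
```

With entries like these in place, querying a .tsv or .tbl file works exactly like the .csv examples below.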

select * from dfs.`/tmp/mydata/census.csv`;
+------------+
|  columns   |
+------------+
| ["1st quarter 2014","1147388","71188","6.2","0.2","2.8","2.4","15.0"] |
| ["4th quarter 2013","1144544","69244","6.0","0.6","3.0","3.8","15.7"] |
| ["3rd quarter 2013","1137327","67250","5.9","1.0","3.5","4.7","17.6"] |
| ["2nd quarter 2013","1125990","64962","5.8","0.5","4.9","4.8","17.9"] |
| ["1st quarter 2013","1120085","61911","5.5","1.5","3.4","4.1","15.1"] |
| ["4th quarter 2012","1103156","59862","5.4","1.5","4.7","4.4","13.7"] |
| ["3rd quarter 2012","1086483","57196","5.3","1.2","3.8","4.7","15.8"] |
| ["2nd quarter 2012","1073961","55095","5.1","-0.2","2.4","4.5","14.2"] |
| ["1st quarter 2012","1075649","53788","5.0","1.8","2.2","6.3","15.5"] |
| ["4th quarter 2011","1056364","52628","5.0","1.8","6.6","6.9","16.9"] |
| ….. |
| ["1st quarter 2000","740469","5806","0.8","2.3","30.4","9.3","NA"] |
| ["4th quarter 1999","723520","4454","0.6","2.2","NA","9.0","NA"] |
+------------+
58 rows selected (0.365 seconds)

Notice that Drill retrieved all 58 records in well under a second, without any need for a centralized schema. Now let's go one level deeper and add a WHERE clause. Here we retrieve all fourth-quarter rows from the file, skipping any row with an NA in the last field.

select * from dfs.`/tmp/mydata/census.csv` where columns[0] like '%4th%' and columns[7] <> 'NA';
+------------+
|  columns   |
+------------+
| ["4th quarter 2013","1144544","69244","6.0","0.6","3.0","3.8","15.7"] |
| ["4th quarter 2012","1103156","59862","5.4","1.5","4.7","4.4","13.7"] |
| ["4th quarter 2011","1056364","52628","5.0","1.8","6.6","6.9","16.9"] |
| ["4th quarter 2010","988286","45006","4.6","3.1","4.7","6.9","18.0"] |
| ["4th quarter 2009","924159","38155","4.1","0.7","3.3","1.1","15.0"] |
| ["4th quarter 2008","914045","33188","3.6","-8.8","-7.5","-10.0","-7.5"] |
| ["4th quarter 2007","1015228","35888","3.5","1.1","3.9","4.2","19.0"] |
| ["4th quarter 2006","974676","30148","3.1","0.1","5.2","3.9","24.1"] |
| ["4th quarter 2005","938329","24292","2.6","0.5","3.2","5.3","24.0"] |
| ["4th quarter 2004","891125","19588","2.2","2.6","6.4","7.3","26.1"] |
| ["4th quarter 2003","830759","15536","1.9","0.4","5.4","5.0","27.3"] |
| ["4th quarter 2002","791377","12203","1.5","0.4","6.5","0.8","31.5"] |
| ["4th quarter 2001","784983","9283","1.2","3.6","12.3","4.4","19.5"] |
| ["4th quarter 2000","752091","7765","1.0","0.7","5.9","3.9","74.3"] |
+------------+
14 rows selected (0.343 seconds)

As you can see, we used the columns[] array to apply the WHERE clause to our flat file. As a last step, let's select only the quarters where e-commerce contributed more than 5% of total sales, ordered by that percentage.
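One caveat: columns[] values are strings, so a comparison like columns[3] > '5' is lexicographic. It happens to work on this data because every value in that field has a single digit before the decimal point, but a safer variant casts the field to a number first. A sketch, assuming the same file path:

```sql
-- Cast the e-commerce percentage to a number before comparing and sorting,
-- rather than relying on string ordering.
SELECT columns[0], CAST(columns[3] AS FLOAT) AS ecommerce_pct
FROM dfs.`/tmp/mydata/census.csv`
WHERE CAST(columns[3] AS FLOAT) > 5
ORDER BY CAST(columns[3] AS FLOAT) ASC;
```

Note that CAST would fail on fields containing the literal string "NA", so for columns 5 and 7 you would need to filter those rows out first, as in the previous query.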

select columns[0], columns[3] from dfs.`/tmp/mydata/census.csv` where columns[3] > '5' order by columns[3] asc;

+------------------+------------+
|      EXPR$0      |   EXPR$1   |
+------------------+------------+
| 1st quarter 2012 | 5.0        |
| 4th quarter 2011 | 5.0        |
| 2nd quarter 2012 | 5.1        |
| 3rd quarter 2012 | 5.3        |
| 4th quarter 2012 | 5.4        |
| 1st quarter 2013 | 5.5        |
| 2nd quarter 2013 | 5.8        |
| 3rd quarter 2013 | 5.9        |
| 4th quarter 2013 | 6.0        |
| 1st quarter 2014 | 6.2        |
+------------------+------------+
10 rows selected (3.838 seconds)

As this example illustrates, Drill makes it extremely easy to start querying data as it arrives, without any DBA intervention. Look out for more examples of how Drill can help query other types of data, and learn more about Drill on the Drill Wiki pages.

Tags
Drill
SQL
Query Editor
Hive
Beginner