A very common use case when working with Hadoop is to store and query simple files (such as CSV or TSV), and then to convert these files into a more efficient format such as Apache Parquet in order to achieve better performance and more efficient storage. In this blog post, I'll show you how to convert a CSV file to Apache Parquet using Apache Drill.
- Columnar format
Let's take a concrete example: there are many interesting open data sources that distribute data as CSV files (or equivalent format). You can store these CSV files in your distributed file system and use them in your applications/jobs/analytics queries. However, this not the most efficient method, especially since this data won't move that often. So instead of simply storing the CSV files, let's copy this information into Parquet.
How to convert CSV files into Parquet files
You can use code to achieve this, as you can see in the ConvertUtils sample/test class. A simpler method for converting CSV files is to use Apache Drill, which lets you save the result of a query as a Parquet file.
Follow the steps below to convert a simple CSV into a Parquet file using Drill:
- Apache Drill : Standalone Apache Drill or use Apache Drill Sandbox from MapR
- Some CSV files, such as the Passenger Dataset from SFO Air Traffic Statistics
Querying the CSV file
Let's execute a basic query:
SELECT * FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/MonthlyPassengerData_200507_to_201503.csv` LIMIT 5; ["200507","ATA Airlines","TZ","ATA Airlines","TZ","Domestic","US","Deplaned","Low Fare","Terminal 1","B","27271\r"] ... ...
As you can see, by default, Drill processes each line as an array of columns, all values being simple string. So if you need to do some operations with these values (projection or where clause) you must use the column index, and cast the value to the proper type. You can see a simple example below:
SELECT columns as `DATE`, columns as `AIRLINE`, CAST(columns AS DOUBLE) as `PASSENGER_COUNT` FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv` WHERE CAST(columns AS DOUBLE) < 5 ;
We are now ready to create our Parquet files using the "Create Table As Select" (aka CTAS):
alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`/stats/airport_data/` AS SELECT CAST(SUBSTR(columns,1,4) AS INT) `YEAR`, CAST(SUBSTR(columns,5,2) AS INT) `MONTH`, columns as `AIRLINE`, columns as `IATA_CODE`, columns as `AIRLINE_2`, columns as `IATA_CODE_2`, columns as `GEO_SUMMARY`, columns as `GEO_REGION`, columns as `ACTIVITY_CODE`, columns as `PRICE_CODE`, columns as `TERMINAL`, columns as `BOARDING_AREA`, CAST(columns AS DOUBLE) as `PASSENGER_COUNT` FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`
That's it! You now have a Parquet file, which is a single file in our case, since the dataset is really small. Apache Drill will create multiples files for the tables, depending on the size and configuration your environment.
I invite you to read this chapter in the Apache Drill documentation to learn more about Drill and Parquet.
Query Parquet Files
Now that you have created your Parquet files, you can use them in any of your Hadoop processes. You can also use the Parquet files in Drill, as follows:
SELECT * FROM dfs.tmp.`/stats/airport_data/*`
In this article, you learned how to convert a CSV file to Apache Parquet using Apache Drill. Keep in mind that you can do this with any source supported by Drill (for example, from JSON to Parquet), or even a complex join query between multiple data sources. You can also chose a different output format, such as JSON or a CSV.
Do you have any questions or comments about this process? Ask them in the comments section below.