How to Convert a CSV file to Apache Parquet Using Apache Drill

A very common use case when working with Hadoop is to store and query simple files (such as CSV or TSV), and then to convert these files into a more efficient format such as Apache Parquet in order to achieve better performance and more efficient storage. In this blog post, I'll show you how to convert a CSV file to Apache Parquet using Apache Drill.

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. Apache Parquet has the following characteristics:

  • Self-describing
  • Columnar format
  • Language-independent

Let's take a concrete example: there are many interesting open data sources that distribute data as CSV files (or an equivalent format). You can store these CSV files in your distributed file system and use them in your applications/jobs/analytics queries. However, this is not the most efficient method, especially since this data won't change that often. So instead of simply storing the CSV files, let's copy this information into Parquet.

How to convert CSV files into Parquet files

You can use code to achieve this, as you can see in the ConvertUtils sample/test class. A simpler method for converting CSV files is to use Apache Drill, which lets you save the result of a query as a Parquet file.

Follow the steps below to convert a simple CSV into a Parquet file using Drill:


Querying the CSV file
Let's execute a basic query:

SELECT *
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/MonthlyPassengerData_200507_to_201503.csv`

["200507","ATA Airlines","TZ","ATA Airlines","TZ","Domestic","US","Deplaned","Low Fare","Terminal 1","B","27271\r"]

As you can see, by default Drill processes each line as an array of columns, with all values treated as simple strings. So if you need to operate on these values (in a projection or a WHERE clause), you must reference each column by its index and cast the value to the proper type. You can see a simple example below:

SELECT
columns[0] as `DATE`,
columns[1] as `AIRLINE`,
CAST(columns[11] AS DOUBLE) as `PASSENGER_COUNT`
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`
WHERE CAST(columns[11] AS DOUBLE) < 5

We are now ready to create our Parquet files using a "Create Table As Select" (CTAS) statement:

alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`/stats/airport_data/` AS
SELECT
CAST(SUBSTR(columns[0],1,4) AS INT)  `YEAR`,
CAST(SUBSTR(columns[0],5,2) AS INT) `MONTH`,
columns[1] as `AIRLINE`,
columns[2] as `IATA_CODE`,
columns[3] as `AIRLINE_2`,
columns[4] as `IATA_CODE_2`,
columns[5] as `GEO_SUMMARY`,
columns[6] as `GEO_REGION`,
columns[7] as `ACTIVITY_CODE`,
columns[8] as `PRICE_CODE`,
columns[9] as `TERMINAL`,
columns[10] as `BOARDING_AREA`,
CAST(columns[11] AS DOUBLE) as `PASSENGER_COUNT`
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`

That's it! You now have a Parquet file, which in our case is a single file, since the dataset is really small. Apache Drill will create multiple files for the table, depending on the size of the data and the configuration of your environment.
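If you want to influence how Drill writes these files, you can tune its Parquet session options before running the CTAS. Here is a minimal sketch, assuming the `store.parquet.block-size` option; the value below (512 MB) is simply a common default and should be adjusted to your environment:

-- illustrative only: adjust the block size to match your file system's block size
alter session set `store.parquet.block-size` = 536870912;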

I invite you to read the Apache Drill documentation to learn more about Drill and Parquet.

Querying the Parquet files

Now that you have created your Parquet files, you can use them in any of your Hadoop processes. You can also use the Parquet files in Drill, as follows:

SELECT *
FROM dfs.tmp.`/stats/airport_data/*`
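Since the data is now stored with real types, you can also filter directly on the named columns without any casting. Here is a small sketch, assuming the `PASSENGER_COUNT` column created by the CTAS statement above:

-- the column names come from the CTAS projection above
SELECT `YEAR`, `MONTH`, `AIRLINE`, `PASSENGER_COUNT`
FROM dfs.tmp.`/stats/airport_data/*`
WHERE `PASSENGER_COUNT` < 5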


In this article, you learned how to convert a CSV file to Apache Parquet using Apache Drill. Keep in mind that you can do this with any source supported by Drill (for example, from JSON to Parquet), or even with a complex join query between multiple data sources. You can also choose a different output format, such as JSON or CSV, as shown in the sketch below.
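For example, here is a minimal sketch of writing the same data back out as JSON instead of Parquet (the `/stats/airport_data_json/` target path is purely illustrative):

-- hypothetical target directory, shown only to illustrate switching the output format
alter session set `store.format`='json';
CREATE TABLE dfs.tmp.`/stats/airport_data_json/` AS
SELECT * FROM dfs.tmp.`/stats/airport_data/*`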

Do you have any questions or comments about this process? Ask them in the comments section below.


