Featured Author

Nicolas A Perez
Software engineer, IPC

I am a software engineer at IPC, an independent SUBWAY® franchisee-owned and operated purchasing cooperative, where I work on the Big Data Platform. I am very interested in Apache Spark, Hadoop, distributed systems, algorithms, and functional programming, especially in the Scala programming language.

In the past, I have done a lot of programming and engineering in C# on the .NET Framework, an environment where I feel very comfortable and knowledgeable. Past work includes payment processing systems, POS systems, and mobile systems. All of them have allowed me to grow professionally in different areas of expertise. 

Sometimes I write blog posts at medium.com/@anicolaspp, which I share on Twitter at @anicolaspp.

Author's Posts

Posted on August 23, 2016 by Nicolas Perez

The Apache Spark community is thriving, and they have put a lot of effort into extending Spark. Recently, we have been interested in transforming an XML dataset into something that's easier to query. Our main interest is being able to do data exploration on top of billions of transactions that we get every day. In this blog post, I'll walk you through how to use an Apache Spark package from the community to read any XML file into a DataFrame.

Posted on July 28, 2016 by Nicolas Perez

Logging in Apache Spark is very easy to do, since Spark offers access to a log object out of the box; only some configuration is needed. In a previous post, we looked at how to do this while identifying some problems that may arise. However, the solution presented might cause some problems when you are ready to collect the logs, since they are distributed across the entire cluster.

Posted on May 10, 2016 by Nicolas Perez

Streaming data is a hot topic these days, and Apache Spark is an excellent framework for streaming. In this blog post, I'll show you how to integrate custom data sources into Spark.

Posted on April 19, 2016 by Nicolas Perez

In this blog post, you'll learn how to do some simple, yet very interesting analytics that will help you solve real problems by analyzing specific areas of a social network. A subset of a Twitter stream was the perfect choice for this demonstration...

Posted on March 17, 2016 by Nicolas Perez

SQL has been around for a while, and people like it. However, the engines that power SQL have changed with time in order to solve new problems and keep up with demands from consumers.

Posted on March 23, 2016 by Nicolas Perez

In my last post, we explained how we could use SQL to query our data stored within Hadoop. Our engine is capable of reading CSV files from a distributed file system, auto-discovering the schema from the files, and exposing them as tables through the Hive metastore. All this was done so that we could connect standard SQL clients to our engine and explore our dataset without manually defining the schema of our files, avoiding ETL work.

Posted on March 1, 2016 by Nicolas Perez

An important part of any application is the underlying log system we incorporate into it. Logs are not only for debugging and traceability, but also for business intelligence. Building a robust logging system within our apps can provide great insights into the business problems we are solving.
