Get Real with Hadoop: Ease of Data Integration

In this blog series, we’re showcasing the top 10 reasons customers are turning to MapR to create new insights and optimize their data-driven strategies. Here’s reason #3: MapR provides ease of data integration through industry-standard interfaces for data access and movement, such as read-write NFS, ODBC, REST APIs, and LDAP. Read-write NFS access is one of the key reasons customers choose MapR.

The MapR Distribution including Apache Hadoop fully supports many open APIs, and one of those is the Network File System (NFS). As many of you already know, NFS provides very easy network access to data on disk, and in the Hadoop world this means data residing in the Hadoop cluster. Data stored in the cluster can be accessed as if it were stored on a standard POSIX-compatible file system.

Hadoop users can therefore mount clusters using NFS to ingest or extract data directly with standard tools, applications, and scripts, and even run useful utilities like grep, awk, and sed. You could, of course, also immediately run analysis on top of ingested data (if the data needs to be static for the analysis, you can use MapR consistent snapshots). For example, you could run a MapReduce job on log files just ingested from an app or web server and write the result as a CSV file, which is then immediately available for inspection by an analyst.
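To make this concrete, here is a minimal Python sketch of what that workflow looks like from a client that has NFS-mounted the cluster. The mount point /mapr/my.cluster.com and the file paths are hypothetical stand-ins; the point is that only plain standard-library file I/O is involved, with no HDFS client or connector in sight.

```python
# A minimal sketch of NFS-based ingest and extract, assuming the cluster is
# NFS-mounted at /mapr/my.cluster.com (cluster name and paths are hypothetical).
import csv
import shutil

MOUNT = "/mapr/my.cluster.com"

# Ingest: copy a local web-server log into the cluster with plain file I/O.
shutil.copy("/var/log/app/access.log", f"{MOUNT}/ingest/access.log")

# Extract: read a result CSV written by a MapReduce job as if it were local.
with open(f"{MOUNT}/results/summary.csv") as f:
    for row in csv.reader(f):
        print(row)
```

Because the mount behaves like any local POSIX file system, the same pattern works from shell scripts, cron jobs, or any legacy application that knows how to read and write files.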

Compared to other Hadoop distributions that only let you import data via the HDFS API, MapR lets you mount the cluster so that legacy and bespoke applications can read, write, and modify data in the Hadoop cluster without any alterations to the applications. Just as importantly, this means that MapR doesn't need as many connectors or bespoke integrations for these applications, and, as you all know, there are many applications that may want to interface with Hadoop.

Now, we at MapR often hear that other distros can use NFS too, or that there is work in progress related to NFS [1]. This is indeed correct, but the devil is in the details. Other distributions may provide some level of read access and a limited ability to write once, but it is harder to support traditional file system usage or run real legacy applications on top of them. MapR has a fundamentally different implementation and architecture in MapR-FS, while remaining 100% compatible with the HDFS API. This is something that we have worked hard at and are very proud of, and the point is that this architecture enables allocation of small block sizes (8 KB) as well as random reads and writes.

It is important to note that HDFS and NFS are fundamentally different semantically, which makes it very hard to get high performance out of NFS on top of HDFS. In contrast, using NFS on MapR-FS is a completely different story, because MapR-FS fully supports the underlying requirements (random reads and writes, small block sizes, concurrent threads, and POSIX semantics) to provide not only NFS access, but also optimal performance.
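Here is a quick sketch of what those semantics buy you in practice: an in-place update at an arbitrary offset, which an append-only file system like HDFS cannot express through its native API. The path below is hypothetical, and the file is assumed to already exist on the NFS mount.

```python
# A minimal sketch of random reads and writes over the NFS mount, assuming a
# hypothetical path to an existing file on the cluster.
path = "/mapr/my.cluster.com/data/records.bin"

with open(path, "r+b") as f:   # open for in-place read/write
    f.seek(4096)               # jump to an arbitrary offset
    f.write(b"\x00" * 8)       # overwrite 8 bytes in place
    f.seek(0)
    header = f.read(16)        # random read from the start of the file
```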

This means that you can use NFS to read and write data of any file size with optimal performance. I have worked with a number of customers where traditional NFS implementations have been the norm. Customers with thousands of small files (0–120 KB) or legacy applications are able to run them unaltered on top of MapR with good performance (e.g., HP Vertica on MapR).
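For the small-file case, here is a hedged sketch of that kind of workload: thousands of files in that size range written with ordinary file I/O against a hypothetical directory on the mount.

```python
# A sketch of a small-file workload over the NFS mount; the directory path
# is a hypothetical stand-in.
import os

out_dir = "/mapr/my.cluster.com/small-files"
os.makedirs(out_dir, exist_ok=True)

for i in range(10_000):
    with open(f"{out_dir}/record-{i:05d}.txt", "w") as f:
        f.write(f"payload for record {i}\n")
```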

If you want to see how easy this is, just take a look at the demo and see something almost every MapR customer uses on a daily basis.

This really simplifies getting started with Hadoop for analyzing your data.

[1] Cloudera has worked on an NFS v4 proxy (available on Cloudera's GitHub, courtesy of Brock Noland), and Brandon Li and others have worked on an NFS v3 proxy under HDFS-4750.

Be sure to check out the complete top 10 list here.
