High Performance C APIs on MapR-DB

C vs. Java APIs

Native languages like C and C++ give tighter control over the memory and performance characteristics of an application than languages with automatic memory management. A well-written C++ program that has intimate knowledge of its memory access patterns and of the machine architecture can run several times faster than a Java program that depends on garbage collection. For these reasons, many enterprise developers with massive scalability and performance requirements tend to use C/C++ rather than Java in their server applications. Hence the need for C APIs for MapR-DB.

With the 4.1 release of the MapR Distribution, we have extended the libMapRClient library to allow users to write applications in C or C++ that can efficiently interact with MapR-DB. A paradoxical side effect is that widely used dynamic languages such as JavaScript and Python also benefit from efficient C access, even though they are not normally viewed as high-performance languages.

Background

A C language API for HBase known as libHBase was released in March last year (https://github.com/mapr/libhbase). This implementation leverages the AsyncHBase Java library to interact with the HBase cluster. Since MapR-DB supports AsyncHBase as well as the synchronous HBase APIs, anyone can use libHBase to talk to MapR-DB as well as to HBase. The libHBase APIs are much faster than the HBase Thrift APIs, but they still incur a serious penalty from embedding Java code in a C program, because this embedding forces data to be copied from C data structures to Java. Even worse, since the MapR database client performs RPCs in native code, applications that use MapR-DB incur this penalty twice: the data must cross from C into Java and then back into native code. The figure below shows how this happens.

 

C API - MapR-DB

The motivation for this project was to bypass the Java layer completely and directly encode the user application data into RPC buffers by calling into the MapR native database client from C directly. The following figure shows how this eliminates the need to cross the JNI barrier twice.

 

C API libMapRClient

Advantages

  • No JVM is spawned
  • No JNI (Java Native Interface) overhead imposed on the application
  • No duplication of data buffers needed to transition between Java and C land
  • No garbage collection uncertainty
  • Tighter control on memory and CPU usage

Asynchronous Architecture

The MapR-DB C APIs are asynchronous, which means that calls return immediately, before any results are received. The alternative is to make all calls wait for completion. Our experience, and that of many others, is that RPC calls that block until completion are a serious impediment to high performance at scale. This was the original reason for the introduction of the AsyncHBase API library. If an application requires a synchronous API, it is very easy to write synchronous wrappers around the asynchronous methods (just invoke the method and wait for the callback). It is much more difficult to convert a synchronous API into a performant asynchronous API.

A practical consequence is that every method that can result in an RPC must accept a callback as an argument.

As an example, here is the core API entry point for any operation that mutates data. The cb argument is the callback, and the mutation argument is where the actual operation is specified.

    HBASE_API int32_t
    hb_mutation_send(
        hb_client_t client,
        hb_mutation_t mutation,
        hb_mutation_cb cb,
        void *extra);
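The synchronous wrapper mentioned above (invoke the method, then wait for the callback) might look roughly like the following sketch. It does not call libMapRClient: `demo_send_async`, `demo_cb` and the worker thread are hypothetical stand-ins for `hb_mutation_send`, its callback type and the library's thread pool, so that the example compiles on its own with POSIX threads. Only the pattern of passing a waiter object through the extra argument is the point.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the library's callback type. */
typedef void (*demo_cb)(int32_t err, void *extra);

struct demo_call { demo_cb cb; void *extra; int32_t result; };

/* Plays the role of a pool thread: "finishes the RPC" and fires the callback. */
static void *demo_worker(void *arg) {
    struct demo_call *call = arg;
    call->cb(call->result, call->extra);
    free(call);
    return NULL;
}

/* Hypothetical asynchronous call: returns at once, completes on another thread. */
static int32_t demo_send_async(int32_t simulated_result, demo_cb cb, void *extra) {
    struct demo_call *call = malloc(sizeof *call);
    if (call == NULL) return ENOMEM;
    call->cb = cb;
    call->extra = extra;
    call->result = simulated_result;
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, demo_worker, call);
    if (rc != 0) { free(call); return rc; }
    pthread_detach(tid);
    return 0;
}

/* State shared between the blocked caller and the callback. */
struct waiter {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    int             done;
    int32_t         err;
};

/* Callback: record the result and wake the waiting caller. */
static void on_complete(int32_t err, void *extra) {
    struct waiter *w = extra;
    assert(w != NULL);
    pthread_mutex_lock(&w->lock);
    w->err = err;
    w->done = 1;
    pthread_cond_signal(&w->done_cv);
    pthread_mutex_unlock(&w->lock);
}

/* Synchronous wrapper: issue the async call, then block until the callback runs. */
int32_t demo_send_sync(int32_t simulated_result) {
    struct waiter w;
    pthread_mutex_init(&w.lock, NULL);
    pthread_cond_init(&w.done_cv, NULL);
    w.done = 0;
    w.err = 0;

    int32_t rc = demo_send_async(simulated_result, on_complete, &w);
    if (rc == 0) {
        pthread_mutex_lock(&w.lock);
        while (!w.done)                      /* loop guards against spurious wakeups */
            pthread_cond_wait(&w.done_cv, &w.lock);
        pthread_mutex_unlock(&w.lock);
        rc = w.err;
    }
    pthread_cond_destroy(&w.done_cv);
    pthread_mutex_destroy(&w.lock);
    return rc;
}
```

With the real library, the waiter would be passed as the extra argument of the asynchronous call and the callback would have the library's own signature; the waiting logic would stay the same.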

The following figure shows how the APIs work internally.

C APIs - HBase and MapR-DB

When these asynchronous methods are invoked, a work item is created and queued for processing on the client side. This work item is picked up as soon as possible by one of the threads in a thread pool. When the response to an RPC call is received, the callback is invoked by the thread pool.

Client applications can often issue requests faster than RPC calls complete, so we need to make sure that the queue of work items does not grow without bound. For this reason, there is a configuration parameter, fs.mapr.pool.queue.max_size (default 10000), which controls the maximum size of the work item queue. This parameter can be modified by updating the /opt/mapr/conf/dbclient.conf file.

 


Whenever the work item queue size reaches this limit, the library returns ENOBUFS errors for the asynchronous calls. The client application is expected to handle this error, and can decide to retry the asynchronous call after some time. Another option is to pass a shared global condition variable to all callbacks via the extra argument so that the callbacks can signal the condition variable as they complete. The completion of any pending callback is a likely indication that the ENOBUFS condition has cleared and that the operation should be retried.
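The retry logic can be pictured with a small single-threaded simulation. Everything in it is hypothetical: `demo_submit` stands in for an asynchronous library call, `demo_complete_one` for a pool thread finishing an RPC, and `DEMO_QUEUE_MAX` for fs.mapr.pool.queue.max_size (set to 4 here so the limit is hit quickly). In a real application, the point marked in the loop is where the submitting thread would sleep or wait on the shared condition variable instead of driving the completion itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_QUEUE_MAX 4   /* stand-in for fs.mapr.pool.queue.max_size */

typedef void (*demo_cb)(int err, void *extra);

struct demo_item { demo_cb cb; void *extra; };

/* Bounded ring buffer of pending work items. */
static struct demo_item demo_queue[DEMO_QUEUE_MAX];
static size_t demo_head = 0, demo_len = 0;

/* Queue a work item, or report ENOBUFS when the queue is full. */
static int demo_submit(demo_cb cb, void *extra) {
    assert(cb != NULL);
    if (demo_len == DEMO_QUEUE_MAX) return ENOBUFS;
    size_t tail = (demo_head + demo_len) % DEMO_QUEUE_MAX;
    demo_queue[tail].cb = cb;
    demo_queue[tail].extra = extra;
    demo_len++;
    return 0;
}

/* Simulate one RPC finishing: dequeue the oldest item and fire its callback.
 * Returns 1 if an item was completed, 0 if the queue was empty. */
static int demo_complete_one(void) {
    if (demo_len == 0) return 0;
    struct demo_item item = demo_queue[demo_head];
    demo_head = (demo_head + 1) % DEMO_QUEUE_MAX;
    demo_len--;
    item.cb(0, item.extra);
    return 1;
}

struct demo_stats { int completed; int retries; };

/* Callback: count successful completions in the shared stats object. */
static void count_completion(int err, void *extra) {
    struct demo_stats *stats = extra;
    if (err == 0) stats->completed++;
}

/* Submit `total` operations, retrying whenever the queue reports ENOBUFS. */
struct demo_stats demo_run(int total) {
    struct demo_stats stats = {0, 0};
    for (int i = 0; i < total; i++) {
        int rc;
        while ((rc = demo_submit(count_completion, &stats)) == ENOBUFS) {
            stats.retries++;
            /* Real code: wait here until some callback signals completion. */
            demo_complete_one();
        }
        if (rc != 0) break;  /* any other error ends the run */
    }
    while (demo_complete_one()) { }  /* drain what is still queued */
    return stats;
}
```

Because a completed callback frees exactly one slot, each retry in this model succeeds on the next attempt; with a real thread pool the retry can race with other submitters, which is why the submit call sits in a loop.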

Performance

Performance was our foremost goal when we started working on this project. On that front, our implementation has the following characteristics:

  • The library does not copy any of the buffers allocated by the user application; it only keeps references to them. These buffers are then encoded directly into RPC buffers. The library therefore expects the user application to give up ownership of these buffers until the callback is invoked. Once the callback is invoked, ownership returns to the application, so the buffers can be destroyed or reused as appropriate.
  • There are config parameters that the client application can tune to trade throughput against resource usage.
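The buffer ownership rule in the first point can be illustrated with a short sketch. It is a single-threaded model, not library code: `demo_put_async` and `demo_deliver` are hypothetical stand-ins for an asynchronous call that stores only a pointer and for the thread pool that later encodes the buffer and fires the callback. The application frees its buffer in the callback and nowhere earlier.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef void (*demo_cb)(int err, void *extra);

/* One pending operation: the "library" stores the pointer, not a copy. */
static struct {
    const char *value;   /* borrowed from the application */
    demo_cb     cb;
    void       *extra;
    int         pending;
} demo_pending;

static size_t demo_encoded_len;  /* what the "library" read from the buffer */

/* Hypothetical async call: remembers the caller's buffer and returns at once. */
static int demo_put_async(const char *value, demo_cb cb, void *extra) {
    if (demo_pending.pending) return -1;
    demo_pending.value = value;
    demo_pending.cb = cb;
    demo_pending.extra = extra;
    demo_pending.pending = 1;
    return 0;
}

/* Stand-in for the thread pool: reads the borrowed buffer, then calls back. */
static void demo_deliver(void) {
    if (!demo_pending.pending) return;
    assert(demo_pending.value != NULL);
    demo_encoded_len = strlen(demo_pending.value);  /* buffer must still be valid */
    demo_pending.pending = 0;
    demo_pending.cb(0, demo_pending.extra);
}

struct put_ctx { char *value; int freed; };

/* Callback: ownership is back with the application, so freeing is safe now. */
static void on_put_done(int err, void *extra) {
    struct put_ctx *ctx = extra;
    (void)err;
    free(ctx->value);
    ctx->value = NULL;
    ctx->freed = 1;
}

/* Returns 0 if the buffer was read intact and released only in the callback. */
int demo_ownership(void) {
    struct put_ctx ctx = { malloc(16), 0 };
    if (ctx.value == NULL) return -1;
    strcpy(ctx.value, "cell-value");

    if (demo_put_async(ctx.value, on_put_done, &ctx) != 0) {
        free(ctx.value);  /* the call was rejected, so ownership never moved */
        return -1;
    }
    /* From here until the callback runs, the buffer must not be freed or changed. */
    demo_deliver();
    return (ctx.freed && demo_encoded_len == 10) ? 0 : -1;
}
```

Carrying the buffer pointer in the context passed through the extra argument is one simple way to let the callback know what to release.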

Our new API has a number of important features that are not available in libHBase:

  • Secure user impersonation while creating connection
  • Adding or modifying column families of a table
  • Setting timestamps or time range for get and scan operations
  • Increment and append mutations
  • Filtering on column family and column name in scan operations
  • For MapR versions >= v5.0: HBase thrift language compliant filter support for get/scan operations (http://hbase.apache.org/0.94/book/thrift.html)

Learn more about creating native applications for MapR-DB here: http://doc.mapr.com/display/MapR/Creating+MapR-DB+Applications+with+C

Getting Started

To help you get started quickly, we have added two sample applications to the installation package. You can find them under the /opt/mapr/examples directory when you install mapr-client. These applications are also available in a GitHub repository: https://github.com/mapr-demos/c-api-sample-applications

Learn more about the sample applications here:

In this blog post, you’ve learned about high performance C APIs on MapR-DB. If you have any questions, please add your comments in the section below.
