Guidelines for HBase Schema Design

In this blog post, I’ll discuss how HBase schema design differs from traditional relational schema modeling, and I’ll provide some guidelines for proper HBase schema design.

Relational vs. HBase Schemas

There is no one-to-one mapping from relational databases to HBase. In relational design, the focus and effort center on describing the entity and its interactions with other entities; the queries and indexes are designed later.

With HBase, you have a “query-first” schema design: all possible queries should be identified first, and the schema model designed accordingly. You should design your HBase schema to take advantage of the strengths of HBase. Think about your access patterns, and design your schema so that the data that is read together is stored together. Remember that HBase is a distributed datastore designed to run on a cluster.

HBase vs. relational, in summary:

  • Data is distributed across the cluster, and data that is accessed together is stored together
  • Schema design is query-centric, so focus on how the data is read
  • Design for the questions

In a relational database, you normalize the schema to eliminate redundancy by putting repeating information into a table of its own. This has the following benefits:

  • You don’t have to update multiple copies when an update happens, which makes writes faster.
  • You reduce the storage size by having a single copy instead of multiple copies.

However, normalization requires joins. Since data has to be retrieved from more tables, queries can take more time to complete.

In the example below, we have an order table that has a one-to-many relationship with an order items table. The order items table has a foreign key referencing the id of the corresponding order.



In a de-normalized datastore, you store in one table what would be multiple tables in a relational world. De-normalization can be thought of as a replacement for joins. Often with HBase, you de-normalize or duplicate data so that data is accessed and stored together.

Parent-Child Relationship – Nested Entity

Here is an example of denormalization in HBase: if your tables exist in a one-to-many relationship, it’s possible to model them in HBase as a single row. In the example below, the order and its related line items are stored together and can be read together with a get on the row key. This makes reads a lot faster than joining tables together.

[Figure: HBase nested entity schema]

The row key corresponds to the parent entity id, the OrderId. There is one column family for the order data and one column family for the order items. The order items are nested: the order item IDs go into the column qualifiers, and any non-identifying attributes go into the values.
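To make this concrete, here is a minimal sketch using the HBase 1.x Java client API. The table name (order), the column family names (data and items), and all ids and values are assumptions for illustration, not the exact schema from the figure:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NestedOrderExample {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table orders = conn.getTable(TableName.valueOf("order"))) {

      // The row key is the parent entity id (the OrderId).
      Put put = new Put(Bytes.toBytes("order123"));
      // One column family holds the order attributes...
      put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("date"), Bytes.toBytes("2015-06-01"));
      // ...and one holds the nested order items: the item id is the column
      // qualifier, and the non-identifying attributes go into the value.
      put.addColumn(Bytes.toBytes("items"), Bytes.toBytes("item1"), Bytes.toBytes("widget,qty=2"));
      put.addColumn(Bytes.toBytes("items"), Bytes.toBytes("item2"), Bytes.toBytes("gadget,qty=1"));
      orders.put(put);

      // A single get on the row key returns the order and all of its line items.
      Result result = orders.get(new Get(Bytes.toBytes("order123")));
      result.getFamilyMap(Bytes.toBytes("items")).forEach((qualifier, value) ->
          System.out.println(Bytes.toString(qualifier) + " -> " + Bytes.toString(value)));
    }
  }
}
```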

This kind of schema design is appropriate when the only way you get at the child entities is via the parent entity.

Many-to-Many Relationship in an RDBMS

Here is an example of a many-to-many relationship in a relational database. These are the query requirements:

  • Get name for user x
  • Get title for book x
  • Get books and corresponding ratings for userID x
  • Get all userIDs and corresponding ratings for book y

[Figure: HBase book store example]

Many-to-Many Relationship in HBase

The queries that we are interested in are:

  • Get books and corresponding ratings for userID x
  • Get all userIDs and corresponding ratings for book y

For an entity table, it is pretty common to have one column family storing all the entity attributes, and separate column families storing the links to other entities.

The entity tables are as shown below:

[Figure: HBase user table]
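As a hedged sketch of this pattern (a fragment assuming the imports from the earlier example, an open Table named userTable, and illustrative table, family, and id names): the user table has an info family for the attributes and a ratings family whose column qualifiers are book ids.

```java
Put put = new Put(Bytes.toBytes("user1"));
// One column family for the entity attributes...
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Carol"));
// ...and one for the links: book id in the qualifier, rating in the value.
put.addColumn(Bytes.toBytes("ratings"), Bytes.toBytes("book42"), Bytes.toBytes("5"));
userTable.put(put);

// "Get books and corresponding ratings for userID x" is then a single get:
Result r = userTable.get(new Get(Bytes.toBytes("user1")));
java.util.NavigableMap<byte[], byte[]> bookRatings = r.getFamilyMap(Bytes.toBytes("ratings"));
```

The book table would be the mirror image: a ratings family whose qualifiers are user ids, answering "get all userIDs and corresponding ratings for book y" with one get.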

Generic Data, Event Data, and Entity-Attribute-Value    

Generic, schemaless data is often expressed as name-value pairs or entity-attribute-value (EAV) triples. In a relational database, this is complicated to represent. A conventional relational table consists of attribute columns that are relevant for every row in the table, because every row represents an instance of a similar object. A different set of attributes represents a different type of object, and thus belongs in a different table. The advantage of HBase is that you can define columns on the fly, put attribute names in column qualifiers, and group data by column families.

Here is an example of clinical patient event data. The row key is the patient ID plus a timestamp. The variable event type goes in the column qualifier, and the event measurement goes in the column value. OpenTSDB uses a similar pattern for variable system monitoring data.

[Figure: HBase OpenTSDB example]
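A sketch of writing one patient event with the Java client; the key layout (patient id and timestamp joined with an underscore), the family name, and the open Table named eventsTable are assumptions for illustration:

```java
// Row key = patient ID + timestamp; event type in the column qualifier,
// measurement in the column value.
long ts = System.currentTimeMillis();
Put event = new Put(Bytes.toBytes("patient007_" + ts));
event.addColumn(Bytes.toBytes("event"), Bytes.toBytes("heart-rate"), Bytes.toBytes("72"));
event.addColumn(Bytes.toBytes("event"), Bytes.toBytes("temperature"), Bytes.toBytes("98.6"));
eventsTable.put(event);
```

Because the patient id leads the row key, a row-prefix scan returns all of that patient's events in timestamp order.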

Self-Join Relationship – HBase

A self-join is a relationship in which both match fields are defined in the same table.

Consider a schema for Twitter relationships, where the queries are: which users does userX follow, and which users follow userX? Here’s a possible solution: the user IDs are put in a composite row key with the relationship type as a separator. For example, Carol follows Steve Jobs and Carol is followed by BillyBob. This allows row key scans on the prefix carol:follows or carol:followedby.

Below is the example Twitter table:

[Figure: HBase Twitter example]
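Here is a sketch of the "who does Carol follow?" query as a row-prefix scan; the composite key format (for example, carol:follows:stevejobs) and the open Table named followsTable are assumptions:

```java
// Scan every row whose key starts with "carol:follows" to list the users
// that Carol follows; a "carol:followedby" prefix answers the reverse query.
Scan scan = new Scan();
scan.setRowPrefixFilter(Bytes.toBytes("carol:follows"));
try (ResultScanner scanner = followsTable.getScanner(scan)) {
  for (Result row : scanner) {
    System.out.println(Bytes.toString(row.getRow()));
  }
}
```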

Tree, Graph Data

Here is an example of an adjacency list or graph, using a separate column for each parent and child:

[Figure: HBase graph example]

Each row represents a node, and the row key is equal to the node id. There is a column family p for parents and a column family c for children. The column qualifiers are equal to the parent or child node ids, and the value is equal to the type of the node. This makes it possible to quickly find a node's parents or children from the row key.
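For illustration, a sketch of reading one node's links; the node id and the open Table named graphTable are assumptions:

```java
// Fetch the node's row; family "p" holds parent links and family "c" holds
// child links, with the linked node id in the column qualifier.
Result node = graphTable.get(new Get(Bytes.toBytes("node42")));
for (byte[] parentId : node.getFamilyMap(Bytes.toBytes("p")).keySet()) {
  System.out.println("parent: " + Bytes.toString(parentId));
}
for (byte[] childId : node.getFamilyMap(Bytes.toBytes("c")).keySet()) {
  System.out.println("child: " + Bytes.toString(childId));
}
```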

There are multiple ways to represent trees; the best one depends on your queries.

Inheritance Mapping

In this online store example, the type of product is a prefix in the row key. Some of the columns are different and may be empty, depending on the type of product. This makes it possible to model different product types in the same table and to scan easily by product type.

[Figure: HBase inheritance mapping]
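As a sketch (the type prefix, ids, column names, and the open Table named productTable are assumptions): the prefix keeps all products in one table while letting a prefix scan pull back a single type.

```java
// A book row: the "bk" type prefix leads the row key, and the columns carry
// book-specific attributes that other product types may leave empty.
Put book = new Put(Bytes.toBytes("bk_b123"));
book.addColumn(Bytes.toBytes("d"), Bytes.toBytes("title"), Bytes.toBytes("Some Title"));
book.addColumn(Bytes.toBytes("d"), Bytes.toBytes("author"), Bytes.toBytes("Some Author"));
productTable.put(book);

// Scanning by product type is a row-prefix scan on the type prefix.
Scan books = new Scan();
books.setRowPrefixFilter(Bytes.toBytes("bk_"));
```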

Data Access Patterns

Use Cases: Large-scale offline ETL analytics and generating derived data

In analytics, data is written multiple orders of magnitude more frequently than it is read. Offline analysis can also be used to provide a snapshot for online viewing. Offline systems don’t have a low-latency requirement; that is, a response isn’t expected immediately. Offline HBase ETL data access patterns, such as MapReduce or Hive jobs, are characterized by high-latency reads and high-throughput writes.

[Figure: HBase data access patterns]

Data Access Patterns

Use Cases: Materialized view, pre-calculated summaries

To provide fast reads for online web sites, or an online view of data from data analysis, MapReduce jobs can reorganize the data into different groups for different readers, or materialized views. Batch offline analysis can also be used to provide a snapshot for online views. This pattern means high-throughput batch offline writes and low-latency online reads.

[Figure: HBase use cases]

Examples include:

  • Generating derived data
  • Duplicating data for reads in HBase schemas
  • Delayed secondary indexes

Schema Design Exploration:

  • Raw data from HDFS or HBase
  • MapReduce for data transformation and ETL from raw data
  • Use bulk import from MapReduce to HBase (see the sketch below)
  • Serve data for online reads from HBase

Designing for reads means aggressively de-normalizing data so that the data that is read together is stored together.
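Here is a hedged sketch of the bulk-import step using the HBase 1.x-era MapReduce utilities. The table name, output path, and the EtlDriver and MyEtlMapper classes are assumptions, and a real job would also configure its input:

```java
// Configure a MapReduce job that writes HFiles sized and sorted to match the
// target table's regions, then load them with the completebulkload tool.
Job job = Job.getInstance(conf, "etl-to-hbase");
job.setJarByClass(EtlDriver.class);
job.setMapperClass(MyEtlMapper.class);                 // hypothetical mapper emitting Puts
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));

try (Connection conn = ConnectionFactory.createConnection(conf);
     Table table = conn.getTable(TableName.valueOf("myTable"));
     RegionLocator locator = conn.getRegionLocator(TableName.valueOf("myTable"))) {
  // Sorts the job output and partitions it to match the table's regions.
  HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
}
job.waitForCompletion(true);

// Then move the HFiles into the table (run from the shell):
//   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles myTable
```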

Data Access Patterns

Lambda Architecture

The Lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.

[Figure: Lambda architecture]

MapReduce jobs are used to create artifacts useful to consumers at scale. Incremental updates are handled in real time by processing updates to HBase in a Storm cluster, and are applied to the artifacts produced by MapReduce jobs.

The batch layer precomputes the batch views; queries read results from these precomputed views. Each precomputed view is indexed so that it can be accessed quickly with random reads.

The serving layer indexes the batch view and loads it up so it can be efficiently queried to get particular values out of the view. A serving layer database only requires batch updates and random reads. The serving layer updates whenever the batch layer finishes precomputing a batch view.

You can do stream-based processing with Storm and batch processing with Hadoop. The speed layer only produces views on recent data, covering the last few hours not yet reflected in the batch views. In order to achieve the lowest latencies possible, the speed layer doesn’t look at all the new data at once. Instead, it updates the real-time views as it receives new data, rather than recomputing them the way the batch layer does. In the speed layer, HBase provides the ability for Storm to continuously increment the real-time views.
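HBase's atomic counters are a natural fit here. A minimal sketch of a speed-layer update as each event arrives; the table handle viewTable, row, and column names are illustrative assumptions:

```java
// Atomically add 1 to the real-time view for this entity; HBase applies the
// increment server-side, so concurrent Storm bolts don't clobber each other.
long updated = viewTable.incrementColumnValue(
    Bytes.toBytes("page123"),      // row key: the entity being counted
    Bytes.toBytes("stats"),        // column family
    Bytes.toBytes("views"),        // column qualifier
    1L);                           // amount to add
```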

How does Storm know to process new data in HBase? A “needs work” flag is set, and processing components scan for these notifications and process the new data as it enters the system.

MapReduce Execution and Data Flow

[Figure: MapReduce execution and data flow]

The flow of data in a MapReduce execution is as follows (a minimal word-count sketch follows the list):

  1. Data is loaded from the Hadoop file system (HDFS)
  2. Next, the job defines the input format of the data
  3. Data is then split between different map() methods running on all the nodes
  4. Then record readers parse out the data into key-value pairs that serve as input into the map() methods
  5. The map() method produces key-value pairs that are sent to the partitioner
  6. When there are multiple reducers, the mapper creates one partition for each reduce task
  7. The key-value pairs are sorted by key in each partition
  8. The reduce() method takes the intermediate key-value pairs and reduces them to a final list of key-value pairs
  9. The job defines the output format of the data
  10. Data is written to the Hadoop file system
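To ground these steps, here is a minimal, self-contained word-count job. The class names and paths are placeholders, but the pieces map onto the steps above: input format (step 2), map() (steps 3-5), the default hash partitioner and sort (steps 6-7), reduce() (step 8), and output format (step 9):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  // Steps 3-5: record readers hand each line to map() as a key-value pair;
  // map() emits (word, 1) pairs that flow to the partitioner.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Step 8: reduce() collapses each key's sorted values into a final pair.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setInputFormatClass(TextInputFormat.class);   // step 2: input format
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);            // steps 6-7 use the default partitioner and sort
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class); // step 9: output format
    FileInputFormat.addInputPath(job, new Path(args[0]));   // step 1: read from HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // step 10: write to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```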

In this blog post, you learned how HBase schema design differs from traditional relational schema modeling, and you picked up some guidelines for proper HBase schema design. If you have any questions about this blog post, please ask them in the comments section below.
