GraphX is a graph processing library that runs on top of Apache Spark. Developers can use the languages and tools they already know from Spark to implement new types of algorithms that model relationships between objects.

Graph processing is the backbone for many real-world applications, such as:

  • Social Networks: Users of social networks like Facebook and LinkedIn have one or (hopefully) more "friends" or "connections". These sites use graphs to model relationships between users, and run algorithms on these graphs to suggest new connections to users.
  • Networking: Data networking, that is. The internet is built from hundreds of thousands of routers, connected together with millions of links. Graph algorithms help determine the best path for sending data from one user to another.
  • Astrophysics: Researchers use graphs to model the relationships between planets and galaxies to assist with discovery and classification.

Until recently, developers had to choose a language and library that was optimized either for graphs or for traditional table data. However, many use cases require access to both at the same time. For instance, a recommendation algorithm may take a social graph as one input and a table of product ratings as another. Furthermore, the developer writing that recommendation algorithm may want to take advantage of standard machine learning clustering algorithms like k-means.

GraphX and Spark provide a comprehensive platform to solve these kinds of problems. By adding a library of graph functions (GraphX) and a library for machine learning (MLlib) to a platform that understands table data (Spark, Shark), developers can seamlessly develop algorithms that take advantage of all functionality at once.
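To make the recommendation example above concrete, here is a minimal sketch in plain Python. It does not use GraphX or Spark at all; it only illustrates the idea of one algorithm consuming a graph and a table together. The user names, products, ratings, the `recommend` function, and the `min_avg` threshold are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical social graph: each user maps to the set of their friends.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice", "dave"},
    "dave": {"carol"},
}

# Hypothetical ratings table: rows of (user, product, rating).
ratings = [
    ("bob", "book", 5),
    ("carol", "book", 4),
    ("carol", "lamp", 2),
    ("dave", "lamp", 5),
]


def recommend(user, friends, ratings, min_avg=4.0):
    """Suggest products that the user's friends rated highly on average."""
    # Products the user has already rated are not recommended again.
    seen = {product for u, product, _ in ratings if u == user}
    scores = defaultdict(list)
    for u, product, rating in ratings:
        # Join each table row against the graph: keep rows written by friends.
        if u in friends.get(user, set()) and product not in seen:
            scores[product].append(rating)
    averages = {p: sum(r) / len(r) for p, r in scores.items()}
    return sorted(p for p, avg in averages.items() if avg >= min_avg)


print(recommend("alice", friends, ratings))  # prints ['book']
```

In this toy version the graph and the table live in the same process and the join is a simple loop. On real data sets, the appeal of a combined platform is that the same kind of join can run without moving data between a separate graph system and a separate table system.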