Polyglot Data Management
At the Big Data Everywhere conference held in Atlanta, Senior Software Engineer Mike Davis and Senior Solution Architect Matt Anderson from Liaison Technologies gave an in-depth talk titled “Polyglot Data Management,” where they discussed how to build a polyglot data management platform that gives users the flexibility to choose the right tool for the job, instead of being forced into a solution that might not be optimal. They discussed the makeup of an enterprise data management platform and how it can be leveraged to meet a wide variety of business use cases in a scalable, supportable, and configurable way.
Matt began the talk by describing the three components that make up a data management system: structure, governance and performance. “Person data” was presented as a good example when thinking about these different components, as it includes demographic information, sensitive information such as social security numbers and credit card information, as well as public information such as Facebook posts, tweets, and YouTube videos. The data management system components include:
1. Structure: Data is not schema, but it can have a variety of shapes. String cubes, graphs, relational tables, and trees are all examples of different data shapes. However, it’s important to think of shape and structure as separate from the data itself, because a single bit of data can have multiple different shapes. In the case of person data, for example, your social data may best be represented by a stream, a graph could be used for looking at links between friends, relatives and co-workers, and a relational model would work best for demographic information.
Figure: Different types of data shapes
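To make the idea of one dataset with several shapes concrete, here is a minimal Python sketch. It is not from the talk: the sample records and the helper names (`people_in_city`, `build_graph`, `friends_of_friends`, `posts_after`) are invented for illustration, and plain in-memory structures stand in for a real relational store, graph database, and stream.

```python
from collections import defaultdict

# Hypothetical sample data, invented purely for illustration.
people = [
    {"id": 1, "name": "Ann", "city": "Atlanta"},
    {"id": 2, "name": "Bob", "city": "Austin"},
    {"id": 3, "name": "Cho", "city": "Atlanta"},
]
friendships = [(1, 2), (2, 3)]  # undirected "knows" links between person ids
posts = [                       # time-ordered social events
    {"ts": 1, "person_id": 1, "text": "hello"},
    {"ts": 2, "person_id": 3, "text": "at the conference"},
    {"ts": 3, "person_id": 1, "text": "great talk"},
]


def people_in_city(city):
    """Relational shape: filter rows on a column value."""
    return [p["name"] for p in people if p["city"] == city]


def build_graph(edges):
    """Graph shape: adjacency sets built from the friendship links."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def friends_of_friends(graph, person_id):
    """Traverse two hops, excluding the person and their direct friends."""
    direct = graph[person_id]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person_id}


def posts_after(ts):
    """Stream shape: yield events in order, starting after a timestamp."""
    for post in posts:
        if post["ts"] > ts:
            yield post
```

The same person ids appear in all three structures; only the access pattern changes, which is the reason to keep shape separate from the data itself.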
2. Governance: In addition to the data itself, you have metadata about the data. Data management requires governance, so you have to think about security/compliance issues as well as quality issues. Security/compliance areas include encryption, access controls, lineage, and auditing (who saw what data, and when). Quality issues include validation, business rules, cleansing and access. If you are thinking about creating a data management platform, you need to ensure that your data is clean and valid, and that you have the right data at the right time. Matt mentioned that MapR security features have made a lot of headway in this area.
Validation, cleansing, and identity resolution can be applied to demographic data; how do you know if Joe Smith is the same person as Joseph Smith? Being able to run those types of rules and have a system in place that can take that information, cleanse it, and put it into a clean record is vitally important.
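As a rough illustration of the Joe Smith versus Joseph Smith question, the sketch below normalizes names through a small nickname table and treats two records as the same person only when the normalized name and the birth date both agree. This is an assumption-laden toy, not the rule set described in the talk: the `NICKNAMES` table, the `birth_date` field, and the matching rule are all hypothetical, and production identity resolution typically uses many more attributes and fuzzy scoring.

```python
# Hypothetical nickname table; a real system would use a much larger curated list.
NICKNAMES = {"joe": "joseph", "mike": "michael", "matt": "matthew"}


def normalize_name(full_name):
    """Lowercase, strip punctuation, and expand known nicknames."""
    parts = [p.strip(".,") for p in full_name.lower().split()]
    return " ".join(NICKNAMES.get(p, p) for p in parts)


def same_person(a, b):
    """Assumed rule: normalized name and birth date must both match."""
    return (
        normalize_name(a["name"]) == normalize_name(b["name"])
        and a["birth_date"] == b["birth_date"]
    )


def resolve(records):
    """Group raw records into entities keyed by (normalized name, birth date)."""
    entities = {}
    for rec in records:
        key = (normalize_name(rec["name"]), rec["birth_date"])
        entities.setdefault(key, []).append(rec)
    return entities
```

Requiring a second attribute such as birth date keeps two different people who share a name from being merged into one clean record.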
3. Performance: Your data management solution needs to be scalable, fast, fault tolerant and robust.
When looking at the entire data management spectrum, there really is no “silver bullet.” Not every type of data needs all the different properties. There is a wide range of data management solutions, ranging from the “safe” traditional approach of an RDBMS to the more flexible approach found in some of the big data technologies. In terms of polyglot data management, it’s important to pick the right tool for your use case.
Mike Davis then spoke about polyglot data management, which is essentially the ability to choose the correct type of data management solution for the job instead of using the same, possibly ill-fitting solution for everything. Ideally, you want a data management system where you can view your data in any way that’s necessary.
He then discussed the three basic tenets of what you should look for in a polyglot data management platform: Primitives, Specification, and Orchestration.
Primitives include Persist, Define, Query, Event, Flow, Explore, and Secure.
- The Persist component consists of the underlying storage technology and a data access layer for interacting with the lower level APIs. It’s responsible for supporting (at a low level) CRUD operations, indexes, partitioning, cache strategies, lineage, etc.
- Define gives information about the data and is closely related to Persist. Some aspects of Define include shape, data types, validation rules and constraints, relationships/interactions/dependencies, and lifecycle.
- For Query, you need to figure out how to get to the data once you’ve persisted it. The traditional SQL approach used with an RDBMS is available in big data settings through Hive, but it’s not appropriate for accessing data stored as a graph. In that case, you have specialized languages like Gremlin for graph traversal. Likewise, if you’re storing your data in something like Lucene, SQL is not really appropriate; you need some other type of querying mechanism for doing things like match scoring.
- Data management is not static; there are events (this is called a reactive programming model). An event can be considered a catalyst, such as ingest or schedule, and it also can be an effect, such as auditing/logging or error handling.
- Flow describes data in motion: a configurable sequence of activities triggered by events, which the talk illustrated with a diagram of a typical data flow.
  - Processing can be stream or batch.
  - Processing should support concurrency/parallelization where possible.
  - Flow activities should be reusable and configurable.
- Explore is a function of Define, Query and Flow used for discovery that should drive refactoring of the solution. Explore is an iterative process to refine a solution, and can be manual (through UI/visualization tools) or automated (machine learning).
- Secure underlies all components of the system, and provides the framework for security and compliance operations.
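The Event and Flow primitives can be sketched together: an event acts as the catalyst that starts a flow, the flow runs a configurable sequence of reusable activities, and further events (audit, error) are emitted as effects. The Python below is a minimal in-process illustration under those assumptions. The `EventBus` class, the event names, and the activity factories are hypothetical and do not represent the platform described in the talk.

```python
from collections import defaultdict


class EventBus:
    """Tiny synchronous publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in list(self._handlers[event_type]):
            handler(payload)


def make_flow(activities, bus):
    """Build a flow that runs activities in order and emits effect events."""
    def run(record):
        try:
            for activity in activities:
                record = activity(record)
        except Exception as exc:
            # Effect event: error handling.
            bus.publish("error", {"record": record, "error": str(exc)})
            return None
        # Effect event: auditing.
        bus.publish("audit", record)
        return record
    return run


# Reusable activities, configured through their arguments.
def require_field(field):
    def check(record):
        if field not in record:
            raise ValueError(f"missing field: {field}")
        return record
    return check


def uppercase(field):
    def transform(record):
        return {**record, field: record[field].upper()}
    return transform


bus = EventBus()
audit_log, error_log = [], []
bus.subscribe("audit", audit_log.append)
bus.subscribe("error", error_log.append)
# Catalyst event: every "ingest" triggers the flow.
bus.subscribe("ingest", make_flow([require_field("name"), uppercase("name")], bus))

bus.publish("ingest", {"name": "joe smith"})  # succeeds and is audited
bus.publish("ingest", {"city": "Atlanta"})    # fails validation and is logged
```

Because each activity is a small function produced by a factory, the same building blocks can be reordered or reconfigured into other flows without code changes to the flow runner.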
The solution specification ties it all together. The specification:
- Includes a declarative description of the solution, and documents solution requirements.
- Includes domain-specific language that is suitable for a domain expert with no programming experience, is used for designer UI, and supports reuse.
- Is used to create and deploy the application, manage changes, and aid in end-to-end testing.
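One way to picture a declarative specification is as plain data that states what the solution requires, leaving the how to the platform. The sketch below is a hypothetical example rather than the speakers' specification format: the `SPEC` keys, the type names, and `validate_record` are all assumptions made for illustration.

```python
# Hypothetical declarative specification: it describes the data and the flow,
# but contains no processing logic of its own.
SPEC = {
    "entity": "person",
    "shape": "relational",
    "fields": {
        "name": {"type": "string", "required": True},
        "age": {"type": "integer", "required": False, "min": 0},
    },
    "on": "ingest",
    "flow": ["validate", "cleanse", "persist"],
}

# Mapping from the spec's type names to Python types (an assumption).
TYPES = {"string": str, "integer": int}


def validate_record(spec, record):
    """Return a list of human-readable violations; empty means valid."""
    problems = []
    for name, rules in spec["fields"].items():
        if name not in record:
            if rules.get("required"):
                problems.append(f"{name}: required field is missing")
            continue
        value = record[name]
        if not isinstance(value, TYPES[rules["type"]]):
            problems.append(f"{name}: expected {rules['type']}")
            continue
        if "min" in rules and value < rules["min"]:
            problems.append(f"{name}: must be >= {rules['min']}")
    return problems
```

Since the rules live in data, a designer UI could edit them and the same specification could drive deployment, validation, and end-to-end tests.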
The orchestration component processes a solution specification and coordinates all of the other components to execute the specification.
- Orchestration can be performed at design time (deploying applications with Puppet or Ansible, or running DDL to create a design-time definition).
- Orchestration can also occur at run time, when executing a data flow or allocating dynamic resources (YARN).
- Orchestration is also part of resource management, and should provide reactive strategies such as self-healing and re-allocation of resources.
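The design-time and run-time roles of orchestration can be sketched in a few lines of Python. This is a toy under stated assumptions: the `Orchestrator` class, its spec format, and the retry-based "self-healing" are hypothetical stand-ins, and a real platform would delegate deployment and resource allocation to tools such as those named above.

```python
class Orchestrator:
    """Processes a specification and coordinates the named activities."""

    def __init__(self, activities, max_attempts=3):
        self.activities = activities  # name -> callable(record) -> record
        self.max_attempts = max_attempts
        self.deployed = {}

    def deploy(self, spec):
        """Design time: check that every step the spec names is available."""
        missing = [s for s in spec["flow"] if s not in self.activities]
        if missing:
            raise ValueError(f"unknown activities: {missing}")
        self.deployed[spec["name"]] = list(spec["flow"])

    def run(self, name, record):
        """Run time: execute each step, retrying transient failures."""
        for step in self.deployed[name]:
            record = self._with_retry(self.activities[step], record)
        return record

    def _with_retry(self, activity, record):
        # Simple reactive strategy: retry, then give up after max_attempts.
        for attempt in range(1, self.max_attempts + 1):
            try:
                return activity(record)
            except RuntimeError:
                if attempt == self.max_attempts:
                    raise


calls = {"count": 0}


def flaky_persist(record):
    """Simulated storage step that fails on its first call only."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("storage temporarily unavailable")
    return {**record, "persisted": True}


def cleanse(record):
    return {**record, "name": record["name"].strip().title()}


orchestrator = Orchestrator({"cleanse": cleanse, "persist": flaky_persist})
orchestrator.deploy({"name": "person_ingest", "flow": ["cleanse", "persist"]})
result = orchestrator.run("person_ingest", {"name": "  joe smith "})
```

Here the first persist attempt fails and the orchestrator retries it, so the caller still receives a cleansed, persisted record.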