Cask Data Application Platform (CDAP) is an open source framework to build and deploy data applications on Apache™ Hadoop®.
CDAP is an abstraction layer on top of Hadoop and other open source infrastructure such as HBase, Hive, Tephra, and Tigon. It enables developers to rapidly build, and operations teams to easily manage, real-time and batch data applications.
CDAP is oriented around the concepts of Datasets, Applications, and Services, and is supported by Tools, Packs, and Sample Apps.
CDAP Datasets are logical representations of data stored in HDFS and HBase. Datasets provide the layer for writing data from applications, agnostic to the underlying storage engine. They allow you to encapsulate your application's data access patterns in reusable libraries.
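The idea of a storage-agnostic dataset can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the CDAP API: `KeyValueStore`, `InMemoryStore`, and `CounterDataset` are hypothetical names chosen for this example, and the in-memory backend is only a stand-in for an engine such as HBase.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueStore(ABC):
    """Hypothetical storage-engine interface (illustration only, not CDAP)."""

    @abstractmethod
    def read(self, key: str) -> Optional[bytes]:
        """Return the stored value, or None if the key is absent."""

    @abstractmethod
    def write(self, key: str, value: bytes) -> None:
        """Store the value under the key, replacing any previous value."""


class InMemoryStore(KeyValueStore):
    """Stand-in backend; a real backend would wrap an actual storage engine."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def read(self, key: str) -> Optional[bytes]:
        return self._rows.get(key)

    def write(self, key: str, value: bytes) -> None:
        self._rows[key] = value


class CounterDataset:
    """Reusable access pattern: named counters on top of any KeyValueStore."""

    def __init__(self, store: KeyValueStore) -> None:
        self._store = store

    def get(self, name: str) -> int:
        raw = self._store.read(name)
        return int(raw.decode("ascii")) if raw is not None else 0

    def increment(self, name: str, delta: int = 1) -> int:
        # Read-modify-write; a real engine would need an atomic increment.
        updated = self.get(name) + delta
        self._store.write(name, str(updated).encode("ascii"))
        return updated
```

Application code in this sketch depends only on `CounterDataset`, so the backend can be swapped by passing a different `KeyValueStore` implementation to its constructor.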
CDAP Applications consist of programs that use different open-source processing frameworks such as MapReduce, Spark, and real-time Flows. CDAP comes with program containers that integrate each processing framework and provide a standardized way to develop, deploy, and manage programs.
CDAP Services are system-level services that are commonly required to support data and applications in development and production environments. These include application management, metadata management, streams, and security.
Areas of focus:
- Data Collection: A method of getting data into the system, so that it can be processed.
- Data Exploration: One of the most powerful paradigms of Big Data is the ability to collect and store data without knowing details about its structure. These details are only needed at processing time. An important step—between collecting the data and processing it—is exploration; that is, examining data with ad-hoc queries to learn about its structure and nature.
- Data Processing: After data is collected, it is processed, either in real time or in batch, by programs built on frameworks such as Flows, MapReduce, and Spark.
- Data Storage: The results of processing data must be stored in a persistent and durable way that allows other programs or applications to further process or analyze the data. In CDAP, data is stored in datasets using the abstraction layer provided by CDAP, and domain APIs provided by datasets. This allows different data processing paradigms to interact with the dataset in their own way; in turn, this provides the flexibility in processing that a developer is looking for.
- Data Serving: The ultimate purpose of processing data is not to store the results, but to make these results available to people and other applications. For example, a web analytics application may find ways to optimize the traffic on a website. However, these insights are worthless without a way to feed them back to the actual web application. CDAP allows serving datasets to external clients through procedures and services.
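The five areas above can be pictured as stages of one small pipeline. The following is a toy sketch in plain Python, not CDAP code: the function names, the JSON event format, and the in-memory list and dictionary are all assumptions made for illustration, loosely following the web analytics example.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List, Set


def collect(raw_events: List[str], line: str) -> None:
    """Collection: keep the event as received; no schema is needed yet."""
    raw_events.append(line)


def explore(raw_events: Iterable[str]) -> Set[str]:
    """Exploration: an ad-hoc query listing the field names seen so far."""
    fields: Set[str] = set()
    for line in raw_events:
        fields.update(json.loads(line).keys())
    return fields


def process(raw_events: Iterable[str]) -> Counter:
    """Processing: count views per URL, applying structure at read time."""
    return Counter(json.loads(line)["url"] for line in raw_events)


def store(results: Dict[str, int], counts: Counter) -> None:
    """Storage: persist results (a dict stands in for a durable dataset)."""
    results.update(counts)


def serve(results: Dict[str, int], url: str) -> int:
    """Serving: answer an external client's lookup from stored results."""
    return results.get(url, 0)
```

A short usage example under the same assumptions:

```python
raw: List[str] = []
collect(raw, '{"url": "/home", "user": "a"}')
collect(raw, '{"url": "/home", "user": "b"}')
collect(raw, '{"url": "/docs", "user": "a"}')

results: Dict[str, int] = {}
store(results, process(raw))
print(sorted(explore(raw)))      # field names discovered by exploration
print(serve(results, "/home"))   # view count served to a client
```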