DataKitchen DataOps Documentation

DataKitchen Recipes, like their culinary counterparts, represent stepwise processes that use tools and inputs to generate delicious analytic insights. Recipes can be thought of as directed graphs containing a number of discrete nodes, each corresponding to a specific step in a data analytic pipeline. Recipe graph nodes, or steps, are always processed sequentially, just like a culinary recipe.
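As a rough illustration of the graph idea, the sketch below models a recipe as a dependency mapping and derives a sequential processing order with Python's standard graphlib module. The step names and the processing_order helper are hypothetical and are not part of the DataKitchen platform.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
recipe_graph = {
    "load_data": {"configure_infrastructure"},
    "integrate": {"load_data"},
    "analyze": {"integrate"},
    "visualize": {"analyze"},
}


def processing_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


order = processing_order(recipe_graph)
print(order)
```

Because every step here depends on exactly one predecessor, the order is a single chain that starts with configure_infrastructure and ends with visualize.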

Recipe graphs can understandably grow to be quite complex, containing steps for infrastructure configuration, the preparation of data asset inputs, data integrations, analyses, generation of data outputs, and visualization.

Below is an example of a recipe graph associated with a standard production-level data analytic pipeline. As the scope and frequency of analytic demand continue to grow, DataKitchen has learned from firsthand experience the crucial need to implement DataOps best practices in order to maintain, and even improve, quality while keeping pace with stakeholder expectations.

What Makes a Recipe Node?

Recipe nodes, or steps, are modular in that they are distinct mini-processes isolated in their processing from other nodes in their recipe graph. This modularity is achieved through containerization. Containers are lightweight, containing only the data, tools, code, libraries, and settings required to run the process for which they were designed.

Recipe nodes and their containers are delineated in their boundaries by the specific function they were created to complete. Multiple tools can be leveraged within a single recipe node container. For example, a recipe node could include all of the tools, settings, data, code, and variables required to complete the process of collecting a dataset, creating a database schema, creating a database table, and populating the new table with the collected data for downstream querying.

DataOps Best Practice: ReUse & Containerize

Containerization of recipe graph nodes, or steps, is powerful in that containers allow users to work with their preferred tools on the DataKitchen platform. This is possible because each container is configured to run the user's chosen tools in isolation from the other nodes, or steps, within the recipe.

Because the entire processing of a recipe node occurs within its container, that node is guaranteed to run exactly the same regardless of the kitchen in which it is deployed. What is more, containerization allows data analytic teams the flexibility to grow their ranks with new team members who use tools distinct from their existing toolchain.

Recipe Structure

Every recipe has a standard structure composed of four components.

  • Description
  • Variations
  • Graphs and nodes
  • Variables

Recipe Description

Each recipe has a description that appears on the Recipes page and in the recipe's description.json file. The file also includes some default email alert settings.

{
    "description": "",
    "recipe-emails": {
        "email-delivery-service": "aws-email"
    }
}
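For readers who want to inspect this file programmatically, here is a minimal sketch that parses the same JSON with Python's standard json module. The read_description helper is hypothetical; only the key names are taken from the example above.

```python
import json

# The JSON text mirrors the description.json example above.
DESCRIPTION_JSON = """
{
    "description": "",
    "recipe-emails": {
        "email-delivery-service": "aws-email"
    }
}
"""


def read_description(text):
    """Parse description.json content and return (description, delivery service)."""
    data = json.loads(text)
    description = data.get("description", "")
    # Fall back to None when the email settings are absent.
    service = data.get("recipe-emails", {}).get("email-delivery-service")
    return description, service


description, service = read_description(DESCRIPTION_JSON)
print(repr(description), service)
```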

Recipe Variations

Another component that comprises every recipe is its set of Variations. Each variation of a recipe is a subset of nodes within the full recipe graph, which users can configure and run as a version of the full recipe. The variations are defined in a recipe's variations.json file.

Tabs on the Recipes page in the web app provide access to the selected recipe's structure.

Variation Use Case

Imagine a production-level, complex recipe graph much like the one described above. Assume that midway through the sequential processing of recipe nodes, a test implemented for a node that ingests and processes a data input has emailed a warning that test conditions have not been met. A user would want to investigate that node to resolve the error.

Rather than rerun the entire recipe, an inefficient process that consumes time and computing resources, the user can create a new recipe variation that processes only the nodes downstream of the node that raised the warning.
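The node selection behind such a variation can be sketched as a graph traversal. The example below is a minimal illustration with a hypothetical graph and a hypothetical downstream_of helper; it assumes the node under investigation should be rerun along with everything after it, and it does not reflect how variations.json is actually written.

```python
from collections import deque

# Hypothetical graph: each step maps to the steps that run directly after it.
edges = {
    "ingest": ["clean"],
    "clean": ["model", "report"],
    "model": ["publish"],
    "report": ["publish"],
    "publish": [],
}


def downstream_of(edges, start):
    """Return the start node plus every node reachable from it, in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


# Assume the test on "clean" raised the warning.
variation_nodes = downstream_of(edges, "clean")
print(variation_nodes)
```

In this sketch the upstream "ingest" step is left out of the result, which is the saving the use case describes.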

Recipe Graphs

Each recipe also has a component representing the structure of its graph of nodes. The Graphs tab on the Recipes page lists all of the graphs configured for the selected recipe.

The Graphs tab on the Recipes page lists the graphs in use for the selected recipe.

Recipe graphs are easier to understand when visualized. The web app allows users to view and update graphs within each variation. See Graph for more information.

The visualization of a recipe variation graph.

DataOps Best Practice: Add Data & Logic Tests

As is seen in lean manufacturing practices and in DataOps best practices, frequent testing is important in identifying and resolving aberrations as they occur. If each node in a recipe graph includes a test, a user can identify the node generating an error as opposed to investigating the entire data analytic pipeline to diagnose an issue. Tests are classified into three categories: logs, warnings, and failures.
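To make the three categories concrete, the following sketch shows one way a per-node test result could be handled. The Severity enum, the evaluate function, and the behavior attached to each category (a failure stops the step, while a log or warning only reports) are assumptions for illustration, not the platform's documented semantics.

```python
from enum import Enum


class Severity(Enum):
    # Names follow the three categories in the text; the handling is assumed.
    LOG = "log"
    WARNING = "warning"
    FAILURE = "failure"


def evaluate(test_name, passed, severity):
    """Return a message for a test result; raise only when a failure-level test does not pass."""
    if passed:
        return f"{test_name}: passed"
    if severity is Severity.FAILURE:
        raise RuntimeError(f"{test_name}: failed, stopping the step")
    return f"{test_name}: {severity.value}"


row_count = 0  # hypothetical value produced by a node
message = evaluate("row_count_positive", row_count > 0, Severity.WARNING)
print(message)
```

Attaching a test like this to each node narrows a problem to the step that produced it.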

Recipe Variables

The fourth component of every recipe is its set of variables. Recipe variables are overridden by kitchen variables, and in turn override recipe variation variables.

Find recipe variable sets for a selected recipe listed in the Variable Sets tab on the Recipes page. Select Edit from an Actions menu to view all variables within the base variables set or within an override set.
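The override order stated in this section (kitchen over recipe, recipe over variation) can be modeled with Python's collections.ChainMap, where the first mapping that contains a key wins. The variable names and values below are hypothetical, and this is only a sketch of the precedence rule, not of how the platform resolves variables internally.

```python
from collections import ChainMap

# Hypothetical variable sets at each level.
variation_vars = {"source_file": "sales.csv", "region": "us"}
recipe_vars = {"source_file": "sales_2020.csv", "bucket": "dev-bucket"}
kitchen_vars = {"bucket": "prod-bucket"}

# Lookup order follows the text: kitchen first, then recipe, then variation.
resolved = ChainMap(kitchen_vars, recipe_vars, variation_vars)

print(resolved["bucket"])       # set at kitchen and recipe level
print(resolved["source_file"])  # set at recipe and variation level
print(resolved["region"])       # set only at variation level
```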

DataOps Best Practice: Parameterize Your Processes

By leveraging variables, you can parameterize your processes and use your data in a more agile and productive way. Variables allow users to make slight adjustments to their recipe variations to run them differently without having to rebuild them.

For example, a user may wish to run a recipe variation with a data source generated on a specific date. Rather than build a new recipe variation pointing to a data source with a hard-coded date in its source name, the user can create a variable for the data source name and inject the specific date for that dataset version.
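A minimal sketch of this kind of parameterization, using Python's standard string.Template, is shown below. The file name pattern, the run_date variable, and the resolve_source helper are hypothetical; DataKitchen's own variable syntax may differ.

```python
from string import Template

# Hypothetical data source name with a placeholder instead of a hard-coded date.
source_name = Template("sales_${run_date}.csv")


def resolve_source(template, run_date):
    """Inject a specific date into the data source name."""
    return template.substitute(run_date=run_date)


resolved_name = resolve_source(source_name, "2020-01-31")
print(resolved_name)
```

Running the same variation against another dataset version then only requires a different run_date value.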

Recipe variables are far-reaching and can refer to anything from the specific filename of a data source to sensitive credentials that must be kept securely hidden, perhaps even from some members of your team. In the example below at the bottom of the list, the AWS S3 access key has been created as the variable s3accesskey. Note that the value of this variable is stored in the vault.

Security of Sensitive Credentials

DataKitchen gives all members of your data analytics team the flexibility to easily ramp up and tear down development kitchens and their underlying infrastructure, without support from Data Engineers and other IT staff. Given this high degree of freedom for all users, DataKitchen has built a Vault for all customer accounts where sensitive infrastructure credentials can be stored securely. Your team can use these credentials as variables to build their own kitchens without the need to view the actual values.
