DataKitchen DataOps Documentation

Overview

DAGs

DataKitchen Recipes are directed acyclic graphs (DAGs) with steps that can either do data work or provision infrastructure.

Nodes & Edges

  • Nodes: The components of a graph associated with the steps that do work in a recipe are called Nodes. A graph requires at least a single node to be valid. Nodes come in many types and configurations.
  • Edges: Nodes in a graph are connected by Edges, though these are not required. Edges are always connected to two nodes. Edges are also directional in that they are outbound from one Node and inbound to another node. Nodes may be connected to multiple inbound and outbound nodes or none at all.

Node Position

Nodes can also be described based on their position in a graph, which is determined by their Edge connections.

  • Origin Nodes: Origin nodes have no inbound edges. Graphs contain one or more origin nodes. Because edges are directional, graphs commence processing at origin nodes.
  • Intermediate Nodes: Intermediate nodes possess at least one inbound edge and at least one outbound edge.
  • Terminal Nodes: Terminal nodes are graph nodes with no outbound edge. Graphs contain one or more terminal nodes.
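The position definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_nodes` and the representation of edges as `(outbound_node, inbound_node)` pairs are assumptions made for this example, not part of the platform.

```python
from typing import Dict, List, Set, Tuple


def classify_nodes(nodes: List[str], edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group nodes by position, following the definitions in the list above."""
    has_inbound = {dst for _, dst in edges}    # nodes that some edge points to
    has_outbound = {src for src, _ in edges}   # nodes that some edge leaves from
    return {
        "origin": {n for n in nodes if n not in has_inbound},
        "intermediate": {n for n in nodes if n in has_inbound and n in has_outbound},
        "terminal": {n for n in nodes if n not in has_outbound},
    }


# Hypothetical graph: node1 -> node2 -> node3, plus node4 with no edges.
nodes = ["node1", "node2", "node3", "node4"]
edges = [("node1", "node2"), ("node2", "node3")]
positions = classify_nodes(nodes, edges)
```

Under these definitions a node with no edges at all, such as node4 here, counts as both an origin node and a terminal node.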

Graph Size

  • Single-Node Graphs: Graphs may be as small as a single node with no edges. In this example, node1 is both an origin node and a terminal node.
A recipe variation graph consisting of a single node with no edges.

  • Multi-Node Graphs: Graphs can also be very complex, consisting of dozens or even hundreds of nodes and edges.
An example of a complex graph.

Floating Nodes

Multi-Node graphs may also contain floating nodes, which share no edges with any other nodes in the graph.

Graph Structure

Graph nodes can be run in series or in parallel.

  • Series Graph
A simple graph where all nodes run in series.

  • Parallel Graph
A graph with nodes in parallel and in series.

Multiple Recipe Graphs

Recipe graphs are built from the available nodes in a recipe, though not all nodes need to be used in a graph. Each recipe must have at least one graph and may have many, each one included in a defined variation. Variations may share graphs.

Recipes may contain many graphs

Recipes contain one or more saved parameter configurations called variations. Configurations in Recipe Variations override default recipe values for any parameter. One example of a variation override is a graph.

Configuration

Graphs are defined in each recipe's variations.json file under the graph-setting-list.

{
    "variation-list": {
        "single_node_variation": {
            "description": "A Variation consisting of a single node.",
            "graph-setting": "graph_single_node",
            "override-setting": [
                "example"
            ]
        },
        "multi_node_variation": {
            "description": "A Variation consisting of a graph with multiple nodes.",
            "graph-setting": "graph_multiple_nodes",
            "override-setting": []
        }
    },
    "graph-setting-list": {
        "graph_single_node": [
            [
                "node1"
            ]
        ],
        "graph_multiple_nodes": [
            [
                "node1"
            ],
            [
                "node2"
            ]
        ]
    },
    "override-setting-list": {
        "example": {
            "key": "value"
        }
    },
    "mesos-setting-list": {
        "daily_5am_edt": {
            "schedule": "0 5 * * *",
            "scheduleTimeZone": "US/Eastern",
            "epsilon": 1800
        }
    }
}
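
To illustrate how a variation points to its graph, the following Python sketch looks up a variation's graph-setting name and returns the matching entry from graph-setting-list. The helper `graph_for_variation` is hypothetical and not part of any DataKitchen tooling. The JSON is a trimmed copy of the example above, inlined so the sketch runs on its own.

```python
import json
from typing import Any, Dict, List


def graph_for_variation(config: Dict[str, Any], variation: str) -> List[List[str]]:
    """Return the graph definition that a variation's "graph-setting" names."""
    graph_name = config["variation-list"][variation]["graph-setting"]
    return config["graph-setting-list"][graph_name]


# Trimmed copy of the variations.json example (descriptions and overrides omitted).
config = json.loads("""
{
    "variation-list": {
        "single_node_variation": {"graph-setting": "graph_single_node"},
        "multi_node_variation": {"graph-setting": "graph_multiple_nodes"}
    },
    "graph-setting-list": {
        "graph_single_node": [["node1"]],
        "graph_multiple_nodes": [["node1"], ["node2"]]
    }
}
""")

single = graph_for_variation(config, "single_node_variation")
multi = graph_for_variation(config, "multi_node_variation")
```

Because variations reference graphs by name, two variations can share one graph simply by using the same graph-setting value.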

Building Graphs

Graphs are most easily created and edited via DataKitchen's web app.

Adding Components

New nodes of several types can be added to a graph, as can existing recipe nodes that are not yet included in the recipe variation's graph. See Nodes for more information.

Build a graph using several node types.

History of Graph Changes

Graph edits are edits to underlying recipe files contained in a kitchen, which are subject to version control. Each edit to a graph's structure is thus recorded in an update record that includes a detailed file-diff highlighting changes among other metadata.

Deleting Components

Nodes and edges may also be removed from recipe-variation graphs.

Removing Nodes & Edges

Note that removing nodes and edges from a graph only removes these components from the variations associated with that specific graph. These nodes will continue to exist as part of the recipe and will be retained by other graphs that include them.

Change Graph Orientation

The graph for a recipe variation can be displayed in either a top-to-bottom or a left-to-right orientation from the build or run review screens.

Processing Graphs

Graphs process when orders are run for a recipe variation.

Node Status Icons

While a graph is processing, the Order Run Details page displays each node according to its current processing status. The status dictates the node's color and icon.

Process Starting Points

Graphs begin processing at their origin nodes. If more than one origin node exists, the origin nodes begin processing in parallel.

Graph processing commences with the origin nodes, node1 and node2.

Order of Execution

Graphs process in bands or levels, where the nodes in each level must complete before any nodes in the next level can be processed. If any node fails, the execution of the entire graph is halted.

The graph processes all nodes in a level before moving on to process next-level nodes.
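
Level-by-level processing can be sketched as a layered topological sort. The sketch below assumes that a node belongs to the level immediately after the deepest of its predecessors; the documentation does not state exactly how the platform assigns levels, so treat this as one plausible reading. The function name `execution_levels` and the edge-pair representation are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def execution_levels(nodes: List[str], edges: List[Tuple[str, str]]) -> List[List[str]]:
    """Split an acyclic graph into levels; each level runs only after the previous one."""
    inbound: Dict[str, int] = {n: 0 for n in nodes}
    successors: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        inbound[dst] += 1

    level = sorted(n for n in nodes if inbound[n] == 0)  # origin nodes
    levels: List[List[str]] = []
    while level:
        levels.append(level)
        next_level = []
        for node in level:
            for succ in successors[node]:
                inbound[succ] -= 1
                if inbound[succ] == 0:  # every predecessor has been placed
                    next_level.append(succ)
        level = sorted(next_level)
    return levels


# Hypothetical graph: node1 and node2 both feed node3, which feeds node4.
levels = execution_levels(
    ["node1", "node2", "node3", "node4"],
    [("node1", "node3"), ("node2", "node3"), ("node3", "node4")],
)
```

In this example node1 and node2 form the first level, node3 the second, and node4 the third.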

New Graph Execution Process!

A new order of execution is in the 26 May 2020 release! This fast-throughput feature is turned off by default but can be turned on by request.

The following section details the new functionality.

New Order of Execution

Graphs process nodes in serial order in the new fast-throughput processing method. Parallel graphs process nodes sequentially within each path of the graph.

Every node begins processing as soon as its predecessor completes its execution. In this way, the platform does not have to suspend processing in one path of a parallel graph while nodes in another path take significantly longer to complete.

Sequential processing allows node_e and node_g to process while node_d is still in progress.
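
A rough way to picture the effect is to compute when each node would finish if it starts as soon as all of its predecessors are done. The sketch below borrows the node names from the caption, but the graph shape, the extra node_a, and the durations are invented for illustration; `finish_times` is a hypothetical helper, not a platform function.

```python
from typing import Dict, List, Tuple


def finish_times(durations: Dict[str, float], edges: List[Tuple[str, str]]) -> Dict[str, float]:
    """Finish time per node when each node starts once all its predecessors finish.

    Assumes the graph is acyclic; a cycle would cause unbounded recursion.
    """
    predecessors: Dict[str, List[str]] = {n: [] for n in durations}
    for src, dst in edges:
        predecessors[dst].append(src)

    finish: Dict[str, float] = {}

    def resolve(node: str) -> float:
        if node not in finish:
            # Origin nodes have no predecessors and start at time 0.
            start = max((resolve(p) for p in predecessors[node]), default=0.0)
            finish[node] = start + durations[node]
        return finish[node]

    for node in durations:
        resolve(node)
    return finish


# Invented example: node_a fans out into a slow path (node_d)
# and a fast path (node_e -> node_g).
durations = {"node_a": 1.0, "node_d": 10.0, "node_e": 2.0, "node_g": 3.0}
edges = [("node_a", "node_d"), ("node_a", "node_e"), ("node_e", "node_g")]
finish = finish_times(durations, edges)
```

With these made-up durations, node_g finishes at time 6 while node_d is still running until time 11. Under level-based processing, node_g would have had to wait for node_d to complete first.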

Failures

  • Series graph: If a node fails, the processing of a series graph will stop. Any data sink configured for the failed node will not run, and no downstream nodes will process.
  • Parallel graph: If multiple nodes exist in parallel, any path that is parallel to the failed node will continue processing until the path terminates or rejoins the failed path.
A failure in one path of a parallel graph does not prevent other paths from processing.
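
The failure behavior described above can be approximated by collecting every node downstream of the failed node. This sketch assumes that a node where paths rejoin does not process if any of its inbound paths failed, which is how the text above reads. The function name `blocked_by_failure` and the sample graph are illustrative.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple


def blocked_by_failure(edges: List[Tuple[str, str]], failed: str) -> Set[str]:
    """Return the nodes downstream of the failed node, which will not process."""
    successors: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    blocked: Set[str] = set()
    queue = deque([failed])
    while queue:  # breadth-first walk along outbound edges
        for succ in successors[queue.popleft()]:
            if succ not in blocked:
                blocked.add(succ)
                queue.append(succ)
    return blocked


# Hypothetical graph: node_a splits into node_b and node_c,
# which rejoin at node_d before node_e.
edges = [
    ("node_a", "node_b"), ("node_a", "node_c"),
    ("node_b", "node_d"), ("node_c", "node_d"),
    ("node_d", "node_e"),
]
blocked = blocked_by_failure(edges, failed="node_b")
```

Here node_c is parallel to the failed node_b and is not blocked, while node_d (where the paths rejoin) and node_e are.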

Resuming Post-Failure

If a graph's processing fails or is manually stopped, it can be resumed from the point where an error was previously encountered. Specifically, resuming graph processing will only reprocess failed and pending nodes.

In this way, the system can skip nodes that already processed successfully, saving time and compute resources.

Resuming Graph Processing Only Reprocesses Failed Nodes

Nodes that processed successfully as part of the initial failed run will be skipped when resuming.
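
As a minimal sketch of the resume rule, the selection of nodes to reprocess might look like the following. The status strings ("succeeded", "failed", "pending") and the function name are placeholders chosen for this example, not the platform's actual status values.

```python
from typing import Dict, List


def nodes_to_reprocess(statuses: Dict[str, str]) -> List[str]:
    """Select failed and pending nodes; successfully processed nodes are skipped."""
    return [node for node, status in statuses.items() if status in {"failed", "pending"}]


# Hypothetical state of a run that stopped when node2 failed.
statuses = {"node1": "succeeded", "node2": "failed", "node3": "pending"}
resume_nodes = nodes_to_reprocess(statuses)
```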

Best Practices

Use a Node Once per Variation

Recipe nodes can only be referenced once in a variation graph; otherwise, a circular processing cycle is created. Graphs with circular paths will not run.

Cyclic Paths are Not Supported

Graphs with cyclic paths will not run because they fail validation when processing.

The platform does NOT support a graph that references the same node twice, creating a cyclic processing path.
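
A graph can be checked for cyclic paths before it is submitted. The following sketch uses a standard depth-first search with three node states; it is not the platform's own validation, and the function name `has_cycle` is illustrative. It assumes every node named in an edge also appears in the node list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

UNVISITED, IN_PROGRESS, DONE = 0, 1, 2


def has_cycle(nodes: List[str], edges: List[Tuple[str, str]]) -> bool:
    """Return True if following outbound edges can lead back to an earlier node."""
    successors: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
    state = {n: UNVISITED for n in nodes}

    def visit(node: str) -> bool:
        state[node] = IN_PROGRESS
        for succ in successors[node]:
            if state[succ] == IN_PROGRESS:  # back edge: the path loops onto itself
                return True
            if state[succ] == UNVISITED and visit(succ):
                return True
        state[node] = DONE
        return False

    return any(state[n] == UNVISITED and visit(n) for n in nodes)


nodes = ["node1", "node2", "node3"]
acyclic = has_cycle(nodes, [("node1", "node2"), ("node2", "node3")])
cyclic = has_cycle(nodes, [("node1", "node2"), ("node2", "node3"), ("node3", "node1")])
```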

Timing Efficiency

If your graph contains nodes that take longer to process or are less robust than other nodes, you may want to push the processing of these nodes to downstream levels. This can be accomplished by inserting intermediate placeholder nodes that process quickly.

Timing Metrics

The platform keeps a history of timing metrics across runs, both for full graphs and for individual nodes.

Export Graph Image

Graph images may be exported to PNG files from either the build or run views. The exported image will match the graph display at the time of export, including the display of node processing status.

Use the Export Graph function to download a PNG file of the graph.
