DataKitchen DataOps Documentation

Container Nodes

Nodes that run scripts and tools.

DataKitchen uses Docker containers to package code, runtime scripts, tools, and other settings so that a process executes efficiently and reliably without requiring custom infrastructure.

Container Nodes reference container images which become Docker containers at runtime, isolating a process from the environment to ensure that it runs the same way every time. DataKitchen provides a number of analytic container images that support any database, computing, programming, provisioning, data visualization, reporting, and business intelligence tools an organization requires.

Tool Support

Container nodes support essentially any tool in the analytics toolchain, including scripts (Python, Java, Shell, etc.) and GUI tools such as Jupyter notebooks and Tableau. Any DataKitchen container image available on Docker Hub may be used to support a tool in your toolchain.

DataKitchen Provides Many Standard I/O Connectors

For a list of DataKitchen-supported standard I/O connectors that do not require the use of container nodes, see the Data Sources & Sinks documentation.

Container Images

DataKitchen provides a number of analytic container images that you can use directly or as base images on which to build your own containers, adding tools and proprietary libraries. The General Purpose Container (GPC) has a standard configuration and several pre-installed tools to get users started quickly with their recipe building.

See General Purpose Container for more information. The GPC image, along with other DataKitchen container images, is available on Docker Hub.

DataKitchen Supports Alternative Container Image Hosts

Configure a connection to an alternative to Docker Hub by editing the Docker Registry URL field of a Container Node's notebook.json file. This is the dockerhub-url field in the raw JSON.

Example value
http://containerhosting.website.com:5000/
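
For example, a Container Node's notebook.json pointing at a private registry might include the dockerhub-url field alongside its usual image settings (a minimal sketch; the registry host is the illustrative value above):

{
    "dockerhub-url"       : "http://containerhosting.website.com:5000/",
    "dockerhub-namespace" : "datakitchen",
    "image-repo"          : "ac_python3_public_container",
    "image-tag"           : "latest"
}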

Container Node Source Images

The container used for a container node can be built directly from DataKitchen's analytic container or a closely related derivative. Alternatively, users can use their own custom-built images for container nodes. In that case, the node's notebook.json should include the following configuration.

"analytic-container": false,
"command-line": "[OPTIONAL COMMAND]"

Here, the "[OPTIONAL COMMAND]" may be something like:

"/bin/bash -c \"echo 'Hello world' > output.txt\""

Available Container Images

  • datakitchen/ac-base
    • Python 2.7 analytic container base image running Ubuntu 14.
    • Supports lists of parameters passed into container via config.json.
    • Supports running .py scripts located in a node's /docker-share directory.
    • The AC logger supports a subset of methods from the Python 2 logger.
      • Supports printing to logs via LOGGER.info().
      • Supports LOGGER.setLevel(); warning is the default level
        • Requires import logging in script to change default level
          • Change to DEBUG: LOGGER.setLevel(logging.DEBUG)
        • Alternate logging levels supported
          • LOGGER.debug()
          • LOGGER.warning()
          • LOGGER.error()
          • LOGGER.critical()
  • datakitchen/ac-base3
    • Python 3.4 analytic container base image running Ubuntu 14.
    • Supports lists of parameters passed into container via config.json.
    • Supports running .py scripts located in a node's /docker-share directory.
    • The AC logger supports a subset of methods from the Python 3 logger.
      • Supports printing to logs via LOGGER.info().
      • Supports LOGGER.setLevel(); warning is the default level
        • Requires import logging in script to change default level
          • Change to DEBUG: LOGGER.setLevel(logging.DEBUG)
        • Alternate logging levels supported
          • LOGGER.debug()
          • LOGGER.warning()
          • LOGGER.error()
          • LOGGER.critical()
  • datakitchen/ac_python3_public_container
    • Python 3.7 analytic container base image running Debian 9.6.
    • Supports passing parameters into containers via config.json as either a list or a dictionary, with a dictionary being the best practice.
    • Supports running .py scripts located in a node's /docker-share directory.
    • The AC logger supports a subset of methods from the Python 3 logger (a usage sketch follows this list).
      • Supports printing to logs via LOGGER.info()
      • Supports LOGGER.setLevel(); warning is the default level
        • Requires import logging in script to change default level
          • Change to DEBUG: LOGGER.setLevel(logging.DEBUG)
        • Alternate logging levels supported
          • LOGGER.debug()
          • LOGGER.warning()
          • LOGGER.error()
          • LOGGER.critical()
  • datakitchen/jasper_container
    • Analytic container for Internet-of-Things (IOT) applications.
    • Generates a report based on provided data.
    • Report settings and input/output file settings are available.
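
Scripts that run in the analytic containers above receive a pre-configured LOGGER object at runtime. The sketch below, assuming a .py script placed in the node's /docker-share directory, shows how the default level might be changed and messages emitted at each supported level:

import logging

if __name__ == '__main__':
    # LOGGER is provided by the analytic container at runtime;
    # logging is imported only for the standard level constants.
    LOGGER.setLevel(logging.DEBUG)  # warning is the default level

    LOGGER.debug("detailed diagnostic output")
    LOGGER.info("normal progress message")
    LOGGER.warning("something unexpected, but the run continues")
    LOGGER.error("a step failed")
    LOGGER.critical("the run cannot continue")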

Container base images are referenced via a Container Node's notebook.json file.

Container Files

Description.json

Like all nodes, container nodes require a node-level description.json file:

{
    "type": "DKNode_Container",
    "description": "[YOUR DESCRIPTION HERE]"
}

Notebook.json

Container nodes also require a node-level notebook.json file. This is where the configuration of the container itself is located:

{
    "image-repo"                 : "ac_process_info_container",
    "image-tag"                  : "latest",
    "dockerhub-namespace"        : "datakitchen",
    "dockerhub-username"         : "{{dockerhub.username}}",
    "dockerhub-password"         : "{{dockerhub.password}}",
    "analytic-container"         : true,
    "container-input-file-keys"  : [
      {
        "key"       : "inputfiles.some-input",
    	  "filename"  : "some_input_file.csv"
      }
    ],
    "container-output-file-keys" : [
      {
        "key"       : "outputfiles.some-output",
        "filename"  : "results.xml"
      }
    ],
    "assign-variables": [
      {
        "name"      : "rowcount",
        "file"      : "rowcount.txt"
      },
      {
        "name"      : "successcount",
        "file"      : "successcount.txt"
      },
      {
        "name"      : "failurecount",
        "file"      : "failurecount.txt"
      }
    ],
    "inside-container-file-mount"            : "/dk/ContainerWorkingDirectory",
    "inside-container-file-directory"        : "docker-share",
    "container-input-configuration-file-path": "the-node-name/docker-share",
    "container-input-configuration-file-name": "config.json",
    "container-output-log-file"              : "ac_logger.log",
    "container-output-progress-file"         : "progress.json",
    "delete-container-when-complete"	     : false
}

Docker-Share Directory

Every container node contains a /docker-share directory where the scripts the node runs are stored.

Docker-Share/config.json

Every container node contains a config.json file within its /docker-share directory.

{
    "apt-dependencies": [ ],
    "dependencies": [ ],
    "keys": {
        "run_script": {
            "script": "",
            "environment": {},
            "parameters": {},
            "export": [ ]
        }
    }
}
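
A filled-in config.json might look like the sketch below, where process.py, input_table, and rowcount are illustrative names: script names a .py file in the node's /docker-share directory, parameters are passed into that script, and export lists the variables the script sets for the node to pick up (see the Examples section):

{
    "apt-dependencies": [ ],
    "dependencies": [ ],
    "keys": {
        "run_script": {
            "script": "process.py",
            "environment": {},
            "parameters": {
                "input_table": "customers"
            },
            "export": [ "rowcount" ]
        }
    }
}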

Properties

Each field below is listed with a description and whether it is required or optional.

analytic-container

When true, the image is expected to be a DataKitchen analytic container or be based on one of the base containers provided by DataKitchen.

optional, default true

assign-variables

A list of associations between files inside the container and variables. Once the container finishes executing, each variable is loaded with the contents of its associated file from inside the container.
These variables can later be used in tests.

optional

command-line

The command to be executed by the container.

Only valid when analytic-container is false.

container-input-configuration-file-name

The name of the configuration file for the container.
Ignored when analytic-container is true.

optional, default config.json

container-input-configuration-file-path

The recipe-side path to the folder of files exchanged with the container, relative to the recipe root directory.
Ignored when analytic-container is true.

optional, default [[node-name]]/docker-share

container-input-file-keys

A list of mappings between data source keys and files to be placed inside the container once it is created.

optional

container-output-file-keys

A list of mappings between files inside the container and data sink keys. These files are retrieved from the container once it finishes executing and are sent to the data sinks.

optional

container-output-log-file

The name of the log file being generated by the container.
Ignored when analytic-container is true.

optional, default ac_logger.log

container-output-progress-file

The name of the progress json file being generated by the container.
Ignored when analytic-container is true.

optional, default progress.json

delete-container-when-complete

Determines whether the runtime container is deleted immediately after its processing has completed and files have been extracted. Even when containers are deleted, runtime variables remain available for downstream processing. If a value is not specified, the default is true.

Use this option in notebook.json to prevent resource-intensive containers from depleting disk space over time. Beware that deleting containers limits resources available for troubleshooting problems. See Container Resource Cleanup.

optional

dockerhub-namespace

Docker image namespace

required

dockerhub-password

Docker registry service password

required

dockerhub-url

The URL of the Docker registry from which images should be pulled.

optional

dockerhub-username

Docker registry service user name

required

image-repo

The name of the Docker image.

required

image-tag

The Docker image tag denoting the image version to be pulled. The default value is "latest".

optional

inside-container-file-directory

The name of the folder that will be used to exchange information between the container and the node. It's relative to inside-container-file-mount.
When analytic-container is false, this folder must be placed in the working directory.

optional, default docker-share

inside-container-file-mount

The working directory inside the container.
Ignored when analytic-container is true.

optional; the default is given by the container itself.

Container Node Input File Keys

To inject files from a data source into the container, use the following expression:

"container-input-file-keys": [
  {
    "key"       : "inputfiles.some-input",
    "filename"  : "some_input_file.csv"
  }
]

The key field expresses a reference to a key in a data source; the format is:

{
    "key": "[ data source name ].[ key name ]"
}

The file name is relative to the folder defined by the inside-container-file-directory field, with docker-share being the default folder name. For example, inputfiles.some-input refers to the key some-input in the data source named inputfiles.

Configuring an Arbitrary List of Input Files Using Wildcards

"container-input-file-keys": [
  {
    "key"       : "inputfiles.*",
    "filename"  : "*.csv"
  }
]

The * in the key field will be replaced by each key in the data source, and the * in filename by the key name. See the data sources section for more details about how to retrieve multiple files using wildcards in data sources.

Container Node Output File Keys

This is how files are exported from inside the container to a given data sink; it is similar to input file keys:

"container-output-file-keys" : [
  {
    "key"       : "outputfiles.some-output",
    "filename"  : "results.xml"
  }
]

Configuring an Arbitrary List of Output Files Using Wildcards

To export multiple files, or files without specific names, use wildcards.

"container-output-file-keys" : [
  {
    "filename"  : "*.xml"
    "key"       : "store-results.*",
  }
]

Runtime Output Variables

It is possible to feed runtime variables with the contents of files produced by the container.
These variables are available for tests and further use in subsequent nodes in the graph.
Optionally, files can be decoded as JSON; by default, they are read as plain text.

"assign-variables" : {
    "name": "variablename",
    "file": "output.json",
    "decode-json": true
}
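
On the container side, the script only needs to write the file that the variable is assigned from. A minimal sketch, assuming the file is written into the node's docker-share exchange folder as in the examples below:

import json

if __name__ == '__main__':
    # Write a JSON file whose contents are loaded into the runtime
    # variable configured in assign-variables (decode-json: true).
    with open('docker-share/output.json', 'w') as f:
        json.dump({"rows_processed": 42, "status": "ok"}, f)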

Examples

Example 1

description.json

{
    "type": "DKNode_Container",
    "description": "Runs python3 with secrets from global and kitchen vault."
}

notebook.json

{
    "dockerhub-username": "#{vault://dockerhub/username}",
    "dockerhub-namespace": "datakitchen",
    "dockerhub-password": "#{vault://dockerhub/password}",
    "image-repo": "ac_python3_public_container",
    "metadata": {
        "name": "python3_container"
    },
    "analytic-container": true,
    "tests": {
        "test_global_key": {
            "test-logic": "result_global == 'global_val'",
            "action": "stop-on-error",
            "type": "test-contents-as-string",
            "test-variable": "result_global",
            "keep-history": false
        },
        "test_kitchen_key": {
            "test-logic": "result_kitchen == 'kitchen_value'",
            "action": "stop-on-error",
            "type": "test-contents-as-string",
            "test-variable": "result_kitchen",
            "keep-history": false
        }
    }
}

config.json

{
    "dependencies": [],
    "keys": {
        "run-script": {
            "script": "test.py",
            "parameters": {
                "global_secret": "#{vault://global/key}",
                "kitchen_secret": "#{vault://kitchen/key}"
            },
            "export": [
                "result_global",
                "result_kitchen"
            ]
        }
    }
}

test.py

# LOGGER and the parameters global_secret and kitchen_secret are injected
# by the analytic container from the config.json parameters; the exported
# variables result_global and result_kitchen are read back by the node.
import os

if __name__ == '__main__':
    global result_global, result_kitchen

    LOGGER.info("global secret: " + global_secret)
    LOGGER.info("kitchen secret: " + kitchen_secret)

    result_global = global_secret
    result_kitchen = kitchen_secret


Example 2

description.json

{
    "type": "DKNode_Container",
    "description": "Runs the ac_python3_container2 image."
}

notebook.json

{
    "dockerhub-username": "#{vault://dockerhub/username}",
    "dockerhub-namespace": "datakitchen",
    "dockerhub-password": "#{vault://dockerhub/password}",
    "image-repo": "ac_python3_container2",
    "metadata": {
        "name": "python3_container"
    },
    "analytic-container": true,
    "tests": {
        "test-filecount": {
            "test-logic": {
                "test-compare": "equal-to",
                "test-metric": 10
            },
            "action": "stop-on-error",
            "type": "test-contents-as-integer",
            "test-variable": "result",
            "keep-history": false
        },
        "test-float": {
            "test-logic": {
                "test-compare": "equal-to",
                "test-metric": 1.234
            },
            "action": "stop-on-error",
            "type": "test-contents-as-float",
            "test-variable": "float_val",
            "keep-history": false
        }
    },
    "assign-variables": [
        {
            "name": "float_val",
            "file": "float.txt"
        }
    ]
}

config.json

{
    "dependencies": [],
    "keys": {
        "run-script": {
            "script": "test.py",
            "parameters": {
                "globalvar1": "value1",
                "dockerhub_username": "{{dockerhub_username}}"
            },
            "export": [
                "result"
            ]
        }
    }
}

float.txt

1.234

records.csv

row1
row2
row3
row4
row5
row6
row7
row8
row9
row10

test.py

# LOGGER and the parameters globalvar1 and dockerhub_username are injected
# by the analytic container from config.json; the exported variable result
# is read back by the node and checked by the test-filecount test.
import os

if __name__ == '__main__':
    global result

    LOGGER.info('Value of globalvar1:' + globalvar1)
    LOGGER.info("(should work) dockerhub username: " + dockerhub_username)
    LOGGER.info("(should not work) dockerhub password: " + "{{dockerhub_password}}")

    with open('docker-share/records.csv') as f:
        result = len(f.readlines())

Container Resource Cleanup

Over time, as the system executes order runs inside runtime containers and as container nodes spin up other runtime containers for processing, the cumulative overhead can deplete available disk space.

Customers managing their own resources should run clean-up tasks on a regular basis. Customers using DataKitchen's cloud-based or on-premise agent installations already have a cron script running the clean-up operation every two days. Customers can optimize the clean-up schedule if needed.

Delete Container Option

Alternatively, for resource-intensive containers, users can consider adding the delete-container-when-complete option to their notebook.json parameters. When enabled, this setting instructs the system to remove the runtime container used for executing the container node immediately after processing completes.
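
A minimal fragment showing the option alongside a node's other notebook.json fields (which are omitted here):

"delete-container-when-complete": true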

Considerations for deleting containers

  • The default value for delete-container-when-complete is "true" when the parameter is added with an empty value to notebook.json.
  • If the container node fails and the underlying runtime container has been deleted, it is more difficult to troubleshoot the problems.
  • You can set the value to "false" to maintain the containers for debugging if your recipe runs are problematic.
  • Deleting runtime containers does not affect the availability of runtime variables for any downstream processing that requires them.
