DataKitchen DataOps Documentation

General Purpose Container

DataKitchen configured a General Purpose Container (GPC) as a base container or reference design intended to simplify container builds.

  • The GPC allows users to start fast in DataOps by sourcing existing assets and performing quick data transformations within a pre-configured container. Users can leverage Docker containers without having any prior expertise.
  • The GPC is configured to handle many basic use cases, such as running Python or shell scripts, provisioning with Ansible, and generating data visualizations in Tableau.
  • The GPC has several pre-installed tools, such as pandas and numpy, that perform common data analysis actions or work with cloud computing services.
  • The GPC sets a standard structure for passing parameters and variables, installing dependencies, retrieving files, and logging without requiring users to write custom code.

GPC Availability

The General Purpose Container is publicly available as a container image on Docker Hub.

GPC Overview

Default Container

When you add a container node to a recipe graph, either in the user interface or via the command-line interface, the new node uses the GPC configuration. By default, your container builds from a container image with the following characteristics.

  • Uses Python3 exclusively.
  • Supports lists of parameters passed into the container via config.json.
  • Supports running scripts (.sh, .py, .ipynb) located in a node's docker-share directory.
  • Supports a subset of methods from the Python3 logger. See Logging.

GPC Container Image

The General Purpose Container is also publicly available as a container image on Docker Hub.

Pre-Installed Packages

You can avoid having to install many apt-get and Python packages, as the GPC is built with several tools.

See the full list of pre-installed packages. You can reference any of these apps in your scripts.

Additionally, the GPC configuration allows you to specify other apt-get and Python packages to be installed at order runtime, so you can iterate quickly without rebuilding containers.
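
For example, adding entries like these to a node's config.json (the package names are illustrative) installs an apt-get package and a pinned pip package at order runtime:

  "apt-dependencies": [
      "jq"
  ],
  "dependencies": [
      "openpyxl==3.0.3"
  ],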

GPC File Structure & Configuration

This basic tree shows the GPC node structure. This section describes each component in detail.

node_name
    description.json
    docker-share
        config.json
    notebook.json

description.json

The required description.json file defines the node type and node icon (if not the default Docker icon). It also contains an optional node description.

Example: minimal description.json
{
    "type": "DKNode_Container",
    "icon": ""
}

Example: description.json with icon and description
{
    "type": "DKNode_Container",
    "icon": "tableau",
    "description": "Installs a Python package to update and deploy a Tableau Notebook from dynamic data sources generated upstream in a graph."
}

docker-share directory

Store your scripts that will be executed in the container in this required directory. Scripts must also be referenced in docker-share/config.json.
Valid Scripts

  • *.sh (bash/shell scripts)
  • *.py (Python3 scripts)
  • *.ipynb (Jupyter Notebook scripts)

For other languages and tools, such as Ansible, Jasper, and Tableau, DataKitchen offers additional container images and will be extending the GPC to support more script types in the future.

docker-share/config.json

The required config.json file defines the container runtime execution. It identifies apps to be installed in the container, the scripts to be executed in the container, the variables to use in the scripts, and the values to export for testing.

  • apt-dependencies: The list of software packages to be installed in the container at runtime via apt-get (Ubuntu). This list can be empty.
  • dependencies: The list of software packages to be installed in the container via pip (Python). This list can be empty.
    • You can install newer versions that override the default Python packages here if you pin the versions properly, as in openpyxl==3.0.3.
    • Be careful that you do not install a dependency that conflicts with any of the pre-installed tools. If that happens, an order run log will record entries similar to the following examples.
      [NodeThread:gpc] AC: 2020-05-08 18:03:04,563 INFO : Installing runtime pip dependencies ...
      [NodeThread:gpc] AC: 2020-05-08 18:03:12,068 ERROR : Unable to install dependencies, error:
      [NodeThread:gpc] Collecting cryptography==2.4 (from -r runtime-requirements.txt (line 1))
      [NodeThread:gpc] subprocess.CalledProcessError: Command '['pip3', 'install', '-r', 'runtime-requirements.txt']' returned non-zero exit status 1.
      [NodeThread:gpc] NodeConsumer:Run: run notebook fail: gpc
      [VariationThread] Got an error in node gpc, stopping
      
  • keys: The dictionary of scripts to be executed in the container.
    • If there are multiple scripts, they run in the order listed in this config file. For example, an initial script could fix any dependency issues, then a subsequent script could perform a data transformation. (A two-script sketch follows the config.json examples below.)
  • run_script: The field containing the script to be executed in the container node along with its associated parameters and variables. DataKitchen's standard recommends naming this field for the script it is running.
    • You can include several of these fields in config.json, one for each script you want to run in the container node.
  • script: The name of the script (.sh, .py, .ipynb) to be executed against existing assets. This field is required and cannot be empty.
  • environment: The dictionary of environment variables to be set for shell scripts, Python scripts, and Jupyter Notebooks. This field is optional.
    • This environment field is used to assign variables and vault secrets to inject into shell scripts, since the parameters field is not supported for shell scripts. See the parameters field description below for sample field entries.
    • This field is not as important for Python scripts, as all variables can be assigned using parameters.

Scripts Fail on Secrets

Warning: Your script will fail if it includes vault expressions or Jinja variable expressions that resolve to vault secrets. Instead, add the secret to a parameter or environment variable. This validation is a security measure to protect your toolchain credentials.

  • parameters: The dictionary of parameters to be set in the container. This field is where you can define parameters and the vault paths for toolchain credentials or secrets that need to be injected in a Python script or Jupyter Notebook. Another common use of parameters is to reference runtime variables generated by upstream nodes in the recipe graph for use in executing the container node. This field is optional.
    • This field accepts three types of entries: a parameter name/value string, a vault secret, and a defined variable.
      "simple_param": "param_value",
       "vault_param": "#{vault://dockerhub/secret_name}",
       "jinja_param": "{{variable_defined_inrecipe}}",
      
    • Note that parameters do not apply to shell scripts. Use environment variables instead.

  • export: The list of variable names to export from the container for use in tests. When exported, these variables will be available for other nodes. This field is optional.

    • Note that the export field is not used in shell scripts.
    • The workaround for exporting values from shell script execution is to use echo commands in the script to write value outputs to a file in the docker-share directory, then assign a variable in notebook.json to the contents of that file. Finally, configure tests using that assigned variable. See Shell Script Exports for more information.
    • Any variable listed here should be declared as a global variable with a defined value within the Python script.

      Example: export variable in config
            "keys": {
                "python_script": {
                        "script": "basic.py",
                        "export": [
                                "success"
                        ]
                }
      
      Example: global variable in python script
        import os
        import sys
        import traceback
      
        global success
      
        ...etc.
      
Example: default config.json template
{
    "apt-dependencies": [ ],
    "dependencies": [ ],
    "keys": {
        "run_script": {
            "script": "",
            "environment": {},
            "parameters": {},
            "export": [ ]
        }
    }
}

Example: config.json for a Tableau script
{
    "apt-dependencies": [ ],
    "dependencies": [
        "tableauserverclient==0.11"
    ],
    "keys": {
        "tableau_script": {
            "script": "tableau_test.py",
            "environment": {},
            "parameters": {
                "PASSWORD": "#{vault://tableau/password}"
            },
            "export": [
                "success"
            ]
        }
    }
}
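
As noted above, when the keys dictionary lists multiple scripts, they run in the order given. A minimal sketch of that pattern (the script names are hypothetical):

{
    "apt-dependencies": [ ],
    "dependencies": [ ],
    "keys": {
        "fix_dependencies_script": {
            "script": "fix_dependencies.sh",
            "environment": {}
        },
        "transform_script": {
            "script": "transform_data.py",
            "parameters": {},
            "export": [
                "success"
            ]
        }
    }
}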

notebook.json

The required notebook.json file defines the container configuration.

  • image-repo: A string defining the Docker Hub image repository to use. The GPC default value is the GPC image repo.
  • dockerhub-namespace: A string defining the Docker Hub namespace to use. The GPC default value is the GPC namespace.
  • image-tag: A string defining the Docker Hub image tag to use. The GPC default value is the GPC tag.
  • dockerhub-username: A string defining the Docker Hub username, generally set in kitchen vault secrets.
  • dockerhub-password: A string defining the Docker Hub password, generally set in kitchen vault secrets.
  • analytic-container: A boolean which defaults to true for containers that incorporate the DataKitchen Interface Layer.
  • tests: A dictionary of tests to apply to container variables. This parameter is optional, but strongly recommended. DataKitchen recommends building tests on all graph nodes to ensure data integrity.
  • Example notebook.json options:
    • container-input-file-keys and container-output-file-keys provide a method of exporting a file containing transformed data to a data sink, rather than just exporting parameter values as defined in export fields of config.json.
    • delete-container-when-complete, when set to true, instructs the process to delete the runtime container post-execution for disk space cleanup. See Container Node Properties for more information.

Example: default notebook.json template
{
    "image-repo": "{{dockerhubConfig.image_repo.general_purpose}}",
    "dockerhub-username": "{{dockerhubConfig.username}}",
    "dockerhub-namespace": "{{dockerhubConfig.namespace.general_purpose}}",
    "dockerhub-password": "{{dockerhubConfig.password}}",
    "image-tag": "{{dockerhubConfig.image_tag.general_purpose}}",
    "analytic-container": "{{dockerhubConfig.ac.general_purpose}}",
    "tests": {}
}

Example: notebook.json with file keys, container cleanup, and tests
{
    "image-repo": "{{dockerhubConfig.image_repo.general_purpose}}",
    "image-tag": "{{dockerhubConfig.image_tag.general_purpose}}",
    "dockerhub-namespace": "{{dockerhubConfig.namespace.general_purpose}}",
    "dockerhub-username": "{{dockerhubConfig.username}}",
    "dockerhub-password": "{{dockerhubConfig.password}}",
    "container-input-file-keys": [
        {
            "filename": "{{global_superstore_orders_filename}}",
            "key": "s3_datasource.mapping1"
        }
    ],
    "container-output-file-keys": [
        {
            "filename": "{{global_superstore_orders_filename}}",
            "key": "s3_datasink.mapping1"
        }
    ],
    "delete-container-when-complete": false
    "tests": {
        "log_dockerhub_tool_instance": {
            "test-variable": "dockerhubConfig",
            "action": "log",
            "type": "test-contents-as-string",
            "test-logic": "dockerhubConfig",
            "keep-history": true,
            "description": "Logs the DockerHub tool instance."
        },
        "test_success": {
            "test-variable": "success",
            "action": "stop-on-error",
            "type": "test-contents-as-boolean",
            "test-logic": "success",
            "keep-history": true,
            "description": "Stops the OrderRun if success is False."
        }
    }
}

Vault Secrets in the GPC

The GPC allows the use of secrets only if injected by parameters (for Python scripts) or environment variables (for shell scripts or Python scripts) configured in config.json.

The following example of a partial config file shows how a password can be injected with a vault expression.

{
    "keys": {
        "tableau_script": {
            "script": "tableau_test.py",
            "parameters": {
                "PASSWORD": "#{vault://tableau/password}"
            },
            "export": [
                "success"
            ]
        }
    }
}

Do NOT Add Secrets to a Script!

If you place secrets in a script directly, use variables to resolve secrets, or place vault expressions in a script, that script will fail. The system will generate a warning message and ignore the input. This validation is a security measure to protect your toolchain credentials.

Logging

The GPC includes LOGGER, a global logging object which wraps Python3 logging. You can use LOGGER to record info and error messages directly into the searchable order run logs.

The logger supports a subset of methods.

  • Printing to logs using LOGGER.info().
  • LOGGER.setLevel(), where the default level is warning. Changing the default level requires import logging in the script.
    Example: change to DEBUG
    LOGGER.setLevel(logging.DEBUG)
    
  • Alternate logging levels
    LOGGER.debug()
    LOGGER.warning()
    LOGGER.error()
    LOGGER.critical()
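
Putting these together, a GPC Python script might log as follows (a minimal sketch; LOGGER is provided by the container, so the script does not define it):

import logging
import traceback

# Lower the default level (warning) so debug messages appear in the order run logs
LOGGER.setLevel(logging.DEBUG)

LOGGER.debug('Starting transformation step')
try:
    row_count = 100  # placeholder for real work
    LOGGER.info(f'Processed {row_count} rows')
except Exception:
    LOGGER.error(f'Transformation failed:\n{traceback.format_exc()}')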
    

GPC Resource Allocation

Users typically rely on the resources made available to a kitchen by an agent. The DataKitchen agents have sufficient memory and disk space allocated to run the majority of recipe orders.

For resource-intensive order runs, particularly those with large queries, heavy use of variables, or data sources and sinks, users can adjust memory and disk space allocations in several ways.

  • Resources for all Order Runs: On the Recipes page, select a recipe, click the Schedules tab, then add or edit a schedule. Enter resource values at the bottom of the Add Schedule or Edit Schedule dialog.
  • Resources for Scheduled Orders: View a variation and click the Schedule tab.
  • Resources for Container Node Executions: Open the Node Editor for a container and click on Container Settings in the Configuration tab.

Troubleshooting Resource Problems
Two issues may occur with resource allocation, particularly for container nodes, because these nodes can be configured to perform heavy data processing.

  • Allocated resources are greater than what is available
  • The resource allocation is insufficient to execute the order run

In the first case, when a user configures resource usage that exceeds the resources available for a single node, the system does not complete the schedule and generates a warning message about "not enough allocatable resources."

In the second case, an order run fails in an error state because it does not have enough memory or disk space to complete the processing. The user must use log entries and metrics to determine how much memory and disk space to add for the node.

  1. On the Order Run Details page, view the logs related to the failure.
    • Click the Logs icon in the order run toolbar to view log entries for the entire order run.
    • Click a failed node, and scroll to the Logs section of the Node Details sidebar.
  2. Search the logs for "order run terminated" errors.
  3. Click to toggle on the RAM and Disk usage metrics.

See Runtime Resources for more information.

GPC Usage Examples

Basic Example

This example demonstrates the use of variables, adding tests, and general configuration.

  • The config.json defines a Python script and a shell script to execute. (The script names must include file extensions, as in .py or .sh.)
  • The basic.py script defines "success" as a global variable for export that will be used in tests. It then checks that each environment variable and parameter you want to inject exists in the environment dictionary or globals and is defined. Finally, the script logs the values injected using Jinja expressions.
  • The basic.sh script checks that the environment variables are defined, then it prints the variable values. Finally, it writes a value to a file for use in testing.

Example: config.json
{
    "apt-dependencies": [ ],
    "dependencies": [ ],
    "keys": {
        "python_script": {
            "script": "basic.py",
            "environment": {
                "SIMPLE_ENV_VAR": "simple_env_var",
                "JINJA_ENV_VAR": "{{basic_example_node.RECIPE_VAR}}",
                "VAULT_ENV_VAR": "#{vault://vault/url}"
            },
            "parameters": {
                "SIMPLE_PARAM": "simple_param",
                "JINJA_PARAM": "{{basic_example_node.RECIPE_VAR}}",
                "VAULT_PARAM": "#{vault://vault/url}"
            },
            "export": [
                "success"
            ]
        },
        "shell_script": {
            "script": "basic.sh",
            "environment": {
                "SIMPLE_ENV_VAR": "simple_env_var",
                "JINJA_ENV_VAR": "{{basic_example_node.RECIPE_VAR}}",
                "VAULT_ENV_VAR": "#{vault://vault/url}"
            }
        }
    }
}

Example: basic.py
import os
import sys
import traceback

global success


# Validate environment variables
if 'SIMPLE_ENV_VAR' not in os.environ or not os.environ['SIMPLE_ENV_VAR']:
    LOGGER.error("Undefined SIMPLE_ENV_VAR")
    sys.exit(1)

if 'JINJA_ENV_VAR' not in os.environ or not os.environ['JINJA_ENV_VAR']:
    LOGGER.error("Undefined JINJA_ENV_VAR")
    sys.exit(1)

if 'VAULT_ENV_VAR' not in os.environ or not os.environ['VAULT_ENV_VAR']:
    LOGGER.error("Undefined VAULT_ENV_VAR")
    sys.exit(1)

# Validate parameters
if 'SIMPLE_PARAM' not in globals() or not SIMPLE_PARAM:
    LOGGER.error("Undefined SIMPLE_PARAM")
    sys.exit(1)

if 'JINJA_PARAM' not in globals() or not JINJA_PARAM:
    LOGGER.error("Undefined JINJA_PARAM")
    sys.exit(1)

if 'VAULT_PARAM' not in globals() or not VAULT_PARAM:
    LOGGER.error("Undefined VAULT_PARAM")
    sys.exit(1)

try:
    LOGGER.info(f'SIMPLE_ENV_VAR: {os.environ["SIMPLE_ENV_VAR"]}') 
    LOGGER.info(f'JINJA_ENV_VAR: {os.environ["JINJA_ENV_VAR"]}')
    LOGGER.info(f'VAULT_ENV_VAR: {os.environ["VAULT_ENV_VAR"]}')
    LOGGER.info(f'SIMPLE_PARAM: {SIMPLE_PARAM}')
    LOGGER.info(f'JINJA_PARAM: {JINJA_PARAM}')
    LOGGER.info(f'VAULT_PARAM: {VAULT_PARAM}')
    LOGGER.info('EMBEDDED JINJA: {{basic_example_node.RECIPE_VAR}}')
    success = True
except Exception as e:
    LOGGER.error(f'Failed to read and log variables:\n{traceback.format_exc()}')
    success = False

Example: basic.sh
#!/usr/bin/env bash

# Validate environment variables
if [ -z "$SIMPLE_ENV_VAR" ]; then
    echo "Undefined SIMPLE_ENV_VAR"
    exit 1
fi

if [ -z "$JINJA_ENV_VAR" ]; then
    echo "Undefined JINJA_ENV_VAR"
    exit 1
fi

if [ -z "$VAULT_ENV_VAR" ]; then
    echo "Undefined VAULT_ENV_VAR"
    exit 1
fi

# Print variables
echo "SIMPLE_ENV_VAR: $SIMPLE_ENV_VAR"
echo "JINJA_ENV_VAR: $JINJA_ENV_VAR"
echo "VAULT_ENV_VAR: $VAULT_ENV_VAR"
echo "EMBEDDED JINJA: {{basic_example_node.RECIPE_VAR}}"

# Write to a file to demonstrate file export and testing
echo -n  "shell_script_output_value" > "docker-share/shell_script_output.txt"

Example: notebook.json
{
    "image-repo": "{{dockerhubConfig.image_repo.general_purpose}}",
    "image-tag": "{{dockerhubConfig.image_tag.general_purpose}}",
    "dockerhub-namespace": "{{dockerhubConfig.namespace.general_purpose}}",
    "dockerhub-username": "{{dockerhubConfig.username}}",
    "dockerhub-password": "{{dockerhubConfig.password}}",
    "assign-variables": [
        {
            "name": "shell_script_output",
            "file": "shell_script_output.txt"
        }
    ],
    "tests": {
        "log_dockerhub_tool_instance": {
            "test-variable": "dockerhubConfig",
            "action": "log",
            "type": "test-contents-as-string",
            "test-logic": "dockerhubConfig",
            "keep-history": true,
            "description": "Logs the DockerHub tool instance."
        },
        "test_success": {
            "test-variable": "success",
            "action": "stop-on-error",
            "type": "test-contents-as-boolean",
            "test-logic": "success",
            "keep-history": true,
            "description": "Stops the OrderRun if success is False."
        },
        "test_shell_script_output": {
            "test-variable": "shell_script_output",
            "action": "stop-on-error",
            "type": "test-contents-as-string",
            "test-logic": "shell_script_output == 'shell_script_output_value'",
            "keep-history": true,
            "description": "Stops the OrderRun if the shell script output is unexpected."
        }
    }
}
A tour of the Web App configuration for the basic example.

Python Dependency Example

This use of the GPC adds a Python dependency to the container in order to interact with an external tool, such as Tableau.

  • The config.json defines a Python dependency as well as a vault value for accessing Tableau. The tableauserverclient package defined in config.json is for example purposes only; this version of the package is actually pre-installed in the GPC.
  • The tableau_test.py script checks that the password parameter you want to inject exists in the globals array and is defined. This script also has some Jinja variables embedded directly.

Example: config.json
{
    "apt-dependencies": [ ],
    "dependencies": [
        "tableauserverclient==0.11"
    ],
    "keys": {
        "tableau_script": {
            "script": "tableau_test.py",
            "environment": {},
            "parameters": {
                "PASSWORD": "#{vault://tableau/password}"
            },
            "export": [
                "success"
            ]
        }
    }
}

Example: tableau_test.py
import os
import sys
import traceback

import tableauserverclient as TSC

global success


# Validate parameters
if 'PASSWORD' not in globals() or not PASSWORD:
    LOGGER.error("Undefined PASSWORD")
    sys.exit(1)

try:
    # create an auth object
    tableau_auth = TSC.TableauAuth('{{tableauConfig.username}}', PASSWORD, '{{tableauConfig.content_url}}')
    
    # create an instance for your server
    server = TSC.Server('{{tableauConfig.url}}')
    
    # sign in to the tableau server
    server.auth.sign_in(tableau_auth)

    LOGGER.info(f'Tableau Server Version: {server.version}')
    success = True
except Exception as e:
    LOGGER.error(f'Failed to connect to Tableau:\n{traceback.format_exc()}')
    success = False

Example: notebook.json
{
    "image-repo": "{{dockerhubConfig.image_repo.general_purpose}}",
    "image-tag": "{{dockerhubConfig.image_tag.general_purpose}}",
    "dockerhub-namespace": "{{dockerhubConfig.namespace.general_purpose}}",
    "dockerhub-username": "{{dockerhubConfig.username}}",
    "dockerhub-password": "{{dockerhubConfig.password}}",
    "tests": {
        "log_dockerhub_tool_instance": {
            "test-variable": "dockerhubConfig",
            "action": "log",
            "type": "test-contents-as-string",
            "test-logic": "dockerhubConfig",
            "keep-history": true,
            "description": "Logs the DockerHub tool instance."
        },
        "test_success": {
            "test-variable": "success",
            "action": "stop-on-error",
            "type": "test-contents-as-boolean",
            "test-logic": "success",
            "keep-history": true,
            "description": "Stops the OrderRun if success is False."
        }
    }
}
A tour of the Web App configuration for the Python dependency example.

Shared Resources Example

This use of the GPC demonstrates how to use a shared resource from a recipe's resources directory. The files in that directory are available to all nodes within a recipe.

The only files that are automatically injected into a container node are those that exist inside the node file structure, such as scripts in the docker-share directory. Files intended to be shared across nodes must be placed in the resources directory and, therefore, must be explicitly loaded into a node. By adding a dictionary data source to the container node, you can access a shared resource at runtime.

File Structure

    description.json
    resources
        README.txt
        python_scripts
            resources_basic.py
    shared_resources_example
        data_sources
            dict_datasource.json
        description.json
        docker-share
            config.json
        notebook.json
    variables.json
    variations.json

File Descriptions

  • The config.json defines which file you want to use in the script field. The dict_datasource.json actually determines where the file will go.
  • The resources_basic.py file, stored in the python_scripts subdirectory of the recipe's resources directory, is the shared file that this example calls.
  • The dict_datasource.json defines your source connection as a dictionary data source and identifies a bucket-name, which is helpful if sharing a resource across nodes. It uses a Jinja expression to define the file load. The default destination within the container is the docker-share directory, so the target path is not specified.
    • Using the Node Editor in the DataKitchen UI, you would define the same values in the Source Connections section of the Connections tab and in the Source > JSON Value and Container > Target File Path fields of the Inputs tab.

Example: config.json
{
    "apt-dependencies": [ ],
    "dependencies": [ ],
    "keys": {
        "python_script": {
            "script": "resources_basic.py",
            "environment": {
                "SIMPLE_ENV_VAR": "simple_env_var",
                "JINJA_ENV_VAR": "{{basic_example_node.RECIPE_VAR}}",
                "VAULT_ENV_VAR": "#{vault://vault/url}"
            },
            "parameters": {
                "SIMPLE_PARAM": "simple_param",
                "JINJA_PARAM": "{{basic_example_node.RECIPE_VAR}}",
                "VAULT_PARAM": "#{vault://vault/url}"
            },
            "export": [
                "success"
            ]
        }
    }
}

Example: resources_basic.py
import os
import sys
import traceback

global success


# Validate environment variables
if 'SIMPLE_ENV_VAR' not in os.environ or not os.environ['SIMPLE_ENV_VAR']:
    LOGGER.error("Undefined SIMPLE_ENV_VAR")
    sys.exit(1)

if 'JINJA_ENV_VAR' not in os.environ or not os.environ['JINJA_ENV_VAR']:
    LOGGER.error("Undefined JINJA_ENV_VAR")
    sys.exit(1)

if 'VAULT_ENV_VAR' not in os.environ or not os.environ['VAULT_ENV_VAR']:
    LOGGER.error("Undefined VAULT_ENV_VAR")
    sys.exit(1)

# Validate parameters
if 'SIMPLE_PARAM' not in globals() or not SIMPLE_PARAM:
    LOGGER.error("Undefined SIMPLE_PARAM")
    sys.exit(1)

if 'JINJA_PARAM' not in globals() or not JINJA_PARAM:
    LOGGER.error("Undefined JINJA_PARAM")
    sys.exit(1)

if 'VAULT_PARAM' not in globals() or not VAULT_PARAM:
    LOGGER.error("Undefined VAULT_PARAM")
    sys.exit(1)

try:
    LOGGER.info(f'SIMPLE_ENV_VAR: {os.environ["SIMPLE_ENV_VAR"]}')
    LOGGER.info(f'JINJA_ENV_VAR: {os.environ["JINJA_ENV_VAR"]}')
    LOGGER.info(f'VAULT_ENV_VAR: {os.environ["VAULT_ENV_VAR"]}')
    LOGGER.info(f'SIMPLE_PARAM: {SIMPLE_PARAM}')
    LOGGER.info(f'JINJA_PARAM: {JINJA_PARAM}')
    LOGGER.info(f'VAULT_PARAM: {VAULT_PARAM}')
    LOGGER.info('EMBEDDED JINJA: {{basic_example_node.RECIPE_VAR}}')
    success = True
except Exception as e:
    LOGGER.error(f'Failed to read and log variables:\n{traceback.format_exc()}')
    success = False

Example: notebook.json
{
    "image-repo": "{{dockerhubConfig.image_repo.general_purpose}}",
    "image-tag": "{{dockerhubConfig.image_tag.general_purpose}}",
    "dockerhub-namespace": "{{dockerhubConfig.namespace.general_purpose}}",
    "dockerhub-username": "{{dockerhubConfig.username}}",
    "dockerhub-password": "{{dockerhubConfig.password}}",
    "container-input-file-keys": [
        {
            "key": "dict_datasource.resources_basic",
            "filename": "resources_basic.py"
        }
    ],
    "tests": {
        "log_dockerhub_tool_instance": {
            "description": "Logs the DockerHub tool instance.",
            "action": "log",
            "test-variable": "dockerhubConfig",
            "type": "test-contents-as-string",
            "test-logic": "dockerhubConfig",
            "keep-history": true
        },
        "test-success": {
            "description": "Stops the OrderRun if success is False.",
            "action": "stop-on-error",
            "test-variable": "success",
            "type": "test-contents-as-boolean",
            "test-logic": "success",
            "keep-history": true
        }
    }
}

Example: dict_datasource.json
{
    "type": "DKDataSource_Dictionary",
    "name": "dict_datasource",
    "bucket-name": "shared-data",
    "keys": {
        "resources_basic": "{{load_text('python_scripts/resources_basic.py')}}"
    }
}

Source-Sink Example

This use of the GPC shows an active source and sink connected to the container. You can read in files from any available data source, load them into a container, transform them using a script, then export them to any data sink.

In this example, the process imports a CSV file from an S3 bucket into the container, modifies the contents of the file, tests the row count to ensure it has not changed, and exports it back to S3.

  • The config.json defines a shell script, which will make a modification to a file.
  • The s3_datasource.json and s3_datasink.json identify the input and output connections for the container as S3 buckets, using secrets defined in the Kitchen Overrides and vault. They use Jinja expressions with variables defined in the recipe's variables.json for filename and path values in the mappings. The data source mapping sets a row_count runtime variable for use in testing.
    • Using the Node Editor in the DataKitchen UI, you would define the same values in the Source Connections section of the Connections tab, in the Source > JSON Value and Container > Target File Path fields of the Inputs tab, and in the Container > Container File Path and Sink > Target File Path fields of the Outputs tab.
  • The transform_data.sh executes a simple search-and-replace function on every line in the loaded file to append the filename at the end.
  • The notebook.json includes the input and output file keys that map the data source and data sink mappings to files inside the container.

Example: config.json
{
    "apt-dependencies": [ ],
    "dependencies": [ ],
    "keys": {
        "run_shell_script": {
            "script": "transform_data.sh",
            "parameters": {},
            "environment": {}
        }
    }
}

Example: s3_datasource.json
{
    "name": "s3_datasource",
    "type": "DKDataSource_S3",
    "config": {
        "access-key": "{{s3Config.access_key}}",
        "secret-key": "{{s3Config.secret_key}}",
        "bucket": "{{s3Config.bucket}}"
    },
    "keys": {
        "mapping1": {
            "file-key": "{{source_sink_example_node.source_filepath}}",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "row_count": "source_row_count"
            }
        }
    },
    "tests": {
        "test_source_row_count": {
            "action": "stop-on-error",
            "test-variable": "source_row_count",
            "type": "test-contents-as-integer",
            "test-logic": {
                "test-compare": "greater-than",
                "test-metric": 99
            }
        }
    }
}

Example: s3_datasink.json
{
    "name": "s3_datasink",
    "type": "DKDataSink_S3",
    "config": {
        "access-key": "{{s3Config.access_key}}",
        "secret-key": "{{s3Config.secret_key}}",
        "bucket": "{{s3Config.bucket}}"
    },
    "keys": {
        "mapping1": {
            "file-key": "{{source_sink_example_node.sink_filepath}}",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "row_count": "sink_row_count"
            }
        }
    },
    "tests": {
        "test_source_sink_rowcounts_match": {
            "action": "stop-on-error",
            "test-variable": "sink_row_count",
            "type": "test-contents-as-integer",
            "test-logic": "sink_row_count == {{source_row_count}}"
        }
    }
}

Example: transform_data.sh
#!/bin/bash

# Navigate to docker-share directory where source files were added
cd ./docker-share
echo "The files injected into $(pwd):"
ls

# Append column to CSV file containing filename
sed -i "s/$/,{{global_superstore_orders_filename}}/" {{global_superstore_orders_filename}}

Example: notebook.json
{
    "image-repo": "{{dockerhubConfig.image_repo.general_purpose}}",
    "image-tag": "{{dockerhubConfig.image_tag.general_purpose}}",
    "dockerhub-namespace": "{{dockerhubConfig.namespace.general_purpose}}",
    "dockerhub-username": "{{dockerhubConfig.username}}",
    "dockerhub-password": "{{dockerhubConfig.password}}",
    "container-input-file-keys": [
        {
            "filename": "{{global_superstore_orders_filename}}",
            "key": "s3_datasource.mapping1"
        }
    ],
    "container-output-file-keys": [
        {
            "filename": "{{global_superstore_orders_filename}}",
            "key": "s3_datasink.mapping1"
        }
    ]
}
