DataKitchen DataOps Documentation

Data Sources & Data Sinks

Get and put data to and from recipe nodes with DataKitchen input/output connectors.

Structure

Data Sources and Data Sinks are node I/O components that connect to toolchain infrastructure in order to get data into and push data out of recipe nodes. For tools with DataKitchen data source and sink support, users have the option of connecting to the tool via a container node or via a data source/sink. The latter provides a tighter coupling to the DataKitchen framework, empowering less technical users.

Data sources and sinks execute a series of substeps called Keys. Keys may be explicitly expressed within recipe configuration, generated pre-runtime by use of Jinja processing and built-in functions, or generated at runtime via wildcards. A number of built-in runtime variables may be declared in data sources and sinks for use in automated tests.

Multiple data sources and/or data sinks may be configured for any given node, though their use is altogether optional for certain node types. Each individual data source or sink is configured via a single JSON configuration file. Data source configurations generally reside within each node's /data_sources directory. Data sink configurations for each node reside in the /data_sinks directory. For action nodes, data source configurations reside within an /actions node directory.
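For orientation, an illustrative node layout (the recipe, node, and file names here are hypothetical) might look like this:

    my_recipe/
        process_data_node/
            data_sources/
                sftp_datasource.json
            data_sinks/
                s3_datasink.json
            notebook.json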

Categories

Data source and data sink connections are classified as either file-based or database type. The sole exception to this rule is the OrderRun Information data source, which falls under a system category.

Order of Processing

Sources & Sinks

Within a given node, data sources are processed first, one-by-one, followed by the node's notebook.json, and finally, the node's data sinks, one-by-one. See file evaluation sequence documentation for additional details.

If multiple data sources are configured for a given node, they will process serially, in indeterminate order. The same applies to data sinks.

Recipe Graph Design Consideration

The processing order of data sources and sinks is an important consideration when deciding whether multiple sources or sinks should be included in a single node or broken out into multiple nodes.

Keys

Keys are executed in the order in which they are defined within any given data source or sink. For example, if three keys (A, B, C) are defined in that order, key A will process, then key B, and finally key C. If key B fails, key A will have processed successfully, but key B will generate a run error, meaning key C will not be executed.
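As a minimal sketch (the key and file names are hypothetical), three explicitly defined keys in an SFTP data source execute top to bottom:

{
    "type": "DKDataSource_SFTP",
    "name": "sftp_datasource",
    "config": {{sftpconfig}},
    "keys": {
        "key-A": {
            "file-key": "first.csv",
            "use-only-file-key": true
        },
        "key-B": {
            "file-key": "second.csv",
            "use-only-file-key": true
        },
        "key-C": {
            "file-key": "third.csv",
            "use-only-file-key": true
        }
    },
    "tests": {}
}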

Tests

Data source and sink tests are evaluated in the order in which they are defined within any given data source or sink.

Supported Connectors

DataKitchen supports the following data sources and data sinks.

DataKitchen is Toolchain Agnostic

Does your organization use a tool not supported by one of the default DataKitchen Data Sources and Sinks (I/O connectors) shown below? DataKitchen still supports your tool! In these cases, use a Container Node.

Connection | Category   | Data Source? | Data Sink? | Wildcard
           | File-based | Supported    | Supported  | Supported
           | Database   | Supported    | Supported  | -
           | Database   | Supported    | Supported  | -
           | System     | Supported    | Supported  | -
           | File-based | Supported    | Supported  | Supported
           | File-based | Supported    | Supported  | Supported
           | Database   | Supported    | Supported  | -
           | Database   | Supported    | Supported  | -
           | Database   | Supported    | Supported  | -
           | System     | Supported    | -          | -
           | Database   | Supported    | Supported  | -
           | File-based | Supported    | Supported  | Supported
           | Database   | Supported    | -          | -
           | File-based | Supported    | Supported  | Supported
           | Database   | Supported    | Supported  | -
           | Database   | Supported    | Supported  | -

Base Configuration

The data source or sink type is defined by the type field of its JSON configuration file.

Data sources and sinks should be named according to DataKitchen's object-naming best practices.

Use Supported Characters When Naming Data Sources

DataKitchen object names support alpha-numeric and underscore characters and are case-sensitive.

Match name and filename

The name field in the data source or sink configuration file should match the filename minus the .json extension.

Field Name
Scope
Field Type
Required?
Description

name

Source, Sink

String

Required

The name of the Data Source or Data Sink, as defined by its filename minus the .json extension. The value for name should follow DataKitchen object-naming syntax guidelines.

type

Source, Sink

String

Required

Denotes the type of Data Source or Data Sink.

{
    "type": "< DATA SOURCE/SINK TYPE CODE >",
    "name": "my_data_source_or_sink",
    "config": {{config}},
    "keys": {},
    "tests": {}
}

File-Based Wildcards

File-based data sources and sinks allow users to pick/push an arbitrary number of files by supporting a wildcard feature using Glob syntax.

Any given file-based data source or sink may include both explicit and wildcard-generated keys, with the wildcard and wildcard-key-prefix configured at the source or sink-level. Only one wildcard may be used for any given data source or sink.

	"wildcard_glob_by_file_type": "*.txt", 
	"wildcard_glob_by_datetime_variable": "{{my_datetime_var}}_*.csv"

The name(s) of the keys generated at runtime by use of a wildcard may be viewed as part of an order run's details and optionally set as a runtime variable list using key_names.

Wildcard-generated key names as seen from the order run details page.
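For instance, a sketch of an FTP source (the variable names are illustrative) that exposes its wildcard-generated key names and their count as runtime variables:

{
    "type": "DKDataSource_FTP",
    "name": "ftp_datasource",
    "config": {{ftpconfig}},
    "wildcard": "*.csv",
    "set-runtime-vars": {
        "key_names": "picked_key_names",
        "key_count": "picked_key_count"
    },
    "keys": {},
    "tests": {}
}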

Data Sources & Sinks in Container Nodes

For container nodes, data sources retrieve data and place it inside the container via input files. Data sinks move data out of the container via output files.

Configuration for input and output files resides outside of data source and data sink configuration, within a container node's notebook.json file.
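As an illustrative sketch only (the field and file names here are assumptions; see the container node documentation for the authoritative notebook.json schema), input and output files are typically mapped to data source and data sink keys along these lines:

{
    "container-input-file-keys": [
        {
            "key": "ftp_datasource.example-explicit-key",
            "filename": "example.csv"
        }
    ],
    "container-output-file-keys": [
        {
            "key": "s3_datasink.result-key",
            "filename": "result.csv"
        }
    ]
}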

Connection Properties

The configuration fields required for connections to data sources and data sinks are specific to the source/sink type. These connection configurations can be expressed line-by-line in an expanded syntax or in a condensed syntax via the use of a configuration dictionary.

Expanded Syntax

{
    "type" : "DKDataSource_SFTP",
    "name" : "sftp_datasource",  
    "hostname" : "{{sftpconfg.hostname}}",
    "username" : "{{sftpconfg.username}}",
    "password" : "{{sftpconfg.password}}",
    "keys" : {},
    "tests": {}
}

Condensed Syntax

For all data source and data sink types, the condensed syntax is declared via a config field.

Use Source View for Condensed Syntax

The condensed syntax for connection properties for data sources and sinks is not yet supported by UI forms. Use the source view for these files.

{
    "type" : "DKDataSource_SFTP",
    "name" : "sftp_datasource", 
    "config" : {{sftpconfig}},
    "keys" : {},
    "tests": {}
}

Check Your Syntax

Do not use quotes for your condensed connection configuration variables.

Define at Kitchen Level

In support of clean deployments across segregated kitchen virtual environments, data source and data sink connection configurations should be defined at the kitchen level via recipeoverrides. These overrides may then be set equal to secrets in a kitchen vault.

Deploy With Velocity & Quality

By declaring data source and sink connection properties at the kitchen level, a recipe can be developed and run in a development kitchen pointed at DEV infrastructure, then seamlessly deployed to a production kitchen pointed at PROD infrastructure, whilst maintaining strictly segregated environments.

{
    "recipeoverrides": {
        "sftpConfig": {
            "hostname": "#{vault://sftp/hostname}",
            "username": "#{vault://sftp/username}",
            "password": "#{vault://sftp/password}"
        }
    }
}

Optional Properties

File-Based Properties

These properties may be used for Azure Blob Storage, Azure Data Lake Storage, FTP, GCS, S3, and SFTP data sources and sinks (unless otherwise noted).

Field Name
Scope
Field Type
Description

cache

Source/Key

Boolean

Optional. Caches the data of a given File-based Data Source. Other downstream Data Sources of the same type may leverage cached data by referencing upstream Key names. Using the cache feature prevents repetitive file retrievals, but does not presently support OrderRun resume.

decrypt-key

Key (Sources)

String

Optional. The key used to decrypt a file.

decrypt-passphrase

Key (Sources)

String

Optional. The key passphrase used to decrypt a file.

encrypt-key

Key (Sinks)

String

Optional. Specifies a key used to encrypt the file.

file-key

Key

String

Optional. Denotes the file being picked/pushed for an explicit Key; specifies either the file name or the full path of the file. By default, the path is built as key/file-key. See use-only-file-key.

reconnect-interval

Source/Sink

Integer

Only SFTP.

retry-count

Source/Sink

Integer

Only SFTP.

set-runtime-vars

Source/Sink, Key

Dictionary

Used to declare runtime variables set equal to built-in variables. Can be applied at the Source/Sink level or the Key level.

use-only-file-key

Key

Boolean

Optional, with a default value of false. When set to true, only the value of file-key is used as the path/file spec to reference a file; the Key name is ignored. Applied at the Source/Sink level when using wildcards. For explicitly specified Keys, applied at the Key level for each Key.

wildcard

Source, Sink

String

Optional. Specifies a glob wildcard expression to pick/push a set of arbitrary files that match the declared wildcard expression. Wildcards apply only to a single directory; they are not recursive. Use multiple Data Sources or Data Sinks if pulling or pushing files across multiple directories.

wildcard-key-prefix

Source, Sink

String

Optional. Specifies the path prefix for the given wildcard expression.
When using wildcard mappings in a DataMapper node, it specifies the prefix of the base path where files will be stored.
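Pulling several of the optional properties above together, a hedged sketch of an FTP source (paths and file names are illustrative) that caches its data, picks every matching file under a prefix via a wildcard, and also picks one explicit file:

{
    "type": "DKDataSource_FTP",
    "name": "ftp_datasource",
    "config": {{ftpconfig}},
    "cache": true,
    "wildcard": "*.txt",
    "wildcard-key-prefix": "exports/",
    "keys": {
        "summary-file": {
            "file-key": "exports/summary.csv",
            "use-only-file-key": true
        }
    },
    "tests": {}
}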

Database Properties

These properties may be used for BigQuery, DB2SQL, MSSQL, MySQL, PostgreSQL, Salesforce, Snowflake, and Teradata data sources and sinks in action, datamapper, and container nodes (unless otherwise noted).

Field Name
Scope
Field Type
Description

col-delimiter

Key

String

Optional. When using CSV format, specifies the column delimiter. Use C-style escaping for tab, newline, and carriage return characters.

first-row-column-names

Key

Boolean

Optional. When using CSV format as input, if true, the first row is used as column names, which can be either ignored or used as field names depending on the case.

format

Key

String

Optional. Defaults to binary.

For Data Sources, specifies the output format for data picked from a SQL database. Possible values:
csv: Exports the data as CSV.
binary: Exports the data in an internal binary format. Useful when moving data from SQL Data Sources to SQL Data Sinks.
json: Exports the data as JSON.

For Data Sinks, specifies the format of the data being inserted. Valid options are csv (CSV file) and binary (binary representation of the data).

The binary format preserves types better and is therefore recommended when using DataMapper nodes to move data between different databases.

infer-types

Key

Boolean

Attempts to infer data types from the input data.

insert-column-names

Key

Boolean

Puts the column names in the output file.

insert-template

Key

String

Specifies the DML template used to insert data.

query-type

Key

String

Required. Specifies the type of operation to be performed by a Key:
execute_query: Executes a query and exports the resulting rowset in the given format.
execute_row: Executes a query that returns a single row.
execute_scalar: Executes a query that returns a scalar value, such as sum() or count().
execute_non_query: Intended for DML/DDL statements.
execute_dml: Uses an insert template string to build the insert statement.
bulk_insert: Performs a bulk insert from the input data.

resource-file

Key

String

The path of a file in the resources/ folder containing the SQL statements to execute. When this field is set, also set "target-field": "sql" and do not set the sql field.

DEPRECATION WARNING: This field is deprecated. Use "sql": "{{load_text(...)}}" instead.

row-delimiter

Key

String

Optional. Defaults to DOS newline. When using CSV format, specifies the row delimiter. Use C-style escaping for tab, newline, and carriage return characters.

set-runtime-vars

Source/Sink, Key

Dictionary

Used to declare runtime variables set equal to built-in variables. Can be applied at the Source/Sink level or the Key level.

sql

Key

String

The SQL/DML/DDL statement to be executed. Compiled at the start of data source execution, so it cannot reference variables created during processing. Supported only for database sources.
Not required if resource-file is specified. See Data Source SQL Fields.

sql-file

Key

String

Specifies a resource file containing SQL that can reference runtime variables generated in previous steps within the node. Compiled at the time of step execution, so it can refer to variables created in upstream steps. Supported only for database sources. See Data Source SQL Fields.

table-ddl

Key

String

Specifies the DDL sentences for creating the table.

table-name

Key

String

Specifies the name of the table where the data will be inserted.

target-field

Key

String

Used only when resource-file is specified; its value must be "sql". Legacy configuration; still supported but no longer used extensively in favor of more concise options.
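As a sketch under stated assumptions (the type code, connection config, table, and variable names are illustrative), a database source key that exports a query result as CSV and captures its row count might look like:

{
    "type": "DKDataSource_PostgreSQL",
    "name": "postgresql_datasource",
    "config": {{postgresqlconfig}},
    "keys": {
        "export-customers": {
            "query-type": "execute_query",
            "sql": "SELECT * FROM customers",
            "format": "csv",
            "col-delimiter": ",",
            "insert-column-names": true,
            "set-runtime-vars": {
                "row_count": "customer_row_count"
            }
        }
    },
    "tests": {}
}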

Built-In Runtime Variables

Declaration

Data sources and sinks provide a number of optional built-in runtime variables as a result of their execution, which can be leveraged in automated tests for quality assurance. Runtime variable availability is specific to the category of data source or sink.

Built-in runtime variables are explicitly declared in data sources and data sinks via the set-runtime-vars property.

Field Name
Scope
Field Type
Description

set-runtime-vars

Source/Sink, Key

Dictionary

Sets runtime variables.
Each Data Source or Data Sink may provide different runtime variable options.

The example below is for a file-based FTP data source. Here, built-in variables are declared both at the Source level for files picked via wildcard and at the Key level for the file picked via an explicit Key.

{
    "type": "DKDataSource_FTP",
    "name": "ftp_datasource",
    "config": {{ftpconfig}},
    "set-runtime-vars": {
        "key_count": "count_ftp_files"
    }, 
    "wildcard": "*",
    "keys": {
        "example-explicit-key": {
            "file-key": "example.csv",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "row_count": "ftp_example_row_count"
            }
        }
    },
    "tests": {
        "test-count-files-pulled": { 
            "test-variable": "count_ftp_files",
            "action": "stop-on-error",
            "test-logic": "count_ftp_files = 1 or count_ftp_files > 1", 
            "keep-history": true
        }, 
        "test-row-count":{
            "test-variable": "ftp_example_row_count",
            "action": "stop-on-error",
            "test-logic": "ftp_example_row_count > 100",
            "keep-history": true
        }
    }
}

Do #DataOps With Automated Testing

Declare tests like those seen in the example above to build confidence in the quality of your recipes' output.

File-Based Variables

These variables may be used with Azure Blob Storage, Azure Data Lake Storage, FTP, GCS, S3, and SFTP data sources and sinks.

Built-in Runtime Variable
Scope
Description

key_count

Source/Sink

Exports the total count of Keys in a Data Source or Data Sink, regardless of whether a Key is wildcard-generated or explicitly configured by a user.

key_files

Source/Sink

Exports a list of file names with paths for all the files that match the wildcard.

key_names

Source/Sink

Exports a list of key names generated by the wildcard.

key_map

Source/Sink

Exports all keys in the data source in the form of a dictionary of key: value pairs.
If a data source/sink uses wildcards, it is populated with the keys resolved after the wildcard is processed.

row_count

Key

Exports the line count of any kind of text file (txt, csv, json, etc.) associated to a key.

Note: If a header row is present, it is included in the row count value.

size

Key

Exports the size of a file associated to a key, in bytes.

md5

Key

Exports the MD5 hash of a file associated to a key.

sha

Key

Exports the SHA hash of a file associated to a key.

Source/sink-level built-in runtime variables for file-based data sources & data sinks.

Key-level built-in runtime variables for file-based data sources & data sinks.

Database Variables

These variables may be used with BigQuery, DB2SQL, MSSQL, MySQL, PostgreSQL, Salesforce, Snowflake, and Teradata data sources and sinks.

Built-in Runtime Variable
Scope
Description

key_count

Source/Sink

The count of all keys declared in a given Data Source or Data Sink.

key_names

Source/Sink

Exports a list of key names of the data source.

key_map

Source/Sink

Exports all keys in the data source in the form of a dictionary of key: value pairs.

column_count

Key

The count of columns for data associated to a Key. Only available when query-type is either execute_query or execute_row.

result

Key

The result of query-type execute_scalar. Not available for the Salesforce Data Source.

row_count

Key

The count of rows for data associated to a Key. Only available when query-type is either execute_query or execute_row.

Note: Table headers (insert-column-names set to "true") are NOT included in the row count value.

Source/sink-level built-in runtime variables for data sources & data sinks in the database category.

Key-level built-in runtime variables for data sources & data sinks in the database category.
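For example, a hedged sketch (the type code, query, and threshold are illustrative) of a scalar key whose result feeds an automated test:

{
    "type": "DKDataSource_MSSQL",
    "name": "mssql_datasource",
    "config": {{mssqlconfig}},
    "keys": {
        "count-orders": {
            "query-type": "execute_scalar",
            "sql": "SELECT COUNT(*) FROM orders",
            "set-runtime-vars": {
                "result": "order_count"
            }
        }
    },
    "tests": {
        "test-order-count": {
            "test-variable": "order_count",
            "action": "stop-on-error",
            "test-logic": "order_count > 0",
            "keep-history": true
        }
    }
}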

Row Count Test Results May Not Match

When row_count variables are used with file-based data sources and sinks, the resulting value includes a header row if one is present. Conversely, when row_count variables are used with database sources and sinks, the resulting value does not count the header row when database table headers are present. A database table header row is only included in the data set if the property key insert-column-names is set to "true."
