DataKitchen DataOps Documentation

Azure Data Lake Storage Gen2

Tool Documentation

Source & Sink Category

Azure Data Lake Storage Data Sources and Sinks are of the file-based category.

Type Codes

| Azure Data Lake Storage | Type Code |
| --- | --- |
| Data Source | `DKDataSource_ADLS2` |
| Data Sink | `DKDataSink_ADLS2` |

Connection Properties

To use the DataKitchen connector, you must have a pre-existing filesystem (sometimes referred to as a "container").

| Field Name | Scope | Field Type | Required? | Description |
| --- | --- | --- | --- | --- |
| `connection_string` | Source/Sink | String | Required | Secret access string used to identify the Azure account and its permissions. |
| `filesystem` | Source/Sink | String | Required | Name of the filesystem/container to operate on. The filesystem must exist before writing to a path with the ADLS connector. |

Local Connection Settings

You can find access keys and connection strings in your Azure Storage account settings. Access keys serve as the credentials for your storage account, and connection strings contain the information the DataKitchen platform needs to connect and access data.

See Microsoft instructions to view and copy a connection string.
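An Azure Storage connection string is a semicolon-delimited list of `Key=Value` pairs (for example `AccountName`, `AccountKey`, `EndpointSuffix`). The sketch below, using only the Python standard library, shows how such a string breaks into its parts; the account name and key are placeholders, not real credentials.

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split a semicolon-delimited 'Key=Value' connection string into a dict."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment:
            continue
        # partition on the first '=' only: base64 account keys end in '=='
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts

# Placeholder values for illustration only.
example = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=mystorageaccount;"
    "AccountKey=abc123==;"
    "EndpointSuffix=core.windows.net"
)
print(parse_connection_string(example)["AccountName"])  # mystorageaccount
```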

Define at Kitchen Level

```json
{
    "connection_string": "#{vault://adls2/connection_string}",
    "filesystem": "datakitchen-staging"
}
```

Expanded Syntax

```json
{
    "type": "DKDataSource_ADLS2",
    "name": "azuredatalake_datasource",
    "connection_string": "{{adls2Config.connection_string}}",
    "filesystem": "{{adls2Config.filesystem}}",
    "keys": {
        "blob_to_blob_source": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "post_download_md5"
            }
        }
    }
}
```

```json
{
    "type": "DKDataSink_ADLS2",
    "name": "azuredatalake_datasink",
    "connection_string": "{{adls2Config.connection_string}}",
    "filesystem": "{{adls2Config.filesystem}}",
    "keys": {
        "blob_to_blob_sink": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "pre_upload_md5"
            }
        }
    }
}
```

Condensed Syntax

Check Your Syntax

Do not use quotes for your condensed connection configuration variables.

```json
{
    "type": "DKDataSource_ADLS2",
    "name": "azuredatalake_datasource",
    "config": {{adls2Config}},
    "keys": {},
    "tests": {}
}
```

```json
{
    "type": "DKDataSink_ADLS2",
    "name": "azuredatalake_datasink",
    "config": {{adls2Config}},
    "keys": {},
    "tests": {}
}
```

Optional Properties

Common File-Based Category Properties

| Field Name | Scope | Field Type | Description |
| --- | --- | --- | --- |
| `cache` | Key | Boolean | Caches the data of a given data source. Other data sources of the same type and Key name may leverage the cached data to prevent repetitive retrievals. This feature does not presently support OrderRun resume and is therefore not recommended. |
| `complete` | Key | Boolean | |
| `decrypt-key` | Key (Sources) | String | The key used to decrypt a file. |
| `decrypt-passphrase` | Key (Sources) | String | The key passphrase used to decrypt a file. |
| `encrypt-key` | Key (Sinks) | String | Specifies the key used to encrypt a file. |
| `file-key` | Key | String | Denotes the file being picked/pushed for explicit Keys. The path is built as `key/file-key`. Specifies either the file name or the full path of the file. See `use-only-file-key`. |
| `set-runtime-vars` | Source/Sink, Key | Dictionary | Declares runtime variables set equal to built-in variables. Can be applied at the Source/Sink level or the Key level. |
| `use-only-file-key` | Key | Boolean | Optional; defaults to `false`. When set to `true`, only the `file-key` value is used as the path/file spec to reference the file, and the Key name is ignored. Applied at the Source/Sink level when using wildcards. For explicitly specified Keys, applied at the Key level for each Key. |
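The interaction between `file-key` and `use-only-file-key` can be sketched as a small path-resolution helper. This is illustrative Python, not DataKitchen source code: by default the blob path is built as `key/file-key`, and setting `use-only-file-key` makes the connector use the `file-key` value alone.

```python
def resolve_path(key_name: str, key_config: dict) -> str:
    """Build the blob path for a Key per the documented rules (sketch)."""
    file_key = key_config["file-key"]
    if key_config.get("use-only-file-key", False):
        # Only the file-key is used as the path/file spec; the Key name is ignored.
        return file_key
    # Default: the path is built as key/file-key.
    return f"{key_name}/{file_key}"

print(resolve_path("azure_source", {"file-key": "test_upload.json"}))
# azure_source/test_upload.json
print(resolve_path("azure_source", {"file-key": "test_upload.json",
                                    "use-only-file-key": True}))
# test_upload.json
```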

Wildcards

| Field Name | Scope | Field Type | Description |
| --- | --- | --- | --- |
| `wildcard` | Source/Sink | String | Specifies a glob wildcard expression to pick/push the set of files that match it. Wildcards apply only to a single directory; they are not recursive. Use multiple Data Sources or Data Sinks to pull or push files across multiple directories. |
| `wildcard-key-prefix` | Source/Sink | String | Specifies the path prefix for the given wildcard expression. When using wildcard mappings on a DataMapper node, it specifies the prefix with the base path where files will be stored. |
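The single-directory, non-recursive behavior described above can be sketched with the standard-library `fnmatch` module. The file listing and `wildcard/` prefix below are illustrative; the point is that the pattern is matched against bare file names within one directory, so files in subdirectories never match.

```python
import fnmatch

# Stand-in for a container listing; only the first two should match "*.json"
# under the "wildcard/" prefix, because matching is not recursive.
listing = ["wildcard/a.json", "wildcard/b.json",
           "wildcard/notes.txt", "wildcard/sub/c.json"]

prefix = "wildcard/"
matches = [
    path for path in listing
    if path.startswith(prefix)
    and "/" not in path[len(prefix):]                 # stay within one directory
    and fnmatch.fnmatch(path[len(prefix):], "*.json")  # glob on the bare name
]
print(matches)  # ['wildcard/a.json', 'wildcard/b.json']
```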

Built-in Runtime Variables

| Built-in Runtime Variable | Scope | Description |
| --- | --- | --- |
| `key_count` | Source/Sink | Exports the total count of Keys in a Data Source or Data Sink, regardless of whether a Key is wildcard-generated or explicitly configured by a user. |
| `key_files` | Source/Sink | Exports a list of file names, with paths, for all the files that match the wildcard. |
| `key_names` | Source/Sink | Exports a list of Key names generated by the wildcard. |
| `row_count` | Key | Exports the line count of any kind of text file (txt, csv, json, etc.) associated with a Key. If a header is present, it is included in the row count value. |
| `size` | Key | Exports the size of the file associated with a Key, in bytes. |
| `md5` | Key | Exports the MD5 hash of the file associated with a Key. |
| `sha` | Key | Exports the SHA hash of the file associated with a Key. |
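The per-Key built-ins above have straightforward definitions that can be reproduced from a file's raw bytes with the standard library. This is an illustrative sketch, not DataKitchen source code; in particular, the docs do not state which SHA variant `sha` uses, so SHA-1 below is an assumption.

```python
import hashlib

# Stand-in for the bytes of a downloaded blob: two JSON lines.
contents = b'{"a": 1}\n{"a": 2}\n'

size = len(contents)                              # "size": bytes on disk
row_count = contents.decode("utf-8").count("\n")  # "row_count": header included
md5 = hashlib.md5(contents).hexdigest()           # "md5"
sha = hashlib.sha1(contents).hexdigest()          # "sha" (variant assumed)
print(size, row_count)  # 18 2
```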

UTF-8 File Encoding Required

Files used with data sources and data sinks must be encoded in UTF-8 in order to avoid non-Unicode characters causing errors with row-count tests and problems with sinking data to database tables. For CSV and other delimited files, use the "save as" function in your application and select the proper encoding, or consider using a text editor with encoding options.
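If a delimited file arrives in a legacy encoding, it can be converted before use. A minimal sketch, with illustrative file names, re-encoding a Latin-1 CSV as the required UTF-8:

```python
# Create a sample Latin-1 file to convert (illustrative data).
raw = "id,name\n1,Côte d'Or\n".encode("latin-1")
with open("input_latin1.csv", "wb") as f:
    f.write(raw)

# Decode from the source encoding and write back out as UTF-8.
with open("input_latin1.csv", "r", encoding="latin-1") as src, \
     open("output_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())
```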

Data Source Example

The ADLS2 data source below loads all JSON blob files present in the `wildcard/` directory with a wildcard key. It also loads the specific `test_upload.json` blob with a file key. The `adls2Config` variable defines the source account and container for these files.

The source, when finished loading the file, stores the file’s md5 hash in the post_download_md5 runtime variable. As a file integrity test, the source then compares post_download_md5 to a predefined pre_upload_md5 variable.

```json
{
    "name": "source",
    "type": "DKDataSource_ADLS2",
    "config-ref": "adls2Config",
    "wildcard": "*.json",
    "wildcard-key-prefix": "wildcard/",
    "keys": {
        "azure_source": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "post_download_md5"
            }
        }
    },
    "tests": {
        "verify_data": {
            "action": "stop-on-error",
            "test-variable": "pre_upload_md5",
            "type": "test-contents-as-string",
            "test-logic": "pre_upload_md5 == {{post_download_md5}}"
        }
    }
}
```

One could also run a test to compare the pre_upload and post_download runtime variables. See Tests for more information and examples.

Data Sink Example

The ADLS2 data sink below uploads a single file named test_upload.json to a blob on the Azure account and container defined by the adls2Config variable. After uploading, the sink stores the md5 hash of the file in the pre_upload_md5 runtime variable for later use.

```json
{
    "name": "sink",
    "type": "DKDataSink_ADLS2",
    "config-ref": "adls2Config",
    "keys": {
        "azure_sink": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "pre_upload_md5"
            }
        }
    }
}
```

