Version: 3.4.0

s3

Stream data from an S3 object

Available from Hotrod: 3.1

Field Name	Description	Type	Default
interval	How often to run the command	duration	-
cron	How often to run the command. Note that Hotrod's format differs from standard cron: it includes a column for seconds. See full discussion	cron	-
immediate	Run as soon as invoked, instead of waiting for the specified cron interval	bool	false
random-offset	Sets a random offset to the schedule, then sticks to it	duration	0s
window	For resources that need a time window to be specified	Window	-
block	Block further input schedules from triggering if the pipe output is retrying	bool	false
bucket-name	The storage service container for created blobs	string	-
object-names	The name for the blob	array of strings	-
object-name-field	The field that a blob name from an operation should be stored in	field	-
creation-time-field	The field that the blob creation time should be stored in	field	-
last-modified-field	The field that the blob last modified time should be stored in	field	-
content-length-field	The field that the blob content length information should be stored in	field	-
content-type-field	The field that the blob content type information should be stored in	field	-
etag-field	The field that the object ETag should be stored in	field	-
data-field	A field that the blob data should be nested in	field	-
region	Region	string	-
endpoint	S3 Endpoint	string	-
access-key	Access Key ID	string	-
secret-key	Secret Access Key	string	-
security-token	Security Token	string	-
session-token	Session Token	string	-
timestamp-mode	Derive a timestamp for this blob for filtering purposes based on the selected strategy.	S3ObjectTimestampMode	-
maximum-age	Remove any object older than this age from the candidate list	MaxAgeSpecifier	-
mode	The operating mode for this input	S3BlockInputMode	-
fingerprinting	Enable object fingerprinting, which causes an object to be downloaded only once	bool	false
maximum-fingerprint-age	Remove any object fingerprints older than this from the tracker	MaxAgeSpecifier	30 days
preprocessors	Preprocessors, which process downloaded data before it is made available to the pipeline. They are run in the order in which they are specified	PreProcessor	-
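
The fields in the table above might be combined as in the following minimal sketch. It assumes the input is declared under an `s3` key in the same way the `exec` examples on this page are, and that `mode` is written as a plain string. The bucket name, region, and output field names are hypothetical, and credentials are omitted.

```yaml
# Sketch only: bucket, region and field names are hypothetical.
s3:
  bucket-name: example-bucket
  region: us-east-1
  interval: 5m                       # poll the bucket every five minutes
  mode: list-and-download-objects    # assumed to be written as a plain string
  object-name-field: object_name     # field that receives the object name
  last-modified-field: last_modified # field that receives the last-modified time
  data-field: data                   # field the object data is nested in
```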

interval

How often to run the command

The default is interval: 0s, which means the command runs once. Note that scheduled inputs set document markers. See full discussion

Type: duration

Example

action:

exec:
  command: echo 'once a day'
  interval: 1d

cron

How often to run the command. Note that Hotrod's format differs from standard cron: it includes a column for seconds. See full discussion

Type: cron

Example: Once a day

action:

exec:
  command: echo 'once a day'
  cron: '0 0 0 * * *'

Example: Once a day, using a convenient shortcut

action:

exec:
  command: echo 'once a day'
  cron: '@daily'

immediate

Run as soon as invoked, instead of waiting for the specified cron interval

Type: bool

Example: Run immediately on invocation, and thereafter at 10h every morning

action:

exec:
  command: echo 'hello'
  immediate: true
  cron: '0 0 10 * * *'

random-offset

Sets a random offset to the schedule, then sticks to it

This can help avoid the thundering herd problem: for example, you may not want every schedule to hit some service at exactly 00:00:00.

Type: duration

Example: Would fire up to a minute after every hour

action:

exec:
  command: echo 'hello'
  random-offset: 1m
  cron: '0 0 * * * *'

window

For resources that need a time window to be specified

Type: Window

Field Name	Description	Type	Default
size	Window size	duration	-
offset	Window offset	duration	0s
start-time	Allows the windowing to start at a specified time	time	-
highwatermark-file	File in which the timestamp is stored, so that windowing can resume after the Pipe has been restarted	path	-

size

Window size

Type: duration

Example

action:

exec:
  command: echo 'one two'
  window:
    size: 1m

offset

Window offset

Type: duration

Example

action:

exec:
  command: echo 'one two'
  window:
    size: 1m
    offset: 10s

start-time

Allows the windowing to start at a specified time

It should be in the following format: 2019-07-10 18:45:00.000 +0200

Type: time

Example

action:

exec:
  command: echo 'one two'
  window:
    size: 1m
    start-time: '2019-07-10 18:45:00.000 +0200'

highwatermark-file

File in which the timestamp is stored, so that windowing can resume after the Pipe has been restarted

Type: path

Example

action:

exec:
  command: echo 'one two'
  window:
    size: 1m
    highwatermark-file: /tmp/mark.txt

timestamp-mode

Derive a timestamp for this blob for filtering purposes based on the selected strategy.

Type: S3ObjectTimestampMode

Field Name	Description	Type	Default
none	The default mode; do not filter objects based on timestamps	-	-
last-modified	Filter objects on the last-modified timestamp reported by the service	-	-
blob-name-pattern	Filter blobs on the timestamp derived from the object name, for example: `object-name-pattern: =(?P<Y>[\\d]{4,4})-(?P<m>[\\d]{2,2})-(?P<d>[\\d]{2,2})/`	string	-

none

The default mode; do not filter objects based on timestamps

last-modified

Filter objects on the last-modified timestamp reported by the service

blob-name-pattern

Filter blobs on the timestamp derived from the object name, for example: object-name-pattern: =(?P<Y>[\\d]{4,4})-(?P<m>[\\d]{2,2})-(?P<d>[\\d]{2,2})/

Type: string
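
As an illustration of selecting a strategy, the sketch below picks the last-modified mode. It assumes a variant that takes no value can be written as a plain string; the bucket name is hypothetical.

```yaml
# Sketch only: the bucket name is hypothetical.
s3:
  bucket-name: example-bucket
  # Assumption: a value-less variant is written as a plain string.
  timestamp-mode: last-modified
```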

maximum-age

Remove any object older than this age from the candidate list

Type: MaxAgeSpecifier

Field Name	Description	Type	Default
seconds	Specify the maximum age in number of seconds	integer	-
duration	Specify the maximum age as a human readable duration (example: 1 hour)	string	-

seconds

Specify the maximum age in number of seconds

Type: integer

duration

Specify the maximum age as a human readable duration (example: 1 hour)

Type: string
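
The two ways of giving the age might look like the sketch below. It assumes the chosen specifier is nested under `maximum-age` as a key, and that a timestamp-mode other than none is needed for the age filter to have a timestamp to compare against. The bucket name is hypothetical.

```yaml
# Sketch only: the bucket name is hypothetical.
s3:
  bucket-name: example-bucket
  timestamp-mode: last-modified   # assumed prerequisite for age filtering
  maximum-age:
    duration: 1 hour              # human-readable form, as in the description above
    # seconds: 3600               # assumed alternative: the same age as an integer
```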

mode

The operating mode for this input

Type: S3BlockInputMode

Field Name	Description	Type	Default
list-objects	List Objects	-	-
download-objects	Download Given Objects	-	-
list-and-download-objects	List Objects and Download	-	-

list-objects

List Objects

download-objects

Download Given Objects

list-and-download-objects

List Objects and Download
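
For example, downloading a fixed set of objects might be configured as in the sketch below. It assumes the mode is written as a plain string and that download-objects reads the names from `object-names`; the bucket and object names are hypothetical.

```yaml
# Sketch only: bucket and object names are hypothetical.
s3:
  bucket-name: example-bucket
  mode: download-objects          # assumed to be written as a plain string
  object-names:                   # assumed to be the objects this mode downloads
    - logs/2019-07-10/app.log
    - logs/2019-07-11/app.log
```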

fingerprinting

Enable object fingerprinting, which causes an object to be downloaded only once

Type: bool

maximum-fingerprint-age

Remove any object fingerprints older than this from the tracker

Type: MaxAgeSpecifier

Field Name	Description	Type	Default
seconds	Specify the maximum age in number of seconds	integer	-
duration	Specify the maximum age as a human readable duration (example: 1 hour)	string	-

seconds

Specify the maximum age in number of seconds

Type: integer

duration

Specify the maximum age as a human readable duration (example: 1 hour)

Type: string
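
Fingerprinting and its retention period might be combined as in the sketch below, which assumes the specifier is nested under `maximum-fingerprint-age` as a key. The bucket name and the 7-day value are hypothetical.

```yaml
# Sketch only: the bucket name and retention period are hypothetical.
s3:
  bucket-name: example-bucket
  interval: 5m
  fingerprinting: true            # download each object only once
  maximum-fingerprint-age:
    duration: 7 days              # drop fingerprints older than this from the tracker
```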

preprocessors

Preprocessors, which process downloaded data before it is made available to the pipeline. They are run in the order in which they are specified

Type: PreProcessor

Field Name	Description	Type	Default
extension	Preprocess the object or blob based on the extension of the object or blob name (.gz, .parquet)	-	-
gzip	UnGzip the received data	-	-
parquet	Extract the received data as JSON rows from a parquet file	-	-

extension

Preprocess the object or blob based on the extension of the object or blob name (.gz, .parquet)

gzip

UnGzip the received data

parquet

Extract the received data as JSON rows from a parquet file
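
Since preprocessors run in the order specified, they are presumably given as a list. The sketch below assumes each entry is the bare name of a preprocessor; the bucket name is hypothetical.

```yaml
# Sketch only: the bucket name is hypothetical.
s3:
  bucket-name: example-bucket
  mode: list-and-download-objects # assumed to be written as a plain string
  preprocessors:                  # assumed to be a list of bare preprocessor names
    - gzip                        # decompress the downloaded data first
```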
