
s3

Stream data from an S3 object

Available from Hotrod: 3.1

| Field Name | Description | Type | Default |
| --- | --- | --- | --- |
| interval | How often to run the command | duration | - |
| cron | How often to run the command. Note that Hotrod uses a different format than cron: it includes a column for seconds. See full discussion | cron | - |
| immediate | Run as soon as invoked, instead of waiting for the specified cron interval | bool | false |
| random-offset | Sets a random offset to the schedule, then sticks to it | duration | 0s |
| window | For resources that need a time window to be specified | Window | - |
| block | Block further input schedules from triggering if the pipe output is retrying | bool | false |
| bucket-name | The storage service container (bucket) for the blobs | string | - |
| object-names | The names of the blobs | array of strings | - |
| object-name-field | The field that a blob name from an operation should be stored in | field | - |
| creation-time-field | The field that the blob creation time should be stored in | field | - |
| last-modified-field | The field that the blob last modified time should be stored in | field | - |
| content-length-field | The field that the blob content length information should be stored in | field | - |
| content-type-field | The field that the blob content type information should be stored in | field | - |
| etag-field | The field that the object ETag should be stored in | field | - |
| data-field | A field that the blob data should be nested in | field | - |
| region | Region | string | - |
| endpoint | S3 Endpoint | string | - |
| access-key | Access Key ID | string | - |
| secret-key | Secret Access Key | string | - |
| security-token | Security Token | string | - |
| session-token | Session Token | string | - |
| timestamp-mode | Derive a timestamp for this blob for filtering purposes, based on the selected strategy | S3ObjectTimestampMode | - |
| maximum-age | Remove any object older than this from the candidate list | MaxAgeSpecifier | - |
| mode | The operating mode for this input | S3BlockInputMode | - |
| fingerprinting | Enable object fingerprinting, which causes an object to be downloaded only once | bool | false |
| maximum-fingerprint-age | Remove any object fingerprints older than this from the tracker | MaxAgeSpecifier | 30 days |
| preprocessors | Preprocessors process downloaded data before making it available to the pipeline; they run in the order they are specified | PreProcessor | - |

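Example (illustrative sketch): a minimal s3 input that downloads named objects. This follows the action layout of the examples below, with s3 in place of exec; the bucket, object names, region, and credentials are all placeholders.

action:
  s3:
    bucket-name: my-bucket               # placeholder bucket
    object-names:
      - logs/app.json                    # placeholder object key
    region: us-east-1                    # placeholder region
    access-key: YOUR_ACCESS_KEY_ID       # placeholder credentials
    secret-key: YOUR_SECRET_ACCESS_KEY
    mode: download-objects
    interval: 1h
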
interval

How often to run the command

By default, interval: 0s, which means the command runs once. Note that scheduled inputs set document markers. See full discussion

Type: duration

Example

action:
  exec:
    command: echo 'once a day'
    interval: 1d

cron

How often to run the command. Note that Hotrod uses a different format than cron: it includes a column for seconds. See full discussion

Type: cron

Example: Once a day

action:
  exec:
    command: echo 'once a day'
    cron: '0 0 0 * * *'

Example: Once a day, using a convenient shortcut

action:
  exec:
    command: echo 'once a day'
    cron: '@daily'

immediate

Run as soon as invoked, instead of waiting for the specified cron interval

Type: bool

Example: Run immediately on invocation, and thereafter at 10:00 every morning

action:
  exec:
    command: echo 'hello'
    immediate: true
    cron: '0 0 10 * * *'

random-offset

Sets a random offset to the schedule, then sticks to it

This can help avoid the thundering-herd problem, where, for example, you do not want every pipe to hit the same service at exactly 00:00:00

Type: duration

Example: Fires up to a minute after the top of every hour

action:
  exec:
    command: echo 'hello'
    random-offset: 1m
    cron: '0 0 * * * *'

window

For resources that need a time window to be specified

Type: Window

| Field Name | Description | Type | Default |
| --- | --- | --- | --- |
| size | Window size | duration | - |
| offset | Window offset | duration | 0s |
| start-time | Allows the windowing to start at a specified time | time | - |
| highwatermark-file | File where the window timestamp is stored, so that the pipe can resume after a restart | path | - |

size

Window size

Type: duration

Example

action:
  exec:
    command: echo 'one two'
    window:
      size: 1m

offset

Window offset

Type: duration

Example

action:
  exec:
    command: echo 'one two'
    window:
      size: 1m
      offset: 10s

start-time

Allows the windowing to start at a specified time

It should be in the following format: 2019-07-10 18:45:00.000 +0200

Type: time

Example

action:
  exec:
    command: echo 'one two'
    window:
      size: 1m
      start-time: '2019-07-10 18:45:00.000 +0200'

highwatermark-file

Specifies a file where the window timestamp is stored, so that the pipe can resume from where it left off after a restart

Type: path

Example

action:
  exec:
    command: echo 'one two'
    window:
      size: 1m
      highwatermark-file: /tmp/mark.txt

block

Block further input schedules from triggering if the pipe output is retrying

Type: bool
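
Example (illustrative sketch): hold back further scheduled runs while the output is still retrying. The bucket and schedule are placeholders, and s3 is nested under action by analogy with the exec examples above.

action:
  s3:
    bucket-name: my-bucket    # placeholder
    mode: list-objects
    interval: 5m
    block: true               # do not trigger again while the output is retrying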

bucket-name

The storage service container (bucket) for the blobs

Type: string

object-names

The names of the blobs

Type: array of strings
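
Example (illustrative sketch): fetch two named objects from one bucket; all names are placeholders.

action:
  s3:
    bucket-name: my-bucket      # placeholder
    mode: download-objects
    object-names:
      - reports/summary.json    # placeholder keys
      - reports/details.json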

object-name-field

The field that a blob name from an operation should be stored in

Type: field

creation-time-field

The field that the blob creation time should be stored in

Type: field

last-modified-field

The field that the blob last modified time should be stored in

Type: field

content-length-field

The field that the blob content length information should be stored in

Type: field

content-type-field

The field that the blob content type information should be stored in

Type: field

etag-field

The field that the object ETag should be stored in

Type: field

data-field

A field that the blob data should be nested in

Type: field
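
Example (illustrative sketch): store object metadata alongside the payload. The target field names on the right are placeholders; pick whatever fits your pipeline.

action:
  s3:
    bucket-name: my-bucket            # placeholder
    mode: list-and-download-objects
    object-name-field: name           # where the object key lands
    last-modified-field: modified     # service-reported modification time
    content-length-field: size
    etag-field: etag
    data-field: data                  # nest the downloaded payload here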

region

Region

Type: string

endpoint

S3 Endpoint

Type: string

access-key

Access Key ID

Type: string

secret-key

Secret Access Key

Type: string

security-token

Security Token

Type: string

session-token

Session Token

Type: string
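
Example (illustrative sketch): an explicit region, endpoint, and static credentials. All values are placeholders; an explicit endpoint is typically only needed for S3-compatible services other than AWS.

action:
  s3:
    bucket-name: my-bucket              # placeholder
    region: us-east-1                   # placeholder
    endpoint: https://s3.example.com    # placeholder, for S3-compatible stores
    access-key: YOUR_ACCESS_KEY_ID      # placeholder credentials
    secret-key: YOUR_SECRET_ACCESS_KEY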

timestamp-mode

Derive a timestamp for this blob for filtering purposes based on the selected strategy.

Type: S3ObjectTimestampMode

| Field Name | Description | Type | Default |
| --- | --- | --- | --- |
| none | The default mode; do not filter objects based on timestamps | - | - |
| last-modified | Filter objects on the last-modified timestamp reported by the service | - | - |
| blob-name-pattern | Filter blobs on a timestamp derived from the object name, for example: object-name-pattern: =(?P<Y>[\\d]{4,4})-(?P<m>[\\d]{2,2})-(?P<d>[\\d]{2,2})/ | string | - |

none

The default mode; do not filter objects based on timestamps

last-modified

Filter objects on the last-modified timestamp reported by the service

blob-name-pattern

Filter blobs on a timestamp derived from the object name, for example: object-name-pattern: =(?P<Y>[\\d]{4,4})-(?P<m>[\\d]{2,2})-(?P<d>[\\d]{2,2})/

Type: string
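
Example (illustrative sketch): derive each object's timestamp from a dated key prefix, using the pattern from the table above. The nesting and quoting of timestamp-mode are assumptions based on that table.

action:
  s3:
    bucket-name: my-bucket    # placeholder
    mode: list-and-download-objects
    timestamp-mode:
      blob-name-pattern: '=(?P<Y>[\d]{4,4})-(?P<m>[\d]{2,2})-(?P<d>[\d]{2,2})/'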

maximum-age

Remove any object older than this from the candidate list

Type: MaxAgeSpecifier

| Field Name | Description | Type | Default |
| --- | --- | --- | --- |
| seconds | Specify the maximum age as a number of seconds | integer | - |
| duration | Specify the maximum age as a human-readable duration (example: 1 hour) | string | - |

seconds

Specify the maximum age as a number of seconds

Type: integer

duration

Specify the maximum age as a human-readable duration (example: 1 hour)

Type: string
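
Example (illustrative sketch): ignore objects more than an hour old; per the table above, seconds: 3600 should be an equivalent spelling. The nested form is an assumption based on the field table.

action:
  s3:
    bucket-name: my-bucket    # placeholder
    mode: list-objects
    maximum-age:
      duration: 1 hour        # or: seconds: 3600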

mode

The operating mode for this input

Type: S3BlockInputMode

| Field Name | Description | Type | Default |
| --- | --- | --- | --- |
| list-objects | List objects | - | - |
| download-objects | Download the given objects | - | - |
| list-and-download-objects | List objects and download them | - | - |

list-objects

List objects

download-objects

Download the given objects

list-and-download-objects

List objects and download them
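
Example (illustrative sketch): list the bucket and download whatever is found on each run; the bucket and schedule are placeholders.

action:
  s3:
    bucket-name: my-bucket    # placeholder
    mode: list-and-download-objects
    interval: 10m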

fingerprinting

Enable object fingerprinting, which causes an object to be downloaded only once

Type: bool
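
Example (illustrative sketch): fingerprint objects so repeated listings do not download the same object twice; the bucket is a placeholder.

action:
  s3:
    bucket-name: my-bucket    # placeholder
    mode: list-and-download-objects
    fingerprinting: true      # download each object at most once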

maximum-fingerprint-age

Remove any object fingerprints older than this from the tracker

Type: MaxAgeSpecifier

| Field Name | Description | Type | Default |
| --- | --- | --- | --- |
| seconds | Specify the maximum age as a number of seconds | integer | - |
| duration | Specify the maximum age as a human-readable duration (example: 1 hour) | string | - |

seconds

Specify the maximum age as a number of seconds

Type: integer

duration

Specify the maximum age as a human-readable duration (example: 1 hour)

Type: string
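
Example (illustrative sketch): expire fingerprints after a week instead of the 30-day default; values are illustrative, and the nested form is an assumption based on the field table.

action:
  s3:
    bucket-name: my-bucket    # placeholder
    fingerprinting: true
    maximum-fingerprint-age:
      duration: 7 days        # forget fingerprints older than this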

preprocessors

Preprocessors process downloaded data before making it available to the pipeline. They run in the order they are specified.

Type: PreProcessor

| Field Name | Description | Type | Default |
| --- | --- | --- | --- |
| extension | Preprocess the object or blob based on the extension of the object or blob name (.gz, .parquet) | - | - |
| gzip | Gunzip (decompress) the received data | - | - |
| parquet | Extract the received data as JSON rows from a Parquet file | - | - |

extension

Preprocess the object or blob based on the extension of the object or blob name (.gz, .parquet)

gzip

Gunzip (decompress) the received data

parquet

Extract the received data as JSON rows from a Parquet file
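
Example (illustrative sketch): gunzip each downloaded object before it enters the pipeline. This assumes preprocessors are written as a list of the variant names above; list order is run order.

action:
  s3:
    bucket-name: my-bucket    # placeholder
    mode: list-and-download-objects
    preprocessors:
      - gzip                  # decompress before handing data to the pipe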