build.yaml currently supports job outputs and inputs. An output may refer to any directory or file in the /io filesystem. Outputs are mapped to an “output filesystem” whose root roughly corresponds to a folder in GCS. An input may refer to any directory or file in the output filesystem, including sub-directories that are never mentioned explicitly but lie inside an explicitly mentioned output directory. Consider a runImage job with this file hierarchy:
    /io/
      foo
      dir/
        bar
        dir2/
          baz
This is a valid outputs section:

    outputs:
      - from: /io/file
        to: /foo/file
      - from: /io/dir
        to: /dir
Another job that depends on it may specify this inputs section:

    inputs:
      - from: /foo/file
        to: /io/foo/file
      - from: /dir/dir2
        to: /io/dir2
      - from: /dir/bar
        to: /io/dir/bar
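The path mapping in this example can be traced with a small sketch. It is only an illustration, not CI's implementation: the helper `remap` and the list `produced` are hypothetical, and the sketch assumes the producing job really has a file at /io/file.

```python
def remap(path, rules):
    """Rewrite `path` with the first rule whose `from` equals it or is an ancestor of it."""
    for rule in rules:
        src = rule["from"].rstrip("/")
        dst = rule["to"].rstrip("/")
        if path == src:
            return dst
        if path.startswith(src + "/"):
            # Keep the part of the path below the matched directory.
            return dst + path[len(src):]
    return None  # no rule covers this path


outputs = [
    {"from": "/io/file", "to": "/foo/file"},
    {"from": "/io/dir", "to": "/dir"},
]
inputs = [
    {"from": "/foo/file", "to": "/io/foo/file"},
    {"from": "/dir/dir2", "to": "/io/dir2"},
    {"from": "/dir/bar", "to": "/io/dir/bar"},
]

# Files assumed to exist in the producing job (hypothetical).
produced = ["/io/file", "/io/dir/bar", "/io/dir/dir2/baz"]

# Where each file lands in the output filesystem.
stored = [remap(p, outputs) for p in produced]

# Where each stored file lands in the consuming job.
received = [remap(p, inputs) for p in stored]
```

Note that /dir/dir2 never appears in the outputs section; it is reachable as an input only because it sits inside the explicitly mentioned output directory /dir.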
This design has a major issue:
gsutil takes orders of magnitude more time to recursively copy a file hierarchy with many paths than it takes to tar the hierarchy, copy the tar, and untar on the receiver.
Users can currently work around this issue by explicitly running tar and untar in the runImage steps that respectively produce and consume the files. Unfortunately, if a buildImage step depends on only a sub-directory of a tar’ed directory, the image still grows by at least the size of the entire tar file: Docker’s COPY adds the whole tar to the image as a layer, and deleting the tar in a later step does not shrink that layer.
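The manual workaround amounts to a pack step in the producing job and an unpack step in the consuming job. The following is a minimal sketch using Python's standard `tarfile` module rather than the tar command line; the function names `pack`, `unpack`, and `demo`, and all paths, are hypothetical.

```python
import os
import tarfile
import tempfile


def pack(src_dir, tar_path):
    """Producer side: bundle a whole directory into one gzip-compressed tar."""
    with tarfile.open(tar_path, "w:gz") as tar:
        # Store entries under the directory's own name, e.g. "dir/bar".
        tar.add(src_dir, arcname=os.path.basename(src_dir))


def unpack(tar_path, dest_dir):
    """Consumer side: restore the directory from the single tar file."""
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(dest_dir)


def demo():
    """Round-trip the example hierarchy dir/{bar, dir2/baz} through one archive."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "io", "dir")
        os.makedirs(os.path.join(src, "dir2"))
        for rel in ("bar", os.path.join("dir2", "baz")):
            with open(os.path.join(src, rel), "w") as f:
                f.write(rel)

        tar_path = os.path.join(tmp, "dir.tar.gz")
        pack(src, tar_path)

        dest = os.path.join(tmp, "received")
        os.makedirs(dest)
        unpack(tar_path, dest)

        # List every restored file relative to the destination.
        return sorted(
            os.path.relpath(os.path.join(root, name), dest)
            for root, _, names in os.walk(dest)
            for name in names
        )
```

Only one object (the tar) crosses the network, which is what makes this fast, but the consumer always receives the entire archive even when it needs a single sub-directory.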
This proposal has three goals:
- Extend the inputs and outputs syntax and functionality to enable tar’ing files.
- Extend the inputs and outputs syntax and functionality to enable extracting all or part of a tar.
- Force users to explicitly choose whether to tar or recursively copy a directory, thus avoiding accidentally terrible performance.
Syntax & Semantics
    outputs:
      # copies a file, errors if /foo/bar is a dir
      - from: /foo/bar
        to: /foo/bar
      # recursively copy the contents of /foo/bar to the output folder /foo/bar
      - from: /foo/bar
        to: /foo/bar
        directory: recursive
      # copy an "archive" of the contents of /foo/bar to the output file /foo/bar
      - from: /foo/bar
        to: /foo/bar
        directory: archive
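The rule that a directory must carry an explicit `directory:` choice can be sketched as a small validation step. This is an assumption about how CI might check an entry, not its actual code; `plan_output` and the returned action names are hypothetical. The proposal does not say what happens when `directory:` is set on a plain file, so the sketch leaves that case unchecked.

```python
def plan_output(rule, is_dir):
    """Pick the action for one outputs entry.

    `rule` is one parsed entry, e.g. {"from": ..., "to": ..., "directory": ...}.
    `is_dir` says whether rule["from"] is a directory in the job's filesystem.
    """
    mode = rule.get("directory")
    if mode is None:
        if is_dir:
            # No silent recursive copy: the user must choose a mode.
            raise ValueError(
                f"{rule['from']} is a directory; "
                "set 'directory: recursive' or 'directory: archive'"
            )
        return "copy-file"
    if mode == "recursive":
        return "copy-recursive"
    if mode == "archive":
        return "archive"
    raise ValueError(f"unknown directory mode: {mode!r}")
```

For example, `plan_output({"from": "/foo/bar", "to": "/foo/bar", "directory": "archive"}, is_dir=True)` returns `"archive"`, while the same entry without `directory` raises an error.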
    inputs:
      # copies a file, errors if /foo/bar is a dir
      - from: /foo/bar
        to: /foo/bar
      # recursively download all of /foo/bar
      - from: /foo/bar
        to: /foo/bar
        directory: recursive
      # recursively download all of /foo/bar/baz
      - from: /foo/bar/baz
        to: /foo/bar/baz
        directory: recursive
      # extract the contents of the archived directory /foo/bar
      # into the input folder /foo/bar
      - from: /foo/bar
        to: /foo/bar
        directory: archive
      # extract a single file or subdir out of the archive /foo/bar
      # into the input folder /foo/bar
      - from: /foo/bar
        to: /foo/bar
        directory: archive
        extract:
          - baz
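One way the `extract:` list could behave is shown in the sketch below, which uses Python's standard `tarfile` module. It assumes the archive is a tar whose member names are relative to the archived directory with no leading "./"; the names `extract_subset` and `demo` are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def extract_subset(tar_path, dest_dir, extract=None):
    """Extract the whole archive, or only members named in or nested under `extract`."""
    with tarfile.open(tar_path, "r:*") as tar:
        members = tar.getmembers()
        if extract is not None:
            wanted = [e.strip("/") for e in extract]
            members = [
                m for m in members
                if any(m.name == w or m.name.startswith(w + "/") for w in wanted)
            ]
        tar.extractall(dest_dir, members=members)
    return [m.name for m in members]


def demo():
    """Build a tiny archive, then extract only the `baz` sub-directory."""
    with tempfile.TemporaryDirectory() as tmp:
        tar_path = os.path.join(tmp, "bar.tar.gz")
        with tarfile.open(tar_path, "w:gz") as tar:
            for name in ("baz/a.txt", "qux.txt"):
                data = name.encode()
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

        dest = os.path.join(tmp, "foo", "bar")
        os.makedirs(dest)
        names = extract_subset(tar_path, dest, extract=["baz"])
        return names, sorted(os.listdir(dest))
```

Because only the selected members are written to the input folder, a buildImage step that needs just `baz` would no longer have to COPY the whole tar into the image.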
CI is free to compress and decompress in whatever manner it sees fit. I propose that CI use gzip to compress and gunzip to decompress. Compression is done in the output container and decompression in the input container.