Introduction
CI’s build.yaml currently supports job outputs and inputs. An output may refer to any directory or file in the /io filesystem. Outputs are mapped to an “output filesystem” whose root roughly corresponds to some folder in GCS. An input may refer to any directory or file in the “output filesystem,” even unmentioned sub-directories of explicitly mentioned output directories. Consider a runImage job with this file hierarchy:
/io/
  foo
  dir/
    bar
    dir2/
      baz
This is a valid outputs block:
outputs:
  - from: /io/file
    to: /foo/file
  - from: /io/dir
    to: /dir
Another job which depends on it may specify this inputs block:
inputs:
  - from: /foo/file
    to: /io/foo/file
  - from: /dir/dir2
    to: /io/dir2
  - from: /dir/bar
    to: /io/dir/bar
This design has a major issue: gsutil takes orders of magnitude more time to recursively copy a file hierarchy with many paths than it takes to tar the hierarchy, copy the tar, and untar on the receiver.
Users may currently address this issue by explicitly using tar and untar in runImage steps that respectively produce and consume files. Unfortunately, if a buildImage step depends on only a sub-directory of a tar’ed directory, the image will grow by at least the size of the entire tar of the directory. This follows from the nature of Docker’s COPY: the whole tar is added to an image layer, and deleting it in a later step does not shrink the image.
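The tar, copy, untar workaround described above can be sketched with Python's standard tarfile module. This is only an illustration under stated assumptions: the function names are hypothetical, the upload and download step (gsutil in the text) is left out, and the temporary paths merely mimic the /io/dir example.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(src_dir: Path, archive_path: Path) -> None:
    """Store everything under src_dir in one uncompressed tar file."""
    with tarfile.open(archive_path, "w") as tar:
        for child in sorted(src_dir.iterdir()):
            # arcname keeps member names relative to src_dir.
            tar.add(child, arcname=child.name)


def unpack_directory(archive_path: Path, dest_dir: Path) -> None:
    """Recreate the archived hierarchy under dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Newer Python versions support extraction filters; use the safe one if present.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(dest_dir, **kwargs)


def round_trip_example() -> list:
    """Build a small hierarchy, tar it, untar it, and list the restored files."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "io" / "dir"
        (src / "dir2").mkdir(parents=True)
        (src / "bar").write_text("bar")
        (src / "dir2" / "baz").write_text("baz")

        archive = Path(tmp) / "dir.tar"
        pack_directory(src, archive)  # a single object to copy instead of many paths
        restored = Path(tmp) / "restored"
        unpack_directory(archive, restored)
        return sorted(
            p.relative_to(restored).as_posix()
            for p in restored.rglob("*")
            if p.is_file()
        )
```

The point of the sketch is that only one file crosses the network, no matter how many paths the hierarchy contains.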
Proposal
- Extend the inputs and outputs syntax and functionality to enable tar’ing files.
- Extend the inputs and outputs syntax and functionality to enable extracting part or all of a tar.
- Force users to explicitly choose to tar or recursively copy a directory, thus avoiding accidentally terrible performance.
Syntax & Semantics
outputs:
  # copies a file, errors if /foo/bar is a dir
  - from: /foo/bar
    to: /foo/bar
  # recursively copies the contents of /foo/bar to the output folder /foo/bar
  - from: /foo/bar
    to: /foo/bar
    directory: recursive
  # copies an "archive" of the contents of /foo/bar to the output file /foo/bar
  - from: /foo/bar
    to: /foo/bar
    directory: archive
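One way to read the outputs rules above is as a small decision function. The sketch below is hypothetical and not CI's actual code: the name choose_output_action and the returned strings are made up for illustration, and rejecting a directory mode on a plain file is an assumption, since the proposal does not say what should happen in that case.

```python
import os
from typing import Optional

# The two values of the "directory" key shown in the proposal.
VALID_MODES = ("recursive", "archive")


def choose_output_action(path: str, directory: Optional[str] = None) -> str:
    """Decide how an output entry should be copied, or raise if it is ambiguous."""
    if directory is not None and directory not in VALID_MODES:
        raise ValueError(f"unknown directory mode: {directory!r}")
    if os.path.isdir(path):
        if directory is None:
            # The user must pick a strategy explicitly for directories.
            raise ValueError(
                f"{path} is a directory; set 'directory: recursive' or 'directory: archive'"
            )
        return directory
    if directory is not None:
        # Assumption: a directory mode on a non-directory is treated as an error.
        raise ValueError(f"{path} is not a directory but 'directory: {directory}' was given")
    return "copy_file"
```

Failing early on a directory with no mode is what prevents a user from falling into a slow recursive copy by accident.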
inputs:
  # copies a file, errors if /foo/bar is a dir
  - from: /foo/bar
    to: /foo/bar
  # recursively downloads all of /foo/bar
  - from: /foo/bar
    to: /foo/bar
    directory: recursive
  # recursively downloads all of /foo/bar/baz
  - from: /foo/bar/baz
    to: /foo/bar/baz
    directory: recursive
  # extracts the contents of the archived directory /foo/bar
  # into the input folder /foo/bar
  - from: /foo/bar
    to: /foo/bar
    directory: archive
  # extracts a single file or subdir out of the archive /foo/bar
  # into the input folder /foo/bar
  - from: /foo/bar
    to: /foo/bar
    directory: archive
    extract:
      - baz
CI is free to compress and decompress in whatever manner it sees fit. I propose CI uses gzip. Compression and decompression are done in the output and input containers, respectively.
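As a rough sketch of the archive semantics, the following Python uses the standard tarfile module with gzip compression to archive a directory's contents and then extract either everything or only the entries named under extract. The function names are hypothetical, and raising an error when a requested entry is missing is an assumption rather than something the proposal specifies.

```python
import tarfile
from pathlib import Path
from typing import Iterable, Optional


def archive_output(src_dir: Path, archive_path: Path) -> None:
    """Write the contents of src_dir into one gzip-compressed tar file."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for child in sorted(src_dir.iterdir()):
            # Member names are relative to src_dir, e.g. "baz" and "baz/x.txt".
            tar.add(child, arcname=child.name)


def extract_input(
    archive_path: Path, dest_dir: Path, extract: Optional[Iterable[str]] = None
) -> None:
    """Extract the whole archive, or only the files and subdirs listed in extract."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Newer Python versions support extraction filters; use the safe one if present.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        if extract is not None:
            wanted = [entry.strip("/") for entry in extract]
            members = [
                m for m in members
                if any(m.name == w or m.name.startswith(w + "/") for w in wanted)
            ]
            if not members:
                # Assumption: asking for entries that do not exist is an error.
                raise FileNotFoundError(f"none of {wanted} found in {archive_path}")
        tar.extractall(dest_dir, members=members, **kwargs)
```

With this shape, a step that needs only baz receives only baz on disk, so nothing else from the archive ends up in a later image layer.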