CI build.yaml archives

Introduction

CI’s build.yaml currently supports job outputs and inputs. An output may refer to any directory or file in the /io filesystem. Outputs are mapped to an “output filesystem” whose root roughly corresponds to some folder in GCS. An input may refer to any directory or file in the “output filesystem,” even unmentioned sub-directories of explicitly mentioned output directories. Consider a runImage job with this file hierarchy:

/io/
  foo
  dir/
    bar
    dir2/
      baz

This is a valid outputs block:

outputs:
- from: /io/foo
  to: /foo/file
- from: /io/dir
  to: /dir

Another job which depends on it may specify this inputs block:

inputs:
- from: /foo/file
  to: /io/foo/file
- from: /dir/dir2
  to: /io/dir2
- from: /dir/bar
  to: /io/dir/bar
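
The rule that an input may name any path at or beneath an explicitly mentioned output can be sketched in Python. This is only an illustration under assumptions: the function name input_is_satisfied is hypothetical and is not part of CI, and the check looks only at path prefixes, not at whether the path really exists in GCS.

```python
from pathlib import PurePosixPath
from typing import Iterable


def input_is_satisfied(input_from: str, output_tos: Iterable[str]) -> bool:
    """Hypothetical helper: True if input_from equals, or lies beneath,
    one of the `to` paths declared by the upstream job's outputs."""
    wanted = PurePosixPath(input_from)
    for to in output_tos:
        root = PurePosixPath(to)
        # `parents` lists every ancestor of `wanted`, so this is a
        # component-wise prefix test ("/dir" does not match "/directory").
        if wanted == root or root in wanted.parents:
            return True
    return False


# The `to` paths of the outputs block above.
declared = ["/foo/file", "/dir"]
for path in ["/foo/file", "/dir/dir2", "/dir/bar", "/other"]:
    print(path, input_is_satisfied(path, declared))
```

All three inputs from the example resolve because /dir/dir2 and /dir/bar sit beneath the declared output /dir, even though neither is mentioned explicitly.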

This design has a major performance problem: when a hierarchy contains many paths, gsutil takes orders of magnitude longer to copy it recursively than it takes to tar the hierarchy, copy the single tar file, and untar it on the receiver.

Users can currently work around this issue by explicitly running tar and untar in the runImage steps that respectively produce and consume the files. Unfortunately, the workaround breaks down for buildImage steps. If a buildImage step depends on only a sub-directory of a tar’ed directory, it must still COPY the whole tar into the image, and because every Docker COPY adds a layer, the image grows by at least the size of the entire tar file.

Proposal

  • Extend the inputs and outputs syntax and functionality to enable tar’ing files.
  • Extend the inputs and outputs syntax and functionality to enable extracting part or all of a tar.
  • Force users to explicitly choose whether to tar or recursively copy a directory, thus avoiding accidentally terrible performance.
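
The third point could be enforced with a check along these lines. This is a minimal sketch, not CI's actual validation: the function name check_output_entry and the error messages are hypothetical, and it assumes an entry is a dict with the from, to, and optional directory keys used in this document.

```python
import os


def check_output_entry(entry: dict) -> None:
    """Hypothetical check: reject a directory output that names no mode."""
    mode = entry.get("directory")
    if mode not in (None, "recursive", "archive"):
        raise ValueError(f"unknown directory mode: {mode!r}")
    # A directory without an explicit mode is an error, so nobody gets a
    # slow recursive copy by accident.
    if mode is None and os.path.isdir(entry["from"]):
        raise ValueError(
            f"{entry['from']} is a directory; "
            "set `directory: recursive` or `directory: archive`"
        )
```

A plain file with no directory key passes, while a directory must name one of the two modes.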

Syntax & Semantics

outputs:
# copies a file, errors if /foo/bar is a dir
- from: /foo/bar
  to: /foo/bar
# recursively copy the contents of /foo/bar to the output folder /foo/bar
- from: /foo/bar
  to: /foo/bar
  directory: recursive
# copy an "archive" of the contents of /foo/bar to the output file /foo/bar
- from: /foo/bar
  to: /foo/bar
  directory: archive
inputs:
# copies a file, errors if /foo/bar is a dir
- from: /foo/bar
  to: /foo/bar
# recursively download all of /foo/bar
- from: /foo/bar
  to: /foo/bar
  directory: recursive
# recursively download all of /foo/bar/baz
- from: /foo/bar/baz
  to: /foo/bar/baz
  directory: recursive
# extract the contents of the archived directory /foo/bar
# into the input folder /foo/bar
- from: /foo/bar
  to: /foo/bar
  directory: archive
# extract only the listed files or subdirs (here, baz) from the
# archive /foo/bar into the input folder /foo/bar
- from: /foo/bar
  to: /foo/bar
  directory: archive
  extract:
    - baz

CI is free to compress and decompress in whatever manner it sees fit. I propose CI uses gzip. Compression and decompression are done in the output and input containers, respectively.
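
The archive and extract semantics above might be sketched as follows with Python's standard tarfile module. This is an illustration under stated assumptions, not the proposed implementation: the function names archive_directory and extract_archive are hypothetical, the archive is assumed to be a gzip-compressed tar, and entries in extract are assumed to be paths relative to the archive root.

```python
import os
import tarfile
from typing import List, Optional


def archive_directory(src_dir: str, archive_path: str) -> None:
    """Write the contents of src_dir (not src_dir itself) to a .tar.gz file."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(src_dir)):
            # arcname keeps member names relative to the archive root.
            tar.add(os.path.join(src_dir, name), arcname=name)


def extract_archive(archive_path: str, dest_dir: str,
                    extract: Optional[List[str]] = None) -> None:
    """Extract everything, or only the files and subdirs named in `extract`."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        if extract is not None:
            wanted = [w.strip("/") for w in extract]
            members = [
                m for m in members
                if any(m.name == w or m.name.startswith(w + "/") for w in wanted)
            ]
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, members=members, **kwargs)
```

With extract set to ["baz"], only baz and anything beneath it are written to the destination, which is what lets a buildImage step avoid pulling in the rest of the archive.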