Secret Handling

cc: @cseed @akotlar @jigold

We need to protect the use of secrets in the vdc. There is a plethora of secrets in the default namespace. There are several in batch-pods. There is one in test.

We also need to protect the use of k8s service accounts in the vdc. There are several in the default namespace. There are three in batch-pods. There is one in test.

We also need a system to generate secrets from some root credentials (e.g. a hail team member’s broad login credentials).

Currently batch allows anyone to mount any secret and use any service account in the batch-pods namespace (or test namespace, for PRs that are testing batch or ci). Batch runs in two contexts:

  • main batch instance in default (creating pods in batch-pods)
  • PR batch instance in batch-pods (creating pods in test)

Batch clients come (or will come) from five places:

  • ci in default
  • pipeline from a notebook (so, protected by an authentication check)
  • PR batch-tests in batch-pods
  • PR ci in batch-pods
  • PR pipelines in batch-pods

We must prevent a PR from leaking a secret in logs or by sending it to the public internet. We currently ignore malicious notebook pipeline users (we assume Konrad and Liam are careful and good actors).

A couple of concrete proposals:

  • pods in batch-pods should not be permitted to talk to the default namespace’s batch (ci & pipeline already spin up a batch instance which creates pods in test)
  • CI should only test PRs created by whitelisted users (these people: https://github.com/orgs/hail-is/people)
  • A PR created by a non-whitelisted user will only be tested after a hail team member approves it
  • Hail team reviewers must verify a third-party PR does not insecurely handle secrets.

I think when we have untrusted notebook pipeline users, we will need batch to authorize access to service accounts and secrets.

Regenerating secrets is also important, but I haven’t put much thought into that yet.

I think we may also benefit from some secret naming conventions. For GCP service account keys: gcp-sa-key-SA_SHORT_NAME. These secrets should contain exactly one file called key.json. Here’s an example of creating a new service account and associated GKE secret:

SA_NAME=batch-volume-tester
gcloud config set project hail-vdc
# create the service account and a JSON key for it
gcloud iam service-accounts create ${SA_NAME}
gcloud iam service-accounts keys create key.json \
    --iam-account=${SA_NAME}@hail-vdc.iam.gserviceaccount.com
# store the key as a k8s secret, following the gcp-sa-key-* naming convention
kubectl create secret generic \
    gcp-sa-key-${SA_NAME} \
    --namespace=test \
    --from-file=key.json
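
For completeness, here’s a minimal sketch of how a pod could consume a secret that follows this convention. The pod name, image, and mount path are made up for illustration; the only thing the sketch relies on is that the secret holds a single key.json file.

kubectl apply --namespace=test -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gcp-sa-key-demo
spec:
  containers:
  - name: main
    image: google/cloud-sdk:slim
    # activate the mounted key (pointing GOOGLE_APPLICATION_CREDENTIALS at it also works)
    command: ["gcloud", "auth", "activate-service-account", "--key-file=/secrets/key.json"]
    volumeMounts:
    - name: gsa-key
      mountPath: /secrets
      readOnly: true
  volumes:
  - name: gsa-key
    secret:
      secretName: gcp-sa-key-batch-volume-tester
EOF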

I wrapped up key creation and deletion into a couple scripts: https://github.com/hail-is/hail/pull/5569.

I think there are a few issues here we can tease apart. Some unorganized comments:

  • Have a look at the security section of my draft document “Hail as a Service product plan.” I address some of these issues.
  • In addition to associating a GCP service account with each user, we should associate a k8s service account.
  • Batch should only accept authenticated requests via gateway. I think this unifies the source of batch clients.
  • Notebook should be segregated from the “internal” (authenticated) part of the cluster with k8s NetworkPolicies. All communication routes should be whitelisted. Notebook pipeline users have to go through gateway.
  • ci needs an account for submitting batch jobs.
  • All pods spawned by a user should use the user’s service account.
  • Then users should only be able to mount secrets they have permissions for (rough sketch after this list).
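
To make the last two bullets concrete, here is a rough sketch of per-user k8s service accounts with secret access scoped by RBAC. The user name and secret name are made up, and note that RBAC only gates API reads of secrets; batch itself would still have to check which secrets a job may mount.

USER=alice
# one k8s service account per user
kubectl create serviceaccount user-${USER} --namespace=batch-pods
# a Role that can only read that user's secrets...
kubectl create role user-${USER}-secrets \
    --namespace=batch-pods \
    --verb=get \
    --resource=secrets \
    --resource-name=gcp-sa-key-${USER}
# ...bound to the user's service account
kubectl create rolebinding user-${USER}-secrets \
    --namespace=batch-pods \
    --role=user-${USER}-secrets \
    --serviceaccount=batch-pods:user-${USER}

Pods spawned for that user would then set serviceAccountName: user-alice.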

“pods in batch-pods should not be permitted to talk to the default namespace’s batch”

I don’t think this is necessary if batch requires authentication.

“CI should only test PRs created by whitelisted users”

This is probably a good idea in the short term, but if ci runs PRs as unauthorized pods, the PR jobs shouldn’t have access to any secrets to leak in the first place. Is there another attack I’m missing?

My understanding of our gateway & authentication strategy is that downstream services trust that requests have passed through the gateway. We can only ensure that if services are network-isolated from everything else in the cluster. That wasn’t what I was getting at here, but I think it’s an important point to keep in mind in this discussion.
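
As a concrete illustration of that isolation, something like the following NetworkPolicy would only admit traffic to batch from gateway. This is just a sketch: the app=batch and app=gateway labels are assumptions about how the pods are labeled, not our current manifests, and it also assumes the cluster’s network plugin actually enforces NetworkPolicies (on GKE that has to be enabled).

kubectl apply --namespace=default -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: batch-allow-gateway-only
spec:
  # applies to the batch pods
  podSelector:
    matchLabels:
      app: batch
  policyTypes:
  - Ingress
  ingress:
  # only gateway pods may connect
  - from:
    - podSelector:
        matchLabels:
          app: gateway
EOF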