Cloud NAT and removing External IPs from batch2-agents

dking · February 14, 2020, 10:03pm

If batch2-agents do not have external IPs, then they need an alternative
mechanism to speak to the Internet. Network Address Translation (NAT) is one
such mechanism. Google provides a scalable NAT called Cloud NAT.

Reading:

IP Addresses, Ports, and unique connection limits: https://cloud.google.com/nat/docs/ports-and-addresses
Cloud NAT Overview: https://cloud.google.com/nat/docs/overview

Connection Limits

A Cloud NAT has a configurable or auto-managed number of external IP
addresses. Each IP address has 64,512 ports. Each VM is assigned some set of
(IP, port) tuples. The set cannot be larger than 1024. A VM can open no more
connections to a destination (an (IP, port, protocol) triple) than it has been
assigned (IP, port) tuples. For example, if a VM has 10 (IP, port) tuples
assigned to it, it can open no more than 10 connections to 1.1.1.1; however, it
can have 10 connections to 1.1.1.1 and 10 more connections to 1.1.1.2.
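
The per-destination limit described above can be modelled with a small sketch. This is only an illustration of the rule as stated in this post, not of how Cloud NAT is implemented; the class name NatVm and the method try_open are hypothetical.

```python
from collections import Counter


class NatVm:
    """Toy model of a VM behind a NAT (hypothetical, for illustration only)."""

    def __init__(self, assigned_tuples):
        # Number of (IP, port) tuples the NAT has assigned to this VM.
        self.assigned_tuples = assigned_tuples
        # Open connection count per destination (IP, port, protocol) triple.
        self.open_by_destination = Counter()

    def try_open(self, destination):
        """Open a connection if this destination still has a free tuple."""
        if self.open_by_destination[destination] >= self.assigned_tuples:
            return False
        self.open_by_destination[destination] += 1
        return True


vm = NatVm(assigned_tuples=10)
first = ("1.1.1.1", 443, "tcp")
second = ("1.1.1.2", 443, "tcp")

opened_first = sum(vm.try_open(first) for _ in range(11))
opened_second = sum(vm.try_open(second) for _ in range(10))
print(opened_first, opened_second)
```

The eleventh attempt to the first destination is refused, while the second destination still accepts ten connections, because the limit applies per destination triple.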

On a 16 core batch2-agent, it is not unreasonable to have 160 distinct docker
containers. Each docker container might reasonably try to talk to
http://example.com. In fact, they may try to open many connections to a given
IP. If we choose the maximum allocation of 1024 (IP, port) tuples, then each
docker container can open 6 concurrent connections to the same IP address
(1024 / 160 = 6.4, rounded down). With the maximum allocation of 1024, we can
NAT 63 batch2-agents (64,512 / 1024) through 1 external NAT IP address.
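
The arithmetic behind those two numbers, as a minimal sketch. The constants come from this post; the function names are made up for illustration.

```python
PORTS_PER_NAT_IP = 64_512     # ports per NAT IP address (from this post)
MAX_TUPLES_PER_VM = 1024      # maximum (IP, port) tuples per VM (from this post)
CONTAINERS_PER_AGENT = 160    # containers on a 16 core batch2-agent (from this post)


def connections_per_container(tuples_per_vm=MAX_TUPLES_PER_VM,
                              containers=CONTAINERS_PER_AGENT):
    """Concurrent connections to one destination if tuples are shared evenly."""
    return tuples_per_vm // containers


def agents_per_nat_ip(tuples_per_vm=MAX_TUPLES_PER_VM,
                      ports_per_ip=PORTS_PER_NAT_IP):
    """How many VMs one NAT IP address can serve at this allocation."""
    return ports_per_ip // tuples_per_vm


print(connections_per_container())  # 6
print(agents_per_nat_ip())          # 63
```

A smaller per-VM allocation fits more agents behind each NAT IP but leaves fewer connections per container: for example, agents_per_nat_ip(256) is 252, while connections_per_container(256) is 1.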

Timeouts

There are various configurable timeouts on connections through the Cloud
NAT. https://cloud.google.com/nat/docs/overview#specs-timeouts

Connection Reuse

Cloud NAT severely limits the frequency with which a VM can open and close a
connection to the same destination.

https://cloud.google.com/nat/docs/ports-and-addresses#ports-reuse-tcp

There is a two minute (non-configurable) refractory period before a VM can reuse
an (IP, port) tuple to speak to the same destination triple. In our case of the
16 core machines with 160 docker containers, each container gets about 6 of the
VM's 1024 tuples, so it can open, on average, one new connection to a given
destination every twenty seconds (120 seconds / 6 tuples). Ergo, a batch2-agent
with a heavy container load may provide poor network performance to the
containers.
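
A rough sketch of that rate calculation, assuming the VM's tuples are split evenly across containers and that every connection is closed immediately after use. The function name is hypothetical.

```python
REUSE_DELAY_SECONDS = 120  # two minute refractory period (from this post)


def seconds_between_new_connections(tuples_per_vm=1024, containers=160,
                                    reuse_delay=REUSE_DELAY_SECONDS):
    """Average wait between new connections from one container to one destination."""
    tuples_per_container = tuples_per_vm // containers
    # Each tuple is usable once per reuse_delay, so the container's tuples
    # together sustain tuples_per_container new connections per reuse_delay.
    return reuse_delay / tuples_per_container


print(seconds_between_new_connections())  # 20.0
```

Reusing a long-lived connection (for example, HTTP keep-alive) avoids this limit, since the delay only applies when a tuple is reused for a new connection to the same destination.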
