If batch2-agents do not have external IPs, then they need an alternative
mechanism to speak to the Internet. Network Address Translation (NAT) is one
such mechanism. Google provides a scalable NAT called Cloud NAT.
Reading:
- IP Addresses, Ports, and unique connection limits. https://cloud.google.com/nat/docs/ports-and-addresses
- Cloud NAT Overview https://cloud.google.com/nat/docs/overview
Connection Limits
A Cloud NAT has a configurable or auto-managed number of external IP
addresses. Each IP address has 64,512 ports. Each VM is assigned some set of
(IP, port) tuples. The set cannot be larger than 1024. A VM can open no more
connections to a destination (an (IP, port, protocol) triple) than it has been
assigned (IP, port) tuples. For example, if a VM has 10 (IP, port) tuples
assigned to it, it can open no more than 10 connections to 1.1.1.1; however it
can have 10 connections to 1.1.1.1 and 10 more connections to 1.1.1.2.
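The per-destination limit can be illustrated with a toy model. This is only a sketch of the rule described above, not how Cloud NAT is implemented; the class `NatPortAllocation` and its method `try_connect` are hypothetical names invented for this example.

```python
from collections import Counter


class NatPortAllocation:
    """Toy model of one VM's NAT allocation (hypothetical, for illustration)."""

    def __init__(self, n_tuples: int):
        # Number of (IP, port) tuples assigned to the VM.
        self.n_tuples = n_tuples
        # Open connections per destination (IP, port, protocol) triple.
        self.open_connections = Counter()

    def try_connect(self, destination: tuple) -> bool:
        """Open a connection if a tuple is still free for this destination."""
        if self.open_connections[destination] >= self.n_tuples:
            return False
        self.open_connections[destination] += 1
        return True


vm = NatPortAllocation(n_tuples=10)
dest_a = ("1.1.1.1", 443, "tcp")
dest_b = ("1.1.1.2", 443, "tcp")

# Eleven attempts to the first destination: the eleventh is refused.
results_a = [vm.try_connect(dest_a) for _ in range(11)]
# Ten attempts to a second destination still succeed.
results_b = [vm.try_connect(dest_b) for _ in range(10)]
```

The limit is tracked per destination, which is why filling the allocation for 1.1.1.1 does not affect connections to 1.1.1.2.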
On a 16-core batch2-agent, it is not unreasonable to have 160 distinct Docker
containers. Each Docker container might reasonably try to talk to
http://example.com. In fact, they may try to open many connections to a given
IP. If we choose the maximum allocation of 1024 (IP, port) tuples, then each
Docker container can open 6 concurrent connections to the same destination
(1024 / 160 = 6.4, rounded down). With the maximum allocation of 1024, we can
NAT 63 batch2-agents (64,512 / 1024 = 63) through 1 external NAT IP address.
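The arithmetic behind these figures can be checked with a few lines of Python. The constant names are made up for this sketch, and it assumes the tuples are shared evenly among the containers.

```python
PORTS_PER_NAT_IP = 64_512       # ports available on each NAT IP address
MAX_TUPLES_PER_VM = 1024        # maximum (IP, port) tuples per VM
CONTAINERS_PER_AGENT = 160      # containers on a 16-core batch2-agent

# Concurrent connections each container can hold to a single destination,
# assuming an even split of the VM's tuples.
connections_per_container = MAX_TUPLES_PER_VM // CONTAINERS_PER_AGENT

# batch2-agents that fit behind one NAT IP address at the maximum allocation.
agents_per_nat_ip = PORTS_PER_NAT_IP // MAX_TUPLES_PER_VM

print(connections_per_container)  # 6
print(agents_per_nat_ip)          # 63
```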
Timeouts
There are various configurable timeouts on connections through the Cloud
NAT. https://cloud.google.com/nat/docs/overview#specs-timeouts
Connection Reuse
Cloud NAT severely limits the frequency with which a VM can open and close a
connection to the same destination.
https://cloud.google.com/nat/docs/ports-and-addresses#ports-reuse-tcp
There is a two-minute (non-configurable) refractory period before a VM can reuse
an (IP, port) tuple to speak to the same destination triple. In our case of the
16-core machines, the maximum allocation of 1024 tuples shared among 160 Docker
containers gives each container about 6.4 tuples, so a container can open a new
connection to a given destination roughly once every twenty seconds
(120 s / 6.4 ≈ 19 s). Ergo, a batch2-agent with a heavy container load may
provide poor network performance to the containers.
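The twenty-second estimate can be reproduced as follows. This is a back-of-the-envelope sketch that assumes the VM's tuples are shared evenly among containers and that every connection goes to the same destination; the function name `seconds_between_connections` is hypothetical.

```python
def seconds_between_connections(
    tuples_per_vm: int, containers: int, refractory_seconds: int
) -> float:
    """Average wait between new connections to one destination, per container.

    Each tuple can start one connection to a given destination per
    refractory period, so a container holding (tuples_per_vm / containers)
    tuples can sustain one new connection every
    refractory_seconds / (tuples_per_vm / containers) seconds.
    """
    return refractory_seconds * containers / tuples_per_vm


# 1024 tuples, 160 containers, two-minute refractory period.
wait = seconds_between_connections(1024, 160, 120)
print(wait)  # 18.75
```

Halving the number of containers per VM halves the wait, since each container then holds twice as many tuples.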