In Airflow versions 1.10.2 and 1.10.4, we are running into a severe limit on the number of concurrent tasks.
We observe that when more than 8 tasks are started and executed in parallel, the majority of them fail with the error "Can't connect to MySQL server" and error code 2006(99). This error code boils down to "Cannot bind socket to resource", which is why we started looking into the TCP connections of our Airflow host (a single node that hosts the webserver, scheduler and worker).
When the 8 tasks are running simultaneously, we observe more than 15,000 connections in TIME_WAIT while fewer than 50 are established. Given that the number of available ports is somewhat smaller than 30,000, this large number of blocked but unused TCP connections would explain why further task executions fail.
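For anyone who wants to reproduce these counts, here is a minimal sketch of how they can be collected, assuming a Linux host where `/proc/net/tcp` and `/proc/sys/net/ipv4/ip_local_port_range` are readable. The helper names `count_tcp_states` and `ephemeral_port_count` are made up for illustration and are not part of Airflow or any library. IPv6 sockets live in `/proc/net/tcp6` and would have to be counted separately.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        state_code = fields[3]  # columns: sl, local_address, rem_address, st
        counts[TCP_STATES.get(state_code, state_code)] += 1
    return counts


def ephemeral_port_count(port_range_text):
    """Number of local ports available for outgoing connections.

    Expects the content of /proc/sys/net/ipv4/ip_local_port_range,
    i.e. two integers (low and high, both inclusive).
    """
    low, high = (int(part) for part in port_range_text.split())
    return high - low + 1
```

Feeding the contents of the two files into these functions while the tasks are running gives the TIME_WAIT and ESTABLISHED counts next to the size of the local port range, which makes it easy to see how close the host is to running out of ports.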
Can anyone explain how so many open connections, which block ports/sockets, come about? Given that we have connection pooling enabled, we cannot explain this yet.
Your help is very much appreciated, as this issue strongly limits our current performance!