  1. Traffic Server
  2. TS-3226

SSL data not read from the socket sometimes causing transactions to timeout

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.1.1
    • Fix Version/s: 5.2.0
    • Component/s: SSL
    • Labels: None

    Description

      We have had a long-standing problem where some of our origins complained of receiving POST requests with a non-zero Content-Length header but no body (or sometimes a partial body). Because of the way our network is set up, with multiple hops along the way, the problem was not easy to isolate: the POST body could be lost anywhere along the path (e.g. client, DNS, routers/VIPs, edge, data center). After a lot of debugging, and with the help of some custom-built wire traces for SSL, we managed to isolate the problem to the ATS hosts running on our edge layer. The wire traces showed that the POST body arrives correctly but just sits in the socket and is never read by the POST UA tunnel producer.

      After further investigation, it seems that the producer issues the correct do_io_read for the required number of bytes, but there appears to be a bug in SSLNetVConnection::net_read_io, where ntodo is calculated before the mutex on the read vio is acquired.

      https://github.com/apache/trafficserver/blob/master/iocore/net/SSLNetVConnection.cc#L391

      Instrumenting the code with further debug traces showed that, in the failed transactions, ntodo is 0 when computed before the mutex is acquired, whereas (s->vio.nbytes - s->vio.ndone) is non-zero after the mutex is acquired. I do not fully understand how nbytes on the read vio can differ before and after acquiring the mutex, but moving the ntodo calculation after the mutex acquisition seems to have resolved the problem. Note that this is how it is done in the corresponding function, read_from_net, in UnixNetVConnection.

      After talking to amc on IRC, it seems the mutex is needed because SSLNetVConnection::net_read_io can also be triggered by incoming socket data before UnixNetVConnection::do_io_read triggers it, and that could mess up nbytes/ndone in the read vio.
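
      The ordering problem described above can be illustrated with a minimal, self-contained sketch. This is not the ATS code: ReadVio, do_io_read_sketch, ntodo_before_lock, ntodo_after_lock and the `between` hook are hypothetical names invented for illustration, and std::mutex stands in for the VIO mutex. The hook simulates another thread posting a read in the window between the ntodo snapshot and the lock acquisition.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical stand-in for a read VIO; not the real ATS VIO class.
struct ReadVio {
  std::mutex mutex;
  int64_t nbytes = 0; // total bytes requested by the consumer
  int64_t ndone  = 0; // bytes already delivered to the consumer
};

// Stand-in for a do_io_read-style call: posts a read for `nbytes` bytes
// while holding the VIO mutex.
void do_io_read_sketch(ReadVio &vio, int64_t nbytes) {
  std::lock_guard<std::mutex> lock(vio.mutex);
  vio.nbytes = nbytes;
  vio.ndone  = 0;
}

// Problematic ordering: ntodo is snapshotted BEFORE the mutex is held.
// `between` is a test hook (an assumption of this sketch) that stands in
// for another thread running in the unprotected window.
int64_t ntodo_before_lock(ReadVio &vio, const std::function<void()> &between) {
  int64_t ntodo = vio.nbytes - vio.ndone; // unsynchronized, possibly stale
  between();                              // e.g. the producer posts its read here
  std::lock_guard<std::mutex> lock(vio.mutex);
  return ntodo;                           // still the stale snapshot
}

// Corrected ordering: take the mutex first, then compute ntodo from the
// current state of the VIO.
int64_t ntodo_after_lock(ReadVio &vio, const std::function<void()> &between) {
  between();
  std::lock_guard<std::mutex> lock(vio.mutex);
  return vio.nbytes - vio.ndone;          // reflects the posted read
}
```

      In this sketch, if the hook posts a 1024-byte read, ntodo_before_lock still reports 0 (so a handler using it would conclude there is nothing to read and leave the data in the socket), while ntodo_after_lock reports 1024.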

          People

            sudheerv Sudheer Vinukonda
            sudheerv Sudheer Vinukonda