
TS-2954: cache poisoning due to proxy.config.http.use_client_target_addr = 1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1.0
    • Component/s: Cache, DNS, Security, TProxy
    • Labels: None

      Description

      The current implementation of proxy.config.http.use_client_target_addr opens a very simple attack vector for cache poisoning in transparent forwarding mode.

      An attacker (or malware installed on an innocent end user's computer) puts a fake IP for a popular website like www.google.com or www.facebook.com in the hosts file on a PC behind the proxy. Once the infected PC requests the webpage in question, a cacheable fake response poisons the cache.
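
      For illustration, the attack needs nothing more than a hosts file entry on the client. A minimal sketch, using a documentation IP (203.0.113.50) as a stand-in for an attacker-controlled web server:

        # /etc/hosts on a client behind the transparent proxy
        # 203.0.113.50 is a placeholder for the attacker's server
        203.0.113.50    www.google.com

      Any cacheable response the attacker's server returns for www.google.com is then stored by the proxy and served to every other client behind it.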

      In order to prevent such scenarios (as well as some others), Squid has implemented a mechanism known as Host Header Forgery Detection.

      In short, while requesting a URL from the origin server IP hinted by the client, the proxy makes an independent DNS query in parallel to determine whether the client-supplied IP belongs to the requested domain name. In case of a discrepancy between the DNS result and the client-supplied IP, the transaction is flagged as non-cacheable to avoid possible cache poisoning, while the origin response is still served to the client.
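
      Conceptually, the check boils down to the following manual test (a shell sketch reusing the placeholder IP from above; the real implementation consults the proxy's own resolver directly):

        # Does the destination IP the client connected to appear in the
        # proxy's own resolution of the Host header?
        dig +short www.google.com A | grep -qx '203.0.113.50' \
          && echo "verified: cacheable" \
          || echo "mismatch: serve but do not cache"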

      Attachments

      1. ts-2954.patch
        11 kB
        Susan Hinrichs
      2. ts-2954-final.patch
        13 kB
        Susan Hinrichs
      3. ts-2954-take2.patch
        12 kB
        Susan Hinrichs

        Activity

        shinrich Susan Hinrichs added a comment -

        Yes, the use_client_target_addr does expose a cache poisoning attack. The Host Header Forgery Detection would solve it in some (most?) cases.

        If the client and the trafficserver have access to the same DNS information, then Host Header Forgery Detection would solve the problem. The client could specify which specific server address the name should resolve to (important if the IP address is embedded elsewhere in the protocol), and trafficserver could verify this. The cache poisoner would be caught out.

        However, if you have a scenario where the client has access to different DNS information than the trafficserver, the Host Header Forgery Detection would not help. Perhaps the client is sitting with access to a corporate DNS server that the trafficserver does not have access to.

        Perhaps a tiered solution would work? By default, have proxy.config.http.use_client_target_addr also employ Host Header Forgery Detection. Then add another option to disable the Host Header Forgery Detection in the small set of cases that have DNS mismatches between the client and the proxy.
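
        A hypothetical records.config sketch of that tiered setup (the second option name is invented here purely for illustration; it is not an existing setting):

          CONFIG proxy.config.http.use_client_target_addr INT 1
          # hypothetical knob: set to 0 only where client and proxy DNS legitimately differ
          CONFIG proxy.config.http.validate_client_target_addr INT 1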

        ngorchilov Nikolai Gorchilov added a comment - edited

        My usual scenario is TPROXY in an ISP NOC where (many) clients use different DNS servers than the cache does. In the same NOC, Akamai, Google Peering and Google Cache nodes are available.

        As a result, both proxy.config.http.use_client_target_addr modes create different critical issues:

        • proxy.config.http.use_client_target_addr = 0 causes more than 20% of requests toward Akamai and Google properties to fail due to different destination IPs used by the client and the cache. For obvious reasons, the local Akamai and Google nodes are excluded from traffic interception and redirection via the cache. When a client resolves an external CDN node, the request gets diverted to the cache. The cache does independent DNS resolution and very often tries to forward the request to a local node with the spoofed source IP of the user. Due to the exclusion from the interception rules, the response from the local node doesn't return to the cache, but goes directly to the client.
        • proxy.config.http.use_client_target_addr = 1 makes the cache vulnerable to poisoning - I have already experienced it in real life due to some malware; all major websites like Google, Facebook and YouTube were affected.

        So neither option really works for me. I need a Squid-like mode (say proxy.config.http.use_client_target_addr = 2, or a new option) which uses the client target address for contacting the origin server while local DNS resolution is used for sanity checks, marking suspicious requests non-cacheable.

        shinrich Susan Hinrichs added a comment -

        I think I understand your scenario. While not caching "suspicious" traffic will reduce your hit rate, I suppose it is better to cache some things safely rather than caching more things unsafely.

        In your environment, do you have an estimate of what percentage of your traffic would be marked as suspicious if your proposed proxy.config.http.use_client_target_addr = 2 were present?

        ngorchilov Nikolai Gorchilov added a comment -

        Susan Hinrichs, probably yes, if you mean a reduced hit rate due to a reduced chance of caching an object. Otherwise, for fresh HIT results, there shouldn't be any IP address security checks. AFAICT Squid uses EDNS jumbograms to collect as many IP addresses as possible for the specified host in order to limit the number of false positives.
        Alan M. Carroll, Thanks. I'm available for feedback and testing.

        zwoop Leif Hedstrom added a comment -

        Alan, if you don't mind, I think we should assign this bug to Susan, and also pick a target version.

        shinrich Susan Hinrichs added a comment -

        I got the basic proxy client address verification support in yesterday. I plan on doing more tests today and some debug message cleanup and hope to have a patch for others to try out later today.

        One thing I observed in my testing so far: for domains with lots of addresses (like google.com, youtube.com, and i.ytimg.com), the servers only seem to return 10 or so addresses at a time. Working against the Google public DNS server (8.8.8.8), the set of 10 would vary over a broader set of a hundred or so. So it is quite likely (and I saw this in my testing) that the client picks a valid address out of one set of addresses for google.com, but the proxy checks against another set of valid addresses for google.com and so marks the response as uncacheable.
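
        The rotation is easy to observe from a shell (a sketch; the exact answer sets returned by 8.8.8.8 vary from run to run):

          dig +short @8.8.8.8 google.com A | sort > run1
          sleep 30
          dig +short @8.8.8.8 google.com A | sort > run2
          diff run1 run2    # a non-empty diff shows the answer set rotating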

        Not sure there is much to be done here. Once an item is cached, the mismatch won't matter. But with false mismatches between the client and the proxy DNS lookups, the number of requests needed to get an item into the cache will increase.

        One could consider tracking both a validation set and a current address set in hostDB. Old address sets would be moved into the validation set and used only to validate client-specified origin server addresses.

        Looking at +edns with dig, it doesn't seem that more IPs are returned in that case either. But I only did some very basic checks. It could well be that some DNS servers do return more addresses with EDNS support. When we rework the hostDB DNS caching logic, EDNS support should also be added.
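
        For reference, this is the kind of comparison described (a sketch; the answer counts depend entirely on the server being queried):

          dig +noedns +short google.com A | wc -l                 # plain query
          dig +edns=0 +bufsize=4096 +short google.com A | wc -l   # EDNS query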

        Any other ideas or suggestions?

        amc Alan M. Carroll added a comment -

        A potential approach to the caching problem is to proxy the DNS requests so that the client and ATS are in rough synchronization. This would be a rather significant change to ATS in its current form. However, the planned HostDB upgrade would make this much simpler, as a different DNS server could be used per deployment instance. This would have two big advantages: (1) only deployments that need it would have it, and (2) a third-party recursive DNS server could be used, making it much easier to keep up with DNS-related security issues.

        shinrich Susan Hinrichs added a comment -

        Nikolai Gorchilov, could you take a look at the patch and give it a test?

        ngorchilov Nikolai Gorchilov added a comment - edited

        Susan Hinrichs, unfortunately it doesn't work for me, at least against 5.0.1 tarballs.

        proxy.config.http.use_client_target_addr = 2

        Poisoning requests for http://i.ytimg.com/vi_webp/6eKYsYUlGB8/mqdefault.webp via:

          wget -qSO/dev/null --header="Host: i.ytimg.com" http://91.239.13.61/vi_webp/6eKYsYUlGB8/mqdefault.webp

        Here's the relevant http_trans log showing how the fake response gets cached, despite the invalid (91.239.13.61) IP for i.ytimg.com:

        (http_trans) START HttpTransact::ModifyRequest
        (http_trans) [ink_cluster_time] local: 1406705552, highest_delta: 0, cluster: 1406705552
        (http_trans) END HttpTransact::ModifyRequest
        (http_trans) Checking if transaction wants to upgrade
        (http_trans) Next action SM_ACTION_API_READ_REQUEST_HDR; HttpTransact::StartRemapRequest
        (http_trans) START HttpTransact::StartRemapRequest
        (http_trans) Before Remapping:
        (http_trans) END HttpTransact::StartRemapRequest
        (http_trans) Next action SM_ACTION_API_PRE_REMAP; HttpTransact::PerformRemap
        (http_trans) Inside PerformRemap
        (http_trans) Next action SM_ACTION_REMAP_REQUEST; HttpTransact::EndRemapRequest
        (http_trans) START HttpTransact::EndRemapRequest
        (http_trans) EndRemapRequest host is i.ytimg.com
        (http_trans) After Remapping:
        (http_trans) END HttpTransact::EndRemapRequest
        (http_trans) Next action SM_ACTION_API_POST_REMAP; HttpTransact::HandleRequest
        (http_trans) START HttpTransact::HandleRequest
        (http_trans) [init_stat_vars_from_req] set req cont length to 0
        (http_trans) [is_request_valid]no request header errors
        (http_trans) [DecideCacheLookup] Will do cache lookup.
        (http_trans) Next action SM_ACTION_CACHE_LOOKUP; __null
        (http_trans) [HttpTransact::HandleCacheOpenRead]
        (http_trans) CacheOpenRead -- miss
        (http_trans) Next action SM_ACTION_DNS_LOOKUP; OSDNSLookup
        (http_trans) [HttpTransact::OSDNSLookup] This was attempt 1
        (http_trans) [OSDNSLookup] DNS lookup for O.S. successful IP: 91.239.13.61
        (http_trans) Next action SM_ACTION_API_OS_DNS; HandleCacheOpenReadMiss
        (http_trans) [HandleCacheOpenReadMiss] --- MISS
        (http_trans) [build_request] removing host name from url
        (http_trans) [build_request] request like cacheable and conditional headers removed
        (http_trans) [ink_cluster_time] local: 1406705552, highest_delta: 0, cluster: 1406705552
        (http_trans) [build_request] request_sent_time: 1406705552
        (http_trans) Next action next; __null
        (http_trans) [HttpTransact::HandleResponse]
        (http_trans) [ink_cluster_time] local: 1406705552, highest_delta: 0, cluster: 1406705552
        (http_trans) [HandleResponse] response_received_time: 1406705552
        (http_trans) [is_response_valid] No errors in response
        (http_trans) [handle_response_from_server] (hrfs)
        (http_trans) [hrfs] connection alive
        (http_trans) [handle_forward_server_connection_open] (hfsco)
        (http_trans) [hfsco] cache action: CACHE_DO_WRITE
        (http_trans) [handle_cache_operation_on_forward_server_response] (hcoofsr)
        (http_trans) [is_response_cacheable] client permits storing
        (http_trans) [is_response_cacheable] YES by default 
        (http_trans) [hcoofsr] response is cacheable
        (http_trans) [hcoofsr] response code: 200
        (http_trans) [hcoofsr] cache write
        (http_trans) [handle_content_length_header] RESPONSE cont len in hdr is 5216
        (http_trans) [Squid code generation] Hit/Miss: 49, Log: 51, Hier: 50
        

        The following line bothers me most:

        (http_trans) [OSDNSLookup] DNS lookup for O.S. successful IP: 91.239.13.61
        
        shinrich Susan Hinrichs added a comment -

        Ah, I was only considering the case of DNS games against the name in the URL, not the case of an IP in the URL. Obvious in retrospect. Will get that case dealt with today.

        ngorchilov Nikolai Gorchilov added a comment -

        As this is a TPROXY setup, there is neither a hostname nor an IP address in the requested URL.

        $ tcpflow -ci en1 net 91.239.13.61 and port 80
        tcpflow[79736]: listening on en1
        192.168.001.113.52084-091.239.013.061.00080: GET /vi_webp/6eKYsYUlGB8/mqdefault.webp HTTP/1.1
        User-Agent: Wget/1.13 (darwin11.4.0)
        Accept: */*
        Host: i.ytimg.com
        Connection: Keep-Alive
        

        As you can see, the destination IP was used by wget itself - dst is 091.239.013.061.00080

        shinrich Susan Hinrichs added a comment - edited

        The problem is that I didn't define what the new values of proxy.config.http.use_client_target_addr mean. The other problem is that my brain doesn't work so well until after the 5th cup of coffee, so I only just noticed your setting.

        proxy.config.http.use_client_target_addr = 0 - Ignore the address the client used and rely on the proxy's DNS resolution of the Host field (behavior unchanged)
        proxy.config.http.use_client_target_addr = 1 - Use the address the client used, but do not cache if it does not match one of the proxy's DNS-resolved addresses
        proxy.config.http.use_client_target_addr = 2 - The same as 1 in previous builds: use the address the client used and cache in any case, without validating against the proxy's DNS-resolved addresses
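
        With these semantics, the secure transparent setup discussed in this thread comes down to a single records.config line:

          CONFIG proxy.config.http.use_client_target_addr INT 1
          # 1 = connect to the client's destination address, but refuse to cache
          #     when it does not match the proxy's own DNS resolution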

        Originally I had proposed putting the new behavior on 2, but after talking with Alan, we decided it would be safer to change the behavior of the already-deployed setting. Folks who really want to live dangerously can change the setting to 2. By default, deployments will gain the security improvement.

        So your tests with proxy.config.http.use_client_target_addr = 2 are working as designed. Sorry for not clarifying earlier. Could you give it a try with proxy.config.http.use_client_target_addr = 1? I ran your experiment and things worked as I would expect.

        The log message "(http_trans) [OSDNSLookup] DNS lookup for O.S. successful IP: 91.239.13.61" is a bit confusing. It prints the IP address that has been calculated for the origin server, regardless of whether a DNS lookup was actually done.

        You should be seeing a log message like "(http_trans) [is_response_cacheable] Lookup not validated. Possible DNS cache poison. Don't cache"

        ngorchilov Nikolai Gorchilov added a comment -

        Oh, I see.

        It makes sense to apply the new logic by default when using the client target address, and to require the current insecure logic to be explicitly activated via the new value of 2.

        After setting proxy.config.http.use_client_target_addr = 1, it seems to be working as expected. At least in lab tests:

        (http_trans) START HttpTransact::ModifyRequest
        (http_trans) [ink_cluster_time] local: 1406732859, highest_delta: 0, cluster: 1406732859
        (http_trans) END HttpTransact::ModifyRequest
        (http_trans) Checking if transaction wants to upgrade
        (http_trans) Next action SM_ACTION_API_READ_REQUEST_HDR; HttpTransact::StartRemapRequest
        (http_trans) START HttpTransact::StartRemapRequest
        (http_trans) Before Remapping:
        (http_trans) END HttpTransact::StartRemapRequest
        (http_trans) Next action SM_ACTION_API_PRE_REMAP; HttpTransact::PerformRemap
        (http_trans) Inside PerformRemap
        (http_trans) Next action SM_ACTION_REMAP_REQUEST; HttpTransact::EndRemapRequest
        (http_trans) START HttpTransact::EndRemapRequest
        (http_trans) EndRemapRequest host is i.ytimg.com
        (http_trans) After Remapping:
        (http_trans) END HttpTransact::EndRemapRequest
        (http_trans) Next action SM_ACTION_API_POST_REMAP; HttpTransact::HandleRequest
        (http_trans) START HttpTransact::HandleRequest
        (http_trans) [init_stat_vars_from_req] set req cont length to 0
        (http_trans) [is_request_valid]no request header errors
        (http_trans) [DecideCacheLookup] Will do cache lookup.
        (http_trans) Next action SM_ACTION_CACHE_LOOKUP; __null
        (http_trans) [HttpTransact::HandleCacheOpenRead]
        (http_trans) CacheOpenRead -- miss
        (http_trans) Next action SM_ACTION_DNS_LOOKUP; OSDNSLookup
        (http_trans) [ink_cluster_time] local: 1406732859, highest_delta: 0, cluster: 1406732859
        (http_trans) [HttpTransact::OSDNSLookup] This was attempt 1
        (http_trans) [OSDNSLookup] DNS lookup for O.S. successful IP: 91.239.13.61
        (http_trans) Next action SM_ACTION_API_OS_DNS; HandleCacheOpenReadMiss
        (http_trans) [HandleCacheOpenReadMiss] --- MISS
        (http_trans) [build_request] removing host name from url
        (http_trans) [build_request] request like cacheable and conditional headers removed
        (http_trans) [ink_cluster_time] local: 1406732859, highest_delta: 0, cluster: 1406732859
        (http_trans) [build_request] request_sent_time: 1406732859
        (http_trans) Next action next; __null
        (http_trans) [HttpTransact::HandleResponse]
        (http_trans) [ink_cluster_time] local: 1406732859, highest_delta: 0, cluster: 1406732859
        (http_trans) [HandleResponse] response_received_time: 1406732859
        (http_trans) [is_response_valid] No errors in response
        (http_trans) [handle_response_from_server] (hrfs)
        (http_trans) [hrfs] connection alive
        (http_trans) [handle_forward_server_connection_open] (hfsco)
        (http_trans) [hfsco] cache action: CACHE_DO_WRITE
        (http_trans) [handle_cache_operation_on_forward_server_response] (hcoofsr)
        (http_trans) [is_response_cacheable] Lookup not validated.  Possible DNS cache poison.  Don't cache
        (http_trans) [hcoofsr] response is not cacheable
        (http_trans) [hcoofsr] response code: 200
        (http_trans) [handle_content_length_header] RESPONSE cont len in hdr is 5216
        (http_trans) [Squid code generation] Hit/Miss: 49, Log: 51, Hier: 50
        

        Planning for a production trial tomorrow. Once done, I will get back to you with results.

        Thanks!

        ngorchilov Nikolai Gorchilov added a comment - edited

        Unfortunately, the following assert emerged just seconds after deployment in production:

        FATAL: HttpTransactHeaders.cc:675: failed assert `header->presence(mask) == mask`
        /z/bin/traffic_server - STACK TRACE: 
        /z/lib/libtsutil.so.5(+0x1e837)[0x2b515c66d837]
        /z/lib/libtsutil.so.5(+0x1d51f)[0x2b515c66c51f]
        /z/bin/traffic_server[0x5d17f2]
        /z/bin/traffic_server(HttpTransact::build_request(HttpTransact::State*, HTTPHdr*, HTTPHdr*, HTTPVersion)+0xa0)[0x5b28d0]
        /z/bin/traffic_server(HttpTransact::HandleCacheOpenReadHit(HttpTransact::State*)+0x1f9)[0x5c89c9]
        /z/bin/traffic_server(HttpSM::call_transact_and_set_next_state(void (*)(HttpTransact::State*))+0x66)[0x58e3a6]
        /z/bin/traffic_server(HttpSM::state_api_callout(int, void*)+0x343)[0x599c43]
        /z/bin/traffic_server(HttpSM::set_next_state()+0x1e8)[0x5a0518]
        /z/bin/traffic_server(HttpSM::do_hostdb_lookup()+0x27a)[0x58e89a]
        /z/bin/traffic_server(HttpSM::set_next_state()+0xcf8)[0x5a1028]
        /z/bin/traffic_server(HttpSM::state_read_server_response_header(int, void*)+0x1df)[0x59b66f]
        /z/bin/traffic_server(HttpSM::main_handler(int, void*)+0xd8)[0x59add8]
        /z/bin/traffic_server[0x715245]
        /z/bin/traffic_server(NetHandler::mainNetEvent(int, Event*)+0x1ed)[0x707a1d]
        /z/bin/traffic_server(EThread::process_event(Event*, int)+0x91)[0x736361]
        /z/bin/traffic_server(EThread::execute()+0x4fc)[0x736e1c]
        /z/bin/traffic_server[0x7355fa]
        /lib64/libpthread.so.0(+0x7034)[0x2b515dc73034]
        /lib64/libc.so.6(clone+0x6d)[0x2b515e9bdb5d]
        

        Could it be related to a refresh check of an already cached result?

        shinrich Susan Hinrichs added a comment - edited

        Spent yesterday tracing through code. Finally see how this could happen. Indeed it is on the code path for updating a modified cache entry. Need to reproduce it in my environment to verify and fix.

        Not clear how much time I'll have to work on it today, but should have an updated patch available by Monday morning.

        ngorchilov Nikolai Gorchilov added a comment -

        OK. In case you need more context at the time of the assert (i.e. some debug prints), you can provide me with a patch that adds them, and I can run it once again in order to collect the information.

        shinrich Susan Hinrichs added a comment -

        Nikolai, thanks for your offer to run again with more prints. Please try the ts-2954-take2.patch.

        I've added some warnings, so additional information should show up in diags.log. I tidied up some other things, but I don't think it will affect the assert going off.

        ngorchilov Nikolai Gorchilov added a comment -

        The patched version has been deployed in production, serving roughly 1.3k reqs/sec. No manifestation of the assert in the first 15 minutes. Will keep it running for a while and report my findings later today or tomorrow.

        ngorchilov Nikolai Gorchilov added a comment -

        Almost 6 hours running in production, still no assert...

        ngorchilov Nikolai Gorchilov added a comment -

        24 hours later, still no manifestation of the `header->presence(mask) == mask` assert. Deploying at two more, smaller locations (700-800 reqs/sec each) in the hope of catching it again.

        I'm wondering what the next step should be if the problem doesn't reappear within another 24 hours. Shall we consider it fixed?

        shinrich Susan Hinrichs added a comment -

        I think we can consider it fixed. I did make a state change in the last patch as a result of a code review with Alan; I thought it had a low probability of fixing the problem. I cannot directly trace how fixing that flag would avoid triggering the assert, but the state machine is a complex beast.

        Later this morning I'll post a hopefully final patch that removes some of the debug warning checks. If you can do a final sanity check on that, then I can submit the patch to the powers that be to get merged into 5.1.

        ngorchilov Nikolai Gorchilov added a comment -

        Sure, will do a sanity check.

        shinrich Susan Hinrichs added a comment -

        Sorry for the delay. Nikolai, please give latest patch a try.

        ngorchilov Nikolai Gorchilov added a comment -

        No problem, Susan. This (hopefully) final patch has been deployed in production. Let's keep it running for another 12 to 24 hours before closing the case.

        ngorchilov Nikolai Gorchilov added a comment -

        This is to confirm that the final patch is working as expected.

        jira-bot ASF subversion and git services added a comment -

        Commit 826310ccc517350a105ec1e9f966cc0db6124937 in trafficserver's branch refs/heads/master from Susan Hinrichs
        [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=826310c ]

        TS-2954
        Document changes for use_client_target_addr

        jira-bot ASF subversion and git services added a comment -

        Commit 8bb3bc3186b53509964dd396ffb5a56bf1d67ab4 in trafficserver's branch refs/heads/master from Alan M. Carroll
        [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=8bb3bc3 ]

        TS-2954: Fix cache poisoning with use_client_target_addr. This closes #101 pull request.

        amc Alan M. Carroll added a comment -

        Pull request completed.

        jira-bot ASF subversion and git services added a comment -

        Commit 475e40cde4794f6b74c3436c6d00913804dff303 in trafficserver's branch refs/heads/master from Alan M. Carroll
        [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=475e40c ]

        TS-2954: Fix records so traffic_line works with use_client_target_addr

        jira-bot ASF subversion and git services added a comment -

        Commit 0bfee488ff7f7488eb6b75d779e01ded3528946a in trafficserver's branch refs/heads/5.1.x from Alan M. Carroll
        [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=0bfee48 ]

        TS-2954: Fix records so traffic_line works with use_client_target_addr

        jira-bot ASF subversion and git services added a comment -

        Commit 3f5a75ae87c38126a64264876a26841f5b0fca32 in trafficserver's branch refs/heads/master from James Peach
        [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=3f5a75a ]

        Add TS-2970 and TS-2954 to CHANGES


          People

          • Assignee: shinrich Susan Hinrichs
          • Reporter: ngorchilov Nikolai Gorchilov
          • Votes: 1
          • Watchers: 6
