Uploaded image for project: 'Traffic Server'
  1. Traffic Server
  2. TS-4801

Are we marking parents down too aggressively?

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 6.2.1, 7.0.0
    • Parent Proxy
    • None

    Description

      jrushford and I were looking at some code, and found this piece which looks suspicious:

          } else if (s->current.attempts < s->txn_conf->parent_connect_attempts) {
            s->current.attempts++;
      
            // Are we done with this particular parent?
            if ((s->current.attempts - 1) % s->http_config_param->per_parent_connect_attempts != 0) {
              // No we are not done with this parent so retry
              s->next_action = how_to_open_connection(s);
              DebugTxn("http_trans", "%s Retrying parent for attempt %d, max %" PRId64, "[handle_response_from_parent]",
                       s->current.attempts, s->http_config_param->per_parent_connect_attempts);
              return;
            } else {
              DebugTxn("http_trans", "%s %d per parent attempts exhausted", "[handle_response_from_parent]", s->current.attempts);
      
              // Only mark the parent down if we failed to connect
              //  to the parent otherwise slow origin servers cause
              //  us to mark the parent down
              if (s->current.state == CONNECTION_ERROR) {
                s->parent_params->markParentDown(&s->parent_result);
              }
              // We are done so look for another parent if any
              next_lookup = find_server_and_update_current_info(s);
            }
          } else {
            // Done trying parents... fail over to origin server if that is
            //   appropriate
            DebugTxn("http_trans", "[handle_response_from_parent] Error. No more retries.");
            s->parent_params->markParentDown(&s->parent_result);
            s->parent_result.result = PARENT_FAIL;
            next_lookup             = find_server_and_update_current_info(s);
          }
      

      Here's my thinking: It seems that if we exhaust parent_connect_attempts (which is proxy.config.http.parent_proxy.total_connect_attempts, default 4), we end up always marking whatever the current parent is as potentially down.

      This seems wrong, because it might do this even if the number of tries against that particular parent has not been exhausted (that is the proxy.config.http.parent_proxy.per_parent_connect_attempts setting, tested inside the loop).

      Looking at this, it feels that we should either remove that markParentDown() entirely, or at a minimum add the same checks as above, i.e.

      -      s->parent_params->markParentDown(&s->parent_result);
      +      if (s->current.state == CONNECTION_ERROR) {
      +        s->parent_params->markParentDown(&s->parent_result);
      +      }
      

      I'm still doing some more tests, one concern is that it's unclear if the earlier comment is correct, and that we can detect the difference between a connection failure to parent, vs an origin server response to the parent is just being slow (resulting in a timeout and no response to child).

      Attachments

        Issue Links

          Activity

            People

              zwoop Leif Hedstrom
              zwoop Leif Hedstrom
              Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 40m
                  1h 40m