  Apache Drill / DRILL-6143

Make Fragment Runner's RPC Timeout a SystemOption

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13.0
    • Fix Version/s: 1.13.0
    • Component/s: None
    • Labels:

      Description

      On some clusters, queries fail sporadically with the following error:

      oadd.org.apache.drill.common.exceptions.UserRemoteException: CONNECTION ERROR: Exceeded timeout (25000) while waiting send intermediate work fragments to remote nodes. Sent 5 and only heard response back from 4 nodes.
      

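      This error occurs because FragmentsRunner uses a hardcoded per-fragment timeout, RPC_WAIT_IN_MSECS_PER_FRAGMENT, set to 5 seconds. The overall wait is this value multiplied by the number of fragments sent, which is consistent with the error above: 5 fragments × 5,000 ms = 25,000 ms. Increasing the timeout to 10 seconds resolved the sporadic failures that were observed. The default should therefore be raised to 10 seconds, and the timeout should also be configurable via the SystemOptionManager.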
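      The issue itself contains no code. As a rough illustration only, the sketch below shows one way a per-fragment timeout could be read from an option store with a 10-second default instead of being hardcoded. The option key, SimpleOptionStore, and totalTimeoutMs are hypothetical names, not Drill's actual SystemOptionManager API or the code of the actual fix, and the linear scaling by fragment count is an assumption based on the numbers in the error message.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: derive the fragment-submission RPC timeout from an
 * option instead of a hardcoded constant. The option key and
 * SimpleOptionStore are illustrative assumptions, not Drill's API.
 */
public class FragmentRpcTimeoutSketch {

  // Hypothetical option key; Drill's real option name may differ.
  static final String PER_FRAGMENT_TIMEOUT_OPTION =
      "example.fragment_runner.rpc_timeout_ms_per_fragment";

  // Default proposed in the issue: 10 seconds per fragment (previously hardcoded at 5 seconds).
  static final long DEFAULT_PER_FRAGMENT_TIMEOUT_MS = 10_000L;

  /** Minimal stand-in for a system option manager: option names mapped to long values. */
  static class SimpleOptionStore {
    private final Map<String, Long> values = new HashMap<>();

    void setLong(String name, long value) {
      values.put(name, value);
    }

    long getLong(String name, long defaultValue) {
      // Returns the configured value, or the default when the option is unset.
      return values.getOrDefault(name, defaultValue);
    }
  }

  /**
   * Total time to wait for acknowledgements, assuming the budget scales
   * linearly with the number of intermediate fragments sent.
   */
  static long totalTimeoutMs(SimpleOptionStore options, int fragmentsSent) {
    long perFragment =
        options.getLong(PER_FRAGMENT_TIMEOUT_OPTION, DEFAULT_PER_FRAGMENT_TIMEOUT_MS);
    if (perFragment <= 0) {
      throw new IllegalArgumentException("per-fragment timeout must be positive: " + perFragment);
    }
    return perFragment * fragmentsSent;
  }

  public static void main(String[] args) {
    // Old behavior reproduced via the option: 5 s per fragment, 5 fragments.
    SimpleOptionStore oldSetting = new SimpleOptionStore();
    oldSetting.setLong(PER_FRAGMENT_TIMEOUT_OPTION, 5_000L);
    System.out.println(totalTimeoutMs(oldSetting, 5)); // 25000

    // Option left unset: falls back to the proposed 10 s default.
    SimpleOptionStore defaults = new SimpleOptionStore();
    System.out.println(totalTimeoutMs(defaults, 5)); // 50000
  }
}
```

      Keeping the value per fragment, rather than as a single total, preserves the existing behavior of giving larger queries a proportionally longer wait while letting operators tune the base value on slow clusters.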

    People

    • Assignee: Timothy Farkas (timothyfarkas)
    • Reporter: Timothy Farkas (timothyfarkas)
    • Reviewer: Boaz Ben-Zvi
    • Votes: 0
    • Watchers: 4
