Apache Drill / DRILL-6143

Make Fragment Runner's RPC Timeout a SystemOption


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13.0
    • Fix Version/s: 1.13.0
    • None

    Description

      Queries fail sporadically on some clusters with the following error:

      oadd.org.apache.drill.common.exceptions.UserRemoteException: CONNECTION ERROR: Exceeded timeout (25000) while waiting send intermediate work fragments to remote nodes. Sent 5 and only heard response back from 4 nodes.
      

      This error happens because FragmentsRunner has a hardcoded timeout, RPC_WAIT_IN_MSECS_PER_FRAGMENT, which is set to 5 seconds. Increasing the timeout to 10 seconds resolved the sporadic failures that were observed. The default should be raised to 10 seconds, and the timeout should also be configurable via the SystemOptionManager.
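
      The proposed change could be sketched roughly as follows. This is a minimal, standalone illustration, not Drill's actual code: the option name, the class, and the plain Map standing in for the SystemOptionManager are all hypothetical. It also assumes the total wait is the per-fragment timeout multiplied by the number of fragments sent, which is consistent with the 25000 ms reported above for 5 fragments at 5 seconds each.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class FragmentTimeoutSketch {
  // Hypothetical option name, for illustration only; not Drill's actual option key.
  static final String OPTION_NAME = "example.fragment_rpc_timeout_ms";

  // Default of 10 seconds, the value proposed in the issue.
  static final long DEFAULT_PER_FRAGMENT_MS = 10_000L;

  // Reads the per-fragment timeout from the options map (a stand-in for a
  // system option lookup), falling back to the default when it is unset.
  static long perFragmentTimeoutMs(Map<String, Long> options) {
    return options.getOrDefault(OPTION_NAME, DEFAULT_PER_FRAGMENT_MS);
  }

  // Assumption: total wait = per-fragment timeout * number of fragments sent.
  static long totalTimeoutMs(long perFragmentMs, int numFragments) {
    return perFragmentMs * numFragments;
  }

  // Waits for all acknowledgements; returns false on timeout or interruption.
  static boolean awaitAcks(CountDownLatch acks, long perFragmentMs, int numFragments) {
    try {
      return acks.await(totalTimeoutMs(perFragmentMs, numFragments), TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }

  public static void main(String[] args) {
    Map<String, Long> options = new HashMap<>();
    options.put(OPTION_NAME, 50L); // short value so the demo finishes quickly

    int numFragments = 5;
    CountDownLatch acks = new CountDownLatch(numFragments);
    // Simulate four of the five remote nodes responding.
    for (int i = 0; i < numFragments - 1; i++) {
      acks.countDown();
    }

    boolean ok = awaitAcks(acks, perFragmentTimeoutMs(options), numFragments);
    System.out.println("all acknowledged: " + ok
        + ", heard from " + (numFragments - acks.getCount()) + " of " + numFragments);
  }
}
```

      Reading the value at query time rather than from a compile-time constant is what would let an operator raise the timeout on a slow cluster without rebuilding.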


          People

            Assignee: Timothy Farkas (timothyfarkas)
            Reporter: Timothy Farkas (timothyfarkas)
            Reviewer: Boaz Ben-Zvi
            Votes: 0
            Watchers: 4
