[DRILL-6143] Make Fragment Runner's RPC Timeout a SystemOption - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.13.0
Fix Version/s: 1.13.0
Component/s: None
Labels:
- ready-to-commit

Description

Queries fail sporadically on some clusters with the following error:

oadd.org.apache.drill.common.exceptions.UserRemoteException: CONNECTION ERROR: Exceeded timeout (25000) while waiting send intermediate work fragments to remote nodes. Sent 5 and only heard response back from 4 nodes.

This error occurs because the FragmentsRunner has a hardcoded timeout, RPC_WAIT_IN_MSECS_PER_FRAGMENT, which is set to 5 seconds. Increasing the timeout to 10 seconds resolved the sporadic failures that were observed. The default should be changed to 10 seconds, and the timeout should also be configurable via the SystemOptionManager.
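The arithmetic behind the error message can be sketched as follows. This is an illustration only, not the code from the pull request: it assumes the total wait is the per-fragment timeout multiplied by the number of fragments sent, which is consistent with the 25000 ms reported for 5 fragments at 5 seconds each. The class name, the option name, and the Map standing in for the SystemOptionManager are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; none of these names come from the Drill code base.
class FragmentTimeoutSketch {
    // Hypothetical option key, used only for this example.
    static final String OPTION_NAME = "example.fragment_rpc_timeout_ms";

    // Proposed default of 10 seconds per fragment, expressed in milliseconds.
    static final long DEFAULT_TIMEOUT_MS_PER_FRAGMENT = 10_000L;

    // Reads the per-fragment timeout from an options map (a stand-in for a
    // real option manager), falling back to the default when it is unset.
    static long perFragmentTimeoutMs(Map<String, Long> options) {
        return options.getOrDefault(OPTION_NAME, DEFAULT_TIMEOUT_MS_PER_FRAGMENT);
    }

    // Assumed relationship: total wait = fragments sent * per-fragment timeout.
    static long totalTimeoutMs(int fragmentCount, long perFragmentMs) {
        return fragmentCount * perFragmentMs;
    }

    public static void main(String[] args) {
        Map<String, Long> options = new HashMap<>();

        // Old hardcoded value: 5 fragments at 5 seconds each.
        System.out.println(totalTimeoutMs(5, 5_000L));

        // Proposed default: 5 fragments at 10 seconds each.
        System.out.println(totalTimeoutMs(5, perFragmentTimeoutMs(options)));

        // An operator override takes precedence over the default.
        options.put(OPTION_NAME, 20_000L);
        System.out.println(totalTimeoutMs(5, perFragmentTimeoutMs(options)));
    }
}
```

Reading the value from an option rather than a constant lets a cluster that still sees sporadic failures raise the timeout without a rebuild.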

Issue Links

links to

GitHub Pull Request #1119

People

Assignee: Timothy Farkas

Reporter: Timothy Farkas

Reviewer: Boaz Ben-Zvi

Votes: 0

Watchers: 4

Dates

Created: 08/Feb/18 00:36

Updated: 19/Feb/18 09:37

Resolved: 19/Feb/18 09:37