[IMPALA-1599] Improve query start-up time with many fragment instances - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.0.1
Fix Version/s: Impala 2.5.0
Component/s: Distributed Exec
Labels:
- performance
- scalability

Target Version:

Impala 2.5.0

Description

We've seen cases where simple count queries over tables with a few thousand partitions are slow to start up, when the cluster is over a hundred nodes or so.

There are several problems, but here's the most significant: Impala in version 2.3.0 and earlier starts its fragments ‘top-down’ - i.e. a fragment is started before all fragments that send to it are. This means starting the coordinator fragment, then all the fragments at depth 1 in the plan tree that send to the coordinator fragment, then all the fragments at depth 2 and so on. Otherwise a fragment that starts before its recipients do will fail as soon as it transmits data, because the receiving Impala server does not have knowledge of the receiving fragment and returns an error.

This means that the coordinator performs N rounds of fragment start-up, where N is the number of plan fragments in the plan tree (because even within a level, fragment dispatch is serialised left-to-right). Queries often do not start doing useful work (i.e. scans or other leaf operators) until the 2nd or later round of RPCs. Parallelism is lost during RPC dispatch as well, as skew in large clusters usually means that the latency of each round is dominated by one or two stragglers, during which time the coordinator is effectively paused (and could be usefully sending more fragment exec RPCs).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

image.png
02/Dec/15 23:19
74 kB
Henry Robinson

Issue Links

blocks

IMPALA-2550 Switch to per-query exec rpc

Resolved

breaks

IMPALA-5199 Impala may hang on empty row batch exchange

Resolved

IMPALA-5910 Data stream sender timeout causes query hang when 0 rows sent through channel

Resolved

is related to

IMPALA-1744 Improve query start-up efficiency.

Resolved

relates to

IMPALA-2684 Fragment start latencies reported in wrong unit

Resolved

Activity

People

Assignee:: Henry Robinson

Reporter:: Henry Robinson

Votes:: 1 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 11/Dec/14 01:01

Updated:: 07/Sep/17 23:40

Resolved:: 03/Feb/16 04:55