  Apache Drill / DRILL-7789

Exchanges are slow on large systems & queries


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      A user with a moderate-sized cluster and a moderate-sized query has experienced extreme slowness in exchanges: up to 11/12 of the elapsed time is spent waiting in one query, and 3/4 in another. We suspect that exchanges are somehow serializing across the cluster.

      Cluster:

      • Drill 1.16 (MapR version)
      • MapR-FS
      • Data stored in an 8 GB Parquet file; unpacks to about 80 GB, 20B records
      • 4 Drillbits
      • Each node has 56 cores, 400 GB of memory
      • Drill queries run with 40 fragments per node (70% of the cores) and 80 GB of memory (see the option sketch below)
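
      For reference, a hypothetical sketch of the session options that would produce this configuration. The option names are standard Drill options; the values are inferred from the numbers above, not copied from the user's actual setup:

        -- Roughly 70% of 56 cores = 40 fragments per node
        ALTER SESSION SET `planner.width.max_per_node` = 40;
        -- 80 GB of query memory per node, expressed in bytes
        ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 85899345920;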

      The query is, essentially:

      Parquet writer
      - Hash Join
        - Scan
        - Window, Sort
        - Window, Sort
        - Hash Join
          - Scan
          - Scan
      

      In the above, each line represents a fragment boundary. The plan includes mux exchanges between the two "lower" scans and the hash join.
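
      The SQL shape is roughly as follows. This is a minimal sketch only; every table, path, and column name here is hypothetical:

        CREATE TABLE dfs.tmp.`result` AS
        SELECT a.key, w.w1, w.w2
        FROM dfs.`/data/a.parquet` a                   -- upper scan
        JOIN (
          SELECT t.key, t.w1,
                 SUM(t.v) OVER (PARTITION BY t.p2 ORDER BY t.o2) AS w2       -- second window (sort on p2, o2)
          FROM (
            SELECT b.key, b.p2, b.o2, b.v,
                   ROW_NUMBER() OVER (PARTITION BY b.p1 ORDER BY b.o1) AS w1 -- first window (sort on p1, o1)
            FROM dfs.`/data/b.parquet` b               -- lower scan
            JOIN dfs.`/data/c.parquet` c               -- lower scan
              ON b.id = c.id                           -- lower hash join
          ) t
        ) w ON a.key = w.key;                          -- upper hash join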

      The total query time is 6 hours. Of that, only about 30 minutes is spent working; the remaining 5.5 hours is spent waiting. (The 30 minutes is obtained by summing the "Avg Runtime" column in the profile.)

      Checking resource usage with "top" showed only a small amount of CPU in use: with 40 fragments per node we should have seen about 4000% (40 busy cores), but we actually saw only around 300-400%. This again indicates that the query spent most of its time doing nothing: not using CPU.

      In particular, the sender spends about 5 hours waiting for the receiver, which in turn spends about 5 hours waiting for the sender. This pattern occurs in every exchange on the "main" data path (the 20B records).

      As an experiment, the user disabled mux exchanges. The system became overloaded at 40 fragments per node, so parallelism was reduced to 20. Now the partition sender waited for the unordered receiver, and vice versa.
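
      For reference, this experiment corresponds to session settings along these lines (standard Drill options; shown here as a hypothetical sketch):

        -- Turn off mux exchanges for the session
        ALTER SESSION SET `planner.enable_mux_exchange` = false;
        -- Reduce parallelism from 40 to 20 fragments per node
        ALTER SESSION SET `planner.width.max_per_node` = 20;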

      The original query incurred spilling. We hypothesized that the spilling caused delays which somehow rippled through the DAG. However, the user revised the query to eliminate spilling and reduced it to just the "bottom" hash join. The query ran for an hour, of which 3/4 of the time was again spent with senders and receivers waiting for each other.

      We have eliminated a number of potential causes:

      • System has sufficient memory
      • MapR-FS file system has plenty of spindles and ample I/O capacity.
      • Network is fast
      • No other load on the nodes
      • Query was simplified to the simplest possible form: a single join (with exchanges)
      • If the query is simplified further (scan and write to Parquet, no join; sketched below), it completes in just a few minutes: about as fast as the disk I/O rate.
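
      That simplest variant is essentially a straight Parquet-to-Parquet copy, along these lines (hypothetical paths, reusing the names from the sketch above):

        CREATE TABLE dfs.tmp.`copy` AS
        SELECT * FROM dfs.`/data/a.parquet`;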

      The query profile does not provide sufficient information to dig further: it reports aggregate wait times, but does not tell us, for example, which fragments wait on which other fragments, or for how long.

      We believe that, if the exchange delays are fixed, the query that now takes six hours should complete in under half an hour, even with shuffles, spilling, reading from Parquet, and writing to Parquet.


          People

            Assignee: Unassigned
            Reporter: Paul Rogers