  Apache Drill / DRILL-7789

Exchanges are slow on large systems & queries


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      A user with a moderate-sized cluster and a moderate-sized query has experienced extreme slowness in exchanges: up to 11/12 of the elapsed time is spent waiting in one query, and 3/4 in another. We suspect that exchanges are somehow serializing across the cluster.

      Cluster:

      • Drill 1.16 (MapR version)
      • MapR-FS
      • Data stored in an 8 GB Parquet file; unpacks to about 80 GB, 20B records
      • 4 Drillbits
      • Each node has 56 cores, 400 GB of memory
      • Drill queries run with 40 fragments per node (70% of the cores) and 80 GB of memory (see the option sketch below)
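
      For reference, a hypothetical sketch of the session options that would produce this configuration. The option names are standard Drill options; the values are inferred from the numbers above, not copied from the user's actual setup:

        -- Roughly 70% of 56 cores = 40 fragments per node
        ALTER SESSION SET `planner.width.max_per_node` = 40;
        -- 80 GB of query memory per node, expressed in bytes
        ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 85899345920;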

      The query is, essentially:

      Parquet writer
      - Hash Join
        - Scan
        - Window, Sort
        - Window, Sort
        - Hash Join
          - Scan
          - Scan
      

      In the above, each line represents a fragment boundary. The plan includes mux exchanges between the two "lower" scans and the hash join.
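
      The SQL shape is roughly as follows. This is a minimal sketch only; every table, path, and column name here is hypothetical:

        CREATE TABLE dfs.tmp.`result` AS
        SELECT a.key, w.w1, w.w2
        FROM dfs.`/data/a.parquet` a                   -- upper scan
        JOIN (
          SELECT t.key, t.w1,
                 SUM(t.v) OVER (PARTITION BY t.p2 ORDER BY t.o2) AS w2       -- second window (sort on p2, o2)
          FROM (
            SELECT b.key, b.p2, b.o2, b.v,
                   ROW_NUMBER() OVER (PARTITION BY b.p1 ORDER BY b.o1) AS w1 -- first window (sort on p1, o1)
            FROM dfs.`/data/b.parquet` b               -- lower scan
            JOIN dfs.`/data/c.parquet` c               -- lower scan
              ON b.id = c.id                           -- lower hash join
          ) t
        ) w ON a.key = w.key;                          -- upper hash join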

      The total query time is 6 hours. Of that, only about 30 minutes is spent working; the remaining 5.5 hours is spent waiting. (The 30 minutes is obtained by summing the "Avg Runtime" column in the profile.)

      Checking resource usage with "top" showed only a small amount of CPU in use: with 40 fragments per node we should have seen about 4000% (40 busy cores), but we actually saw only around 300-400%. This again indicates that the query spent most of its time doing nothing: not using CPU.

      In particular, the sender spends about 5 hours waiting for the receiver, which in turn spends about 5 hours waiting for the sender. This pattern occurs in every exchange on the "main" data path (the 20B records).

      As an experiment, the user disabled mux exchanges. The system became overloaded at 40 fragments per node, so parallelism was reduced to 20. Now the partition sender waited for the unordered receiver, and vice versa.
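
      For reference, this experiment corresponds to session settings along these lines (standard Drill options; shown here as a hypothetical sketch):

        -- Turn off mux exchanges for the session
        ALTER SESSION SET `planner.enable_mux_exchange` = false;
        -- Reduce parallelism from 40 to 20 fragments per node
        ALTER SESSION SET `planner.width.max_per_node` = 20;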

      The original query incurred spilling. We hypothesized that the spilling caused delays which somehow rippled through the DAG. However, the user revised the query to eliminate spilling and reduced it to just the "bottom" hash join. The query ran for an hour, of which 3/4 of the time was again spent with senders and receivers waiting for each other.

      We have eliminated a number of potential causes:

      • System has sufficient memory
      • MapR-FS file system has plenty of spindles and ample I/O capacity.
      • Network is fast
      • No other load on the nodes
      • Query was simplified to the simplest possible form: a single join (with exchanges)
      • If the query is simplified further (scan and write to Parquet, no join; sketched below), it completes in just a few minutes: about as fast as the disk I/O rate.
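
      That simplest variant is essentially a straight Parquet-to-Parquet copy, along these lines (hypothetical paths, reusing the names from the sketch above):

        CREATE TABLE dfs.tmp.`copy` AS
        SELECT * FROM dfs.`/data/a.parquet`;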

      The query profile does not provide sufficient information to dig further: it reports aggregate wait times, but does not tell us, for example, which fragments wait on which other fragments, or for how long.

      We believe that, if the exchange delays are fixed, the query that now takes six hours should complete in under half an hour, even with shuffles, spilling, reading from Parquet, and writing to Parquet.


          People

            Assignee: Unassigned
            Reporter: Paul Rogers