Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-8081

Avoid over-parallelizing queries when there are small input splits

Attach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Backend
    • Labels:

      Description

      Currently we maximise parallelism given the number of input splits available. This is often a good decision, unless there are very many small input splits, particularly small files. We could avoid this pathological behaviour by having a minimum threshold of input bytes per instance (this is still pretty crude, since file input bytes only correlates loosely with the amount of work required).

      An example:

      [localhost.EXAMPLE.COM:21050] default> show files in functional.alltypes;
      Query: show files in functional.alltypes
      +-------------------------------------------------------------------------------+---------+--------------------+
      | Path                                                                          | Size    | Partition          |
      +-------------------------------------------------------------------------------+---------+--------------------+
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=1/090101.txt  | 19.95KB | year=2009/month=1  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=2/090201.txt  | 18.12KB | year=2009/month=2  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=3/090301.txt  | 20.06KB | year=2009/month=3  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=4/090401.txt  | 19.61KB | year=2009/month=4  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=5/090501.txt  | 20.36KB | year=2009/month=5  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=6/090601.txt  | 19.71KB | year=2009/month=6  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=7/090701.txt  | 20.36KB | year=2009/month=7  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=8/090801.txt  | 20.36KB | year=2009/month=8  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=9/090901.txt  | 19.71KB | year=2009/month=9  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=10/091001.txt | 20.36KB | year=2009/month=10 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=11/091101.txt | 19.71KB | year=2009/month=11 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=12/091201.txt | 20.36KB | year=2009/month=12 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=1/100101.txt  | 20.36KB | year=2010/month=1  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=2/100201.txt  | 18.39KB | year=2010/month=2  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=3/100301.txt  | 20.36KB | year=2010/month=3  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=4/100401.txt  | 19.71KB | year=2010/month=4  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=5/100501.txt  | 20.36KB | year=2010/month=5  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=6/100601.txt  | 19.71KB | year=2010/month=6  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=7/100701.txt  | 20.36KB | year=2010/month=7  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=8/100801.txt  | 20.36KB | year=2010/month=8  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=9/100901.txt  | 19.71KB | year=2010/month=9  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=10/101001.txt | 20.36KB | year=2010/month=10 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=11/101101.txt | 19.71KB | year=2010/month=11 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=12/101201.txt | 20.36KB | year=2010/month=12 |
      +-------------------------------------------------------------------------------+---------+--------------------+
      [localhost:21000] default> set mt_dop=8; select count(*) from functional.alltypes; summary;
      MT_DOP set to 8
      Query: select count(*) from functional.alltypes
      Query submitted at: 2020-10-30 17:47:26 (Coordinator: http://tarmstrong-box:25000)
      Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=d242927a09d39968:3ae0ecec00000000
      +----------+
      | count(*) |
      +----------+
      | 7300     |
      +----------+
      Fetched 1 row(s) in 0.19s
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
      | Operator            | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
      | F01:ROOT            | 1      | 242.85us | 242.85us |       |            | 0 B       | 0 B           |                     |
      | 03:AGGREGATE        | 1      | 2.51ms   | 2.51ms   | 1     | 1          | 16.00 KB  | 10.00 MB      | FINALIZE            |
      | 02:EXCHANGE         | 1      | 616.31us | 616.31us | 24    | 1          | 240.00 KB | 16.00 KB      | UNPARTITIONED       |
      | F00:EXCHANGE SENDER | 24     | 1.34ms   | 2.66ms   |       |            | 16.00 KB  | 0 B           |                     |
      | 01:AGGREGATE        | 24     | 1.52ms   | 2.13ms   | 24    | 1          | 16.00 KB  | 10.00 MB      |                     |
      | 00:SCAN HDFS        | 24     | 32.76ms  | 35.75ms  | 7.30K | 7.30K      | 32.00 KB  | 16.00 MB      | functional.alltypes |
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
      
      

      In this example, we create 8 instances per impala daemon to scan a tiny amount of data each. We would be better off, typically, in creating fewer instances to avoid the overhead.

        Attachments

          Activity

            People

            • Assignee:
              baggio000 Yida Wu
              Reporter:
              janulatha Janaki Lahorani

              Dates

              • Created:
                Updated:

                Issue deployment