Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-8081

Avoid over-parallelizing queries when there are small input splits

Attach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Backend

    Description

      Currently we maximise parallelism given the number of input splits available. This is often a good decision, unless there are very many small input splits, particularly small files. We could avoid this pathological behaviour by having a minimum threshold of input bytes per instance (this is still pretty crude, since file input bytes only correlates loosely with the amount of work required).

      An example:

      [localhost.EXAMPLE.COM:21050] default> show files in functional.alltypes;
      Query: show files in functional.alltypes
      +-------------------------------------------------------------------------------+---------+--------------------+
      | Path                                                                          | Size    | Partition          |
      +-------------------------------------------------------------------------------+---------+--------------------+
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=1/090101.txt  | 19.95KB | year=2009/month=1  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=2/090201.txt  | 18.12KB | year=2009/month=2  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=3/090301.txt  | 20.06KB | year=2009/month=3  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=4/090401.txt  | 19.61KB | year=2009/month=4  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=5/090501.txt  | 20.36KB | year=2009/month=5  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=6/090601.txt  | 19.71KB | year=2009/month=6  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=7/090701.txt  | 20.36KB | year=2009/month=7  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=8/090801.txt  | 20.36KB | year=2009/month=8  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=9/090901.txt  | 19.71KB | year=2009/month=9  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=10/091001.txt | 20.36KB | year=2009/month=10 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=11/091101.txt | 19.71KB | year=2009/month=11 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=12/091201.txt | 20.36KB | year=2009/month=12 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=1/100101.txt  | 20.36KB | year=2010/month=1  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=2/100201.txt  | 18.39KB | year=2010/month=2  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=3/100301.txt  | 20.36KB | year=2010/month=3  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=4/100401.txt  | 19.71KB | year=2010/month=4  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=5/100501.txt  | 20.36KB | year=2010/month=5  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=6/100601.txt  | 19.71KB | year=2010/month=6  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=7/100701.txt  | 20.36KB | year=2010/month=7  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=8/100801.txt  | 20.36KB | year=2010/month=8  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=9/100901.txt  | 19.71KB | year=2010/month=9  |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=10/101001.txt | 20.36KB | year=2010/month=10 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=11/101101.txt | 19.71KB | year=2010/month=11 |
      | hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=12/101201.txt | 20.36KB | year=2010/month=12 |
      +-------------------------------------------------------------------------------+---------+--------------------+
      [localhost:21000] default> set mt_dop=8; select count(*) from functional.alltypes; summary;
      MT_DOP set to 8
      Query: select count(*) from functional.alltypes
      Query submitted at: 2020-10-30 17:47:26 (Coordinator: http://tarmstrong-box:25000)
      Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=d242927a09d39968:3ae0ecec00000000
      +----------+
      | count(*) |
      +----------+
      | 7300     |
      +----------+
      Fetched 1 row(s) in 0.19s
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
      | Operator            | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
      | F01:ROOT            | 1      | 242.85us | 242.85us |       |            | 0 B       | 0 B           |                     |
      | 03:AGGREGATE        | 1      | 2.51ms   | 2.51ms   | 1     | 1          | 16.00 KB  | 10.00 MB      | FINALIZE            |
      | 02:EXCHANGE         | 1      | 616.31us | 616.31us | 24    | 1          | 240.00 KB | 16.00 KB      | UNPARTITIONED       |
      | F00:EXCHANGE SENDER | 24     | 1.34ms   | 2.66ms   |       |            | 16.00 KB  | 0 B           |                     |
      | 01:AGGREGATE        | 24     | 1.52ms   | 2.13ms   | 24    | 1          | 16.00 KB  | 10.00 MB      |                     |
      | 00:SCAN HDFS        | 24     | 32.76ms  | 35.75ms  | 7.30K | 7.30K      | 32.00 KB  | 16.00 MB      | functional.alltypes |
      +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
      
      

      In this example, we create 8 instances per impala daemon to scan a tiny amount of data each. We would be better off, typically, in creating fewer instances to avoid the overhead.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            baggio000 Yida Wu
            janulatha Janaki Lahorani

            Dates

              Created:
              Updated:

              Slack

                Issue deployment