Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Currently we maximise parallelism given the number of input splits available. This is often a good decision, unless there are very many small input splits, particularly small files. We could avoid this pathological behaviour by requiring a minimum number of input bytes per instance. This is still fairly crude, since file input bytes correlate only loosely with the amount of work required.
An example:
[localhost.EXAMPLE.COM:21050] default> show files in functional.alltypes;
Query: show files in functional.alltypes
+-------------------------------------------------------------------------------+---------+--------------------+
| Path                                                                          | Size    | Partition          |
+-------------------------------------------------------------------------------+---------+--------------------+
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=1/090101.txt  | 19.95KB | year=2009/month=1  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=2/090201.txt  | 18.12KB | year=2009/month=2  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=3/090301.txt  | 20.06KB | year=2009/month=3  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=4/090401.txt  | 19.61KB | year=2009/month=4  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=5/090501.txt  | 20.36KB | year=2009/month=5  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=6/090601.txt  | 19.71KB | year=2009/month=6  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=7/090701.txt  | 20.36KB | year=2009/month=7  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=8/090801.txt  | 20.36KB | year=2009/month=8  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=9/090901.txt  | 19.71KB | year=2009/month=9  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=10/091001.txt | 20.36KB | year=2009/month=10 |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=11/091101.txt | 19.71KB | year=2009/month=11 |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=12/091201.txt | 20.36KB | year=2009/month=12 |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=1/100101.txt  | 20.36KB | year=2010/month=1  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=2/100201.txt  | 18.39KB | year=2010/month=2  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=3/100301.txt  | 20.36KB | year=2010/month=3  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=4/100401.txt  | 19.71KB | year=2010/month=4  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=5/100501.txt  | 20.36KB | year=2010/month=5  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=6/100601.txt  | 19.71KB | year=2010/month=6  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=7/100701.txt  | 20.36KB | year=2010/month=7  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=8/100801.txt  | 20.36KB | year=2010/month=8  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=9/100901.txt  | 19.71KB | year=2010/month=9  |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=10/101001.txt | 20.36KB | year=2010/month=10 |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=11/101101.txt | 19.71KB | year=2010/month=11 |
| hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=12/101201.txt | 20.36KB | year=2010/month=12 |
+-------------------------------------------------------------------------------+---------+--------------------+

[localhost:21000] default> set mt_dop=8; select count(*) from functional.alltypes; summary;
MT_DOP set to 8
Query: select count(*) from functional.alltypes
Query submitted at: 2020-10-30 17:47:26 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=d242927a09d39968:3ae0ecec00000000
+----------+
| count(*) |
+----------+
| 7300     |
+----------+
Fetched 1 row(s) in 0.19s
+---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
| Operator            | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
| F01:ROOT            | 1      | 242.85us | 242.85us |       |            | 0 B       | 0 B           |                     |
| 03:AGGREGATE        | 1      | 2.51ms   | 2.51ms   | 1     | 1          | 16.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE         | 1      | 616.31us | 616.31us | 24    | 1          | 240.00 KB | 16.00 KB      | UNPARTITIONED       |
| F00:EXCHANGE SENDER | 24     | 1.34ms   | 2.66ms   |       |            | 16.00 KB  | 0 B           |                     |
| 01:AGGREGATE        | 24     | 1.52ms   | 2.13ms   | 24    | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN HDFS        | 24     | 32.76ms  | 35.75ms  | 7.30K | 7.30K      | 32.00 KB  | 16.00 MB      | functional.alltypes |
+---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
In this example, we create 8 instances per Impala daemon, each scanning a tiny amount of data. We would typically be better off creating fewer instances to avoid the per-instance overhead.
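To make the proposal concrete, here is a minimal sketch of how a minimum-bytes-per-instance threshold could cap the instance count. It is not Impala's scheduler code: the function name, its parameters, and the 1 MiB threshold are all hypothetical, and the host count of 3 is inferred from the 24 instances shown in the summary above with mt_dop=8.

```python
def capped_instance_count(split_sizes, mt_dop, num_hosts, min_bytes_per_instance):
    """Return how many scan instances to create (hypothetical helper).

    split_sizes: sizes in bytes of the input splits to scan.
    mt_dop: maximum instances per host (the MT_DOP query option).
    num_hosts: number of hosts that can run scan instances.
    min_bytes_per_instance: hypothetical threshold; each instance should
        receive at least this many input bytes where possible.
    """
    if min_bytes_per_instance <= 0:
        raise ValueError("min_bytes_per_instance must be positive")

    # Current behaviour as described above: as many instances as the splits
    # and MT_DOP allow.
    max_parallelism = max(1, min(len(split_sizes), mt_dop * num_hosts))

    # Proposed cap: no more instances than the total input bytes can "fill"
    # at the minimum threshold, but always at least one instance.
    total_bytes = sum(split_sizes)
    cap_by_bytes = max(1, total_bytes // min_bytes_per_instance)

    return min(max_parallelism, cap_by_bytes)


# Roughly the table above: 24 files of about 20 KB each.
small_files = [20 * 1024] * 24
print(capped_instance_count(small_files, mt_dop=8, num_hosts=3,
                            min_bytes_per_instance=1024 * 1024))
```

With a threshold of 1 byte the function reproduces the uncapped behaviour (24 instances for this input). Large inputs are unaffected because the cap by bytes then exceeds mt_dop times the host count. As noted in the description, bytes are only a rough proxy for work, so a real threshold would need tuning.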