[IMPALA-8214] Bad plan in load_nested.py - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 3.1.0
Fix Version/s: Impala 3.2.0
Component/s: Infrastructure
Labels:
None

Epic Color:
ghx-label-8

Description

The plan for the below SQL, which is executed without stats, has the larger input on the build side of the join and does a broadcast join, which is very suboptimal. This causes high memory consumption when loading larger scale factors, and generally makes the loading process slower than necessary. We should flip the join and make it a shuffle join.

https://github.com/apache/impala/blob/d481cd4/testdata/bin/load_nested.py#L123


      tmp_customer_sql = r"""
          SELECT
            c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment,
            c_comment,
            GROUP_CONCAT(
              CONCAT(
                CAST(o_orderkey AS STRING), '\003',
                CAST(o_orderstatus AS STRING), '\003',
                CAST(o_totalprice AS STRING), '\003',
                CAST(o_orderdate AS STRING), '\003',
                CAST(o_orderpriority AS STRING), '\003',
                CAST(o_clerk AS STRING), '\003',
                CAST(o_shippriority AS STRING), '\003',
                CAST(o_comment AS STRING), '\003',
                CAST(lineitems_string AS STRING)
              ), '\002'
            ) orders_string
          FROM {source_db}.customer
          LEFT JOIN tmp_orders_string ON c_custkey = o_custkey
          WHERE c_custkey % {chunks} = {chunk_idx}
          GROUP BY 1, 2, 3, 4, 5, 6, 7, 8""".format(**sql_params)

Attachments

Activity

People

Assignee:: Tim Armstrong

Reporter:: Tim Armstrong

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 19/Feb/19 18:28

Updated:: 21/Feb/19 05:02

Resolved:: 21/Feb/19 05:02