[IMPALA-10809] improve the performance of unnest operation - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- complextype

Epic Link:
Built-in function support for complex types
Epic Color:
ghx-label-7

Description

current situation

Impala's support for complex data types is not particularly friendly.

For example, if you need to expand rows containing Array type fields, you need to unnest the array fields first, and then do a nested loop join.

If you need to expand multiple array fields, you need to do multiple unnests, And perform multiple unest and nested loop joins, which puts a lot of computational pressure on the executor.

DDL:

CREATE TABLE rawdata.users2 (                                     
 day INT,                                                        
 sampling_group INT,                                             
 user_id BIGINT,                                                 
  time TIMESTAMP,                                                 
 _offset BIGINT,                                                 
 event_id INT,                                                   
 month_id INT,                                                   
 week_id INT,                                                    
 distinct_id STRING,                                             
 event_bucket INT,                                               
 adresses_list_string ARRAY<STRING>,                             
 count_list_bigint ARRAY<BIGINT>                                 
 )                                                                 
 WITH SERDEPROPERTIES ('serialization.format'='1')                 
 STORED AS PARQUET                                                 
 LOCATION 'hdfs://localhost:20500/test-warehouse/rawdata.db/users2'

Query SQL:

SELECT
    `day`,
    list`.item,
   list1.item 
 FROM
   rawdata.users2,
   rawdata.users2.adresses_list_string list1,
   rawdata.users2.count_list_bigint list2

Simplified Plan:

F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|
07:EXCHANGE [UNPARTITIONED]
|
01:SUBPLAN
|
|--06:NESTED LOOP JOIN [CROSS JOIN]
| |
| |--04:UNNEST [users2.count_list_bigint clist]
| |
| 05:NESTED LOOP JOIN [CROSS JOIN]
| |
| |--02:SINGULAR ROW SRC
| |
| 03:UNNEST [users2.adresses_list_string list]
|
00:SCAN HDFS [rawdata.users2, RANDOM]

Improve Solution

In actual use, I found that if some changes are made to the calculation logic of unnest, the calculation performance will be greatly improved:

At first, in FE construct and new plan type, named explode node, it and it's child node construct a pipeline operation

then, in BE, the raw was explode locally, and the fileds layout as childnode

the query sql and Plan greatly simplified：

Query SQL:

SELECT
    `day`,
   explode(adresses_list_string),
   explode(count_list_bigint) 
 from
   rawdata.users2

the simplified Plan as this：

F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|
02:EXCHANGE [UNPARTITIONED]
|
01:EXPLODE NODE [UNPARTITIONED]
|
00:SCAN HDFS [rawdata.users2, RANDOM]

Attachments

Activity

People

Assignee:: Csaba Ringhofer

Reporter:: pengdou1990

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 20/Jul/21 08:04

Updated:: 06/Sep/23 06:12