Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Labels: ghx-label-7
Description
Current situation
Impala's support for complex data types is not particularly friendly.
For example, to expand rows that contain an ARRAY-typed field, you first have to unnest the array field and then perform a nested loop join.
If you need to expand multiple array fields, you have to perform multiple unnests and multiple nested loop joins, which puts a lot of computational pressure on the executor.
DDL:
CREATE TABLE rawdata.users2 (
  day INT,
  sampling_group INT,
  user_id BIGINT,
  time TIMESTAMP,
  _offset BIGINT,
  event_id INT,
  month_id INT,
  week_id INT,
  distinct_id STRING,
  event_bucket INT,
  adresses_list_string ARRAY<STRING>,
  count_list_bigint ARRAY<BIGINT>
)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS PARQUET
LOCATION 'hdfs://localhost:20500/test-warehouse/rawdata.db/users2'
Query SQL:
SELECT `day`, list1.item, list2.item FROM rawdata.users2, rawdata.users2.adresses_list_string list1, rawdata.users2.count_list_bigint list2
Simplified Plan:
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
07:EXCHANGE [UNPARTITIONED]
|
01:SUBPLAN
|
|--06:NESTED LOOP JOIN [CROSS JOIN]
|  |
|  |--04:UNNEST [users2.count_list_bigint clist]
|  |
|  05:NESTED LOOP JOIN [CROSS JOIN]
|  |
|  |--02:SINGULAR ROW SRC
|  |
|  03:UNNEST [users2.adresses_list_string list]
|
00:SCAN HDFS [rawdata.users2, RANDOM]
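For illustration, the following is a small conceptual sketch in Python (not Impala code; the sample row values are made up) of what this plan computes for one input row: the two unnested arrays are cross-joined, so each input row expands into len(adresses_list_string) * len(count_list_bigint) output rows, which is where the extra executor work comes from.

from itertools import product

# Made-up sample row of rawdata.users2 with two array columns.
row = {"day": 20240101,
       "adresses_list_string": ["a1", "a2", "a3"],
       "count_list_bigint": [10, 20]}

# The UNNEST + CROSS JOIN plan emits one output row per pair of array items,
# so a single input row expands to len(a) * len(b) output rows.
expanded = [(row["day"], addr, cnt)
            for addr, cnt in product(row["adresses_list_string"],
                                     row["count_list_bigint"])]

print(len(expanded))  # 3 * 2 = 6 output rows from one input row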
Proposed solution
In practice, I found that changing how the unnest is computed greatly improves performance:
First, in the FE, construct a new plan node type, an explode node, which forms a pipeline with its child node.
Then, in the BE, the rows are exploded locally, and the exploded fields follow the row layout of the child node.
The query SQL and plan are greatly simplified:
Query SQL:
SELECT `day`, explode(adresses_list_string), explode(count_list_bigint) from rawdata.users2
Simplified Plan:
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
02:EXCHANGE [UNPARTITIONED]
|
01:EXPLODE NODE [UNPARTITIONED]
|
00:SCAN HDFS [rawdata.users2, RANDOM]
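To make the intended behavior concrete, here is a rough conceptual sketch in Python (not actual FE/BE code; the function names and sample data are only illustrative, and it assumes the explode node keeps the same cross-product semantics as the original unnest query) of how the explode node expands rows locally as a pipeline stage directly above the scan, with no SUBPLAN or NESTED LOOP JOIN operators.

from itertools import product

# 00:SCAN HDFS stand-in: yields rows of rawdata.users2 one by one.
def scan(table):
    yield from table

# 01:EXPLODE NODE stand-in: expands each row's array columns in place,
# streaming output rows without any join operator.
def explode_node(rows, array_cols):
    for row in rows:
        for items in product(*(row[c] for c in array_cols)):
            yield (row["day"], *items)

# Made-up sample data for one row of rawdata.users2.
table = [{"day": 20240101,
          "adresses_list_string": ["a1", "a2"],
          "count_list_bigint": [10, 20, 30]}]

for out_row in explode_node(scan(table), ["adresses_list_string", "count_list_bigint"]):
    print(out_row)  # 2 * 3 = 6 output rows, produced in a single pipeline

Because the expansion happens inline above the scan, each row is exploded exactly once and the subplan with its repeated nested loop joins disappears from the plan.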