Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10809

improve the performance of unnest operation

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • None

    Description

      current situation

      Impala's support for complex data types is not particularly friendly.

      For example, if you need to expand rows containing Array type fields, you need to unnest the array fields first, and then do a nested loop join.

      If you need to expand multiple array fields, you need to do multiple unnests, And perform multiple unest and nested loop joins, which puts a lot of computational pressure on the executor.

      DDL:

      CREATE TABLE rawdata.users2 (                                     
       day INT,                                                        
       sampling_group INT,                                             
       user_id BIGINT,                                                 
        time TIMESTAMP,                                                 
       _offset BIGINT,                                                 
       event_id INT,                                                   
       month_id INT,                                                   
       week_id INT,                                                    
       distinct_id STRING,                                             
       event_bucket INT,                                               
       adresses_list_string ARRAY<STRING>,                             
       count_list_bigint ARRAY<BIGINT>                                 
       )                                                                 
       WITH SERDEPROPERTIES ('serialization.format'='1')                 
       STORED AS PARQUET                                                 
       LOCATION 'hdfs://localhost:20500/test-warehouse/rawdata.db/users2'

      Query SQL:

      SELECT
          `day`,
          list`.item,
         list1.item 
       FROM
         rawdata.users2,
         rawdata.users2.adresses_list_string list1,
         rawdata.users2.count_list_bigint list2

      Simplified Plan:

       

      F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
      |
      07:EXCHANGE [UNPARTITIONED]
      |
      01:SUBPLAN
      |
      |--06:NESTED LOOP JOIN [CROSS JOIN]
      | |
      | |--04:UNNEST [users2.count_list_bigint clist]
      | |
      | 05:NESTED LOOP JOIN [CROSS JOIN]
      | |
      | |--02:SINGULAR ROW SRC
      | |
      | 03:UNNEST [users2.adresses_list_string list]
      |
      00:SCAN HDFS [rawdata.users2, RANDOM]
      
       

      Improve Solution

      In actual use, I found that if some changes are made to the calculation logic of unnest, the calculation performance will be greatly improved:

      At first, in FE construct and new plan type, named explode node, it and it's child node construct a pipeline operation

      then, in BE, the raw was explode locally, and the fileds layout as childnode

      the query sql and Plan greatly simplified:

      Query SQL:

      SELECT
          `day`,
         explode(adresses_list_string),
         explode(count_list_bigint) 
       from
         rawdata.users2

      the simplified Plan as this:

      F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
      |
      02:EXCHANGE [UNPARTITIONED]
      |
      01:EXPLODE NODE [UNPARTITIONED]
      |
      00:SCAN HDFS [rawdata.users2, RANDOM] 

       

      Attachments

        Activity

          People

            csringhofer Csaba Ringhofer
            pengdou1990 pengdou1990
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated: