Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2929

Add a hint to eliminate self-join for BI systems that don't support nested types yet

    XMLWordPrintableJSON

    Details

      Description

      We tested today a workaround for a BI tool to fetch nested data - created an Impala view for a table with nested collection.
      OBIEE (and many other BI tools) require to define "dimension" tables (like a customer dimensions), and facts (for example customer transactions). With a table that has a nested collection (fact), all non-nested attributes make up a customer dimension. And nested collection is a fact.

      So when we created a view that, e.g.

      CREATE VIEW adash_mrt.amf_trans_fact_vw 
      AS
      select fact.individ, trans.*
      from adash_mrt.amf_tsp_fact fact
         , fact.amft trans;
      

      And presented amf_trans_fact_vw to OBIEE as a fact, and adash_mrt.amf_tsp_fact as a customer dimension.

      So BI tools like OBIEE then always would try to join that dimension and a fact, for example like

      SELECT count(dim.individ), sum(trans.activity_dollars)
      FROM amf_tsp_fact dim, amf_trans_fact_vw trans
      WHERE dim.individ = trans.individ 
            and trans.activity_type='M'  
            and dim.tsp_gender='2'  
            and dim.final_ace_city='SUPERIOR';
      

      Then Impala would actually self-join that table amf_tsp_fact with a nested collection amft , rendering all benefits of nested collection useless. Because main benefit is avoiding expensive joins, but Impala's query optimizer not that smart to avoid that self-join..

      This JIRA is a suggestion to introduce a query-level hint or a new "set" session-level parameter that would eliminate such a self-join for BI tools that do not support nested types directly. It might be a hint in that CREATE VIEW query too.

      Here's explain plan for the above query:

      10:AGGREGATE [FINALIZE]
      |  output: count:merge(dim.individ), sum:merge(CAST(dim.activity_total_dollars_spent AS BIGINT)), sum:merge(trans.activity_dollars)
      |
      09:EXCHANGE [UNPARTITIONED]
      |
      07:AGGREGATE
      |  output: count(dim.individ), sum(trans.item.activity_dollars)
      |
      06:HASH JOIN [INNER JOIN, BROADCAST]
      |  hash predicates: fact.individ = dim.individ
      |
      |--08:EXCHANGE [BROADCAST]
      |  |
      |  00:SCAN HDFS [adash_mrt.amf_tsp_fact dim]
      |     partitions=1/1 files=6500 size=291.37GB
      |     predicates: dim.tsp_gender = '2', dim.final_ace_city = 'SUPERIOR'
      |
      02:SUBPLAN
      |
      |--05:NESTED LOOP JOIN [CROSS JOIN]
      |  |
      |  |--03:SINGULAR ROW SRC
      |  |
      |  04:UNNEST [fact.amft trans]
      |
      01:SCAN HDFS [adash_mrt.amf_tsp_fact fact]
         partitions=1/1 files=6500 size=291.37GB
         predicates on trans: trans.item.activity_type = 'M'
      

      Notice HASH JOIN.

      We implemented nested data types at the data lake layer, but stuck with BI tools because they don't support nested collections yet. It seems a low hanging fruit to close the gap.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                tagar_impala_e3b3 Ruslan Dautkhanov
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated: