[IMPALA-2929] Add a hint to eliminate self-join for BI systems that don't support nested types yet - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: impala 2.3
Fix Version/s: None
Component/s: Frontend
Labels:
Environment:
CDH 5.5; Impala 2.3

Target Version:

Product Backlog

Description

We tested today a workaround for a BI tool to fetch nested data - created an Impala view for a table with nested collection.
OBIEE (and many other BI tools) require to define "dimension" tables (like a customer dimensions), and facts (for example customer transactions). With a table that has a nested collection (fact), all non-nested attributes make up a customer dimension. And nested collection is a fact.

So when we created a view that, e.g.

CREATE VIEW adash_mrt.amf_trans_fact_vw 
AS
select fact.individ, trans.*
from adash_mrt.amf_tsp_fact fact
   , fact.amft trans;

And presented amf_trans_fact_vw to OBIEE as a fact, and adash_mrt.amf_tsp_fact as a customer dimension.

So BI tools like OBIEE then always would try to join that dimension and a fact, for example like

SELECT count(dim.individ), sum(trans.activity_dollars)
FROM amf_tsp_fact dim, amf_trans_fact_vw trans
WHERE dim.individ = trans.individ 
      and trans.activity_type='M'  
      and dim.tsp_gender='2'  
      and dim.final_ace_city='SUPERIOR';

Then Impala would actually self-join that table amf_tsp_fact with a nested collection amft , rendering all benefits of nested collection useless. Because main benefit is avoiding expensive joins, but Impala's query optimizer not that smart to avoid that self-join..

This JIRA is a suggestion to introduce a query-level hint or a new "set" session-level parameter that would eliminate such a self-join for BI tools that do not support nested types directly. It might be a hint in that CREATE VIEW query too.

Here's explain plan for the above query:

10:AGGREGATE [FINALIZE]
|  output: count:merge(dim.individ), sum:merge(CAST(dim.activity_total_dollars_spent AS BIGINT)), sum:merge(trans.activity_dollars)
|
09:EXCHANGE [UNPARTITIONED]
|
07:AGGREGATE
|  output: count(dim.individ), sum(trans.item.activity_dollars)
|
06:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: fact.individ = dim.individ
|
|--08:EXCHANGE [BROADCAST]
|  |
|  00:SCAN HDFS [adash_mrt.amf_tsp_fact dim]
|     partitions=1/1 files=6500 size=291.37GB
|     predicates: dim.tsp_gender = '2', dim.final_ace_city = 'SUPERIOR'
|
02:SUBPLAN
|
|--05:NESTED LOOP JOIN [CROSS JOIN]
|  |
|  |--03:SINGULAR ROW SRC
|  |
|  04:UNNEST [fact.amft trans]
|
01:SCAN HDFS [adash_mrt.amf_tsp_fact fact]
   partitions=1/1 files=6500 size=291.37GB
   predicates on trans: trans.item.activity_type = 'M'

Notice HASH JOIN.

We implemented nested data types at the data lake layer, but stuck with BI tools because they don't support nested collections yet. It seems a low hanging fruit to close the gap.

Attachments

Issue Links

relates to

IMPALA-3531 Implement deferrable and optionally enforced PK/FK constraints

In Progress

Activity

People

Assignee:: Unassigned

Reporter:: Ruslan Dautkhanov

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 03/Feb/16 17:27

Updated:: 17/Apr/20 14:09