Details
-
Improvement
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
impala 2.3
-
None
-
CDH 5.5; Impala 2.3
Description
We tested today a workaround for a BI tool to fetch nested data - created an Impala view for a table with nested collection.
OBIEE (and many other BI tools) require to define "dimension" tables (like a customer dimensions), and facts (for example customer transactions). With a table that has a nested collection (fact), all non-nested attributes make up a customer dimension. And nested collection is a fact.
So when we created a view that, e.g.
CREATE VIEW adash_mrt.amf_trans_fact_vw AS select fact.individ, trans.* from adash_mrt.amf_tsp_fact fact , fact.amft trans;
And presented amf_trans_fact_vw to OBIEE as a fact, and adash_mrt.amf_tsp_fact as a customer dimension.
So BI tools like OBIEE then always would try to join that dimension and a fact, for example like
SELECT count(dim.individ), sum(trans.activity_dollars) FROM amf_tsp_fact dim, amf_trans_fact_vw trans WHERE dim.individ = trans.individ and trans.activity_type='M' and dim.tsp_gender='2' and dim.final_ace_city='SUPERIOR';
Then Impala would actually self-join that table amf_tsp_fact with a nested collection amft , rendering all benefits of nested collection useless. Because main benefit is avoiding expensive joins, but Impala's query optimizer not that smart to avoid that self-join..
This JIRA is a suggestion to introduce a query-level hint or a new "set" session-level parameter that would eliminate such a self-join for BI tools that do not support nested types directly. It might be a hint in that CREATE VIEW query too.
Here's explain plan for the above query:
10:AGGREGATE [FINALIZE] | output: count:merge(dim.individ), sum:merge(CAST(dim.activity_total_dollars_spent AS BIGINT)), sum:merge(trans.activity_dollars) | 09:EXCHANGE [UNPARTITIONED] | 07:AGGREGATE | output: count(dim.individ), sum(trans.item.activity_dollars) | 06:HASH JOIN [INNER JOIN, BROADCAST] | hash predicates: fact.individ = dim.individ | |--08:EXCHANGE [BROADCAST] | | | 00:SCAN HDFS [adash_mrt.amf_tsp_fact dim] | partitions=1/1 files=6500 size=291.37GB | predicates: dim.tsp_gender = '2', dim.final_ace_city = 'SUPERIOR' | 02:SUBPLAN | |--05:NESTED LOOP JOIN [CROSS JOIN] | | | |--03:SINGULAR ROW SRC | | | 04:UNNEST [fact.amft trans] | 01:SCAN HDFS [adash_mrt.amf_tsp_fact fact] partitions=1/1 files=6500 size=291.37GB predicates on trans: trans.item.activity_type = 'M'
Notice HASH JOIN.
We implemented nested data types at the data lake layer, but stuck with BI tools because they don't support nested collections yet. It seems a low hanging fruit to close the gap.
Attachments
Issue Links
- relates to
-
IMPALA-3531 Implement deferrable and optionally enforced PK/FK constraints
- In Progress