We tested today a workaround for a BI tool to fetch nested data - created an Impala view for a table with nested collection.
OBIEE (and many other BI tools) require to define "dimension" tables (like a customer dimensions), and facts (for example customer transactions). With a table that has a nested collection (fact), all non-nested attributes make up a customer dimension. And nested collection is a fact.
So when we created a view that, e.g.
And presented amf_trans_fact_vw to OBIEE as a fact, and adash_mrt.amf_tsp_fact as a customer dimension.
So BI tools like OBIEE then always would try to join that dimension and a fact, for example like
Then Impala would actually self-join that table amf_tsp_fact with a nested collection amft , rendering all benefits of nested collection useless. Because main benefit is avoiding expensive joins, but Impala's query optimizer not that smart to avoid that self-join..
This JIRA is a suggestion to introduce a query-level hint or a new "set" session-level parameter that would eliminate such a self-join for BI tools that do not support nested types directly. It might be a hint in that CREATE VIEW query too.
Here's explain plan for the above query:
Notice HASH JOIN.
We implemented nested data types at the data lake layer, but stuck with BI tools because they don't support nested collections yet. It seems a low hanging fruit to close the gap.