Hive / HIVE-13019

Optimizer COLLECT_LIST/COLLECT_SET


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: CBO, Logical Optimizer
    • Labels: None

    Description

      Currently, when a query applies COLLECT_SET/COLLECT_LIST to data from a single table, the aggregation is performed only after every JOIN in the query has run. For example:

      insert into table nested_customers_orders
      select c.*, collect_list(named_struct("oid", o.oid, "order_date", o.date...))
      from customers c inner join orders o on (c.cid = o.oid)
      group by o.oid, o.date,...
      

      If we can tell the optimizer to perform the COLLECT_LIST first, before the JOIN (where possible), queries of this pattern should see performance gains.
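
      To illustrate the plan shape being requested, here is a rough sketch of the same idea written by hand: the orders are collected in a subquery first, and only the already-aggregated rows are joined to customers. This is not something the optimizer does today, and it rests on assumptions that are not in the report: a hypothetical `orders.cid` column holding the customer key, the alias `o_agg`, and the output column name `orders`. The `date` column is quoted with backticks because DATE is a reserved word in newer Hive versions. The rewrite should only match the join-then-aggregate form when the aggregate reads columns from `orders` alone and the subquery groups by the join key.

      ```sql
      -- Sketch only: aggregate orders per customer first, then join.
      -- Assumes a hypothetical orders.cid column holding the customer key.
      SELECT c.*, o_agg.orders
      FROM customers c
      INNER JOIN (
        -- One row per customer key, carrying the collected order structs.
        SELECT o.cid,
               collect_list(named_struct('oid', o.oid, 'order_date', o.`date`)) AS orders
        FROM orders o
        GROUP BY o.cid
      ) o_agg
      ON (c.cid = o_agg.cid);
      ```

      With this shape the join processes one row per customer key from the orders side instead of one row per order, which is where the expected gain would come from.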


          People

            • Assignee: Unassigned
            • Reporter: Dustin Cote (cotedm)
