IMPALA-5260

Have query optimizer make joined tables distinct to improve performance

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: None
    • Component/s: Frontend

    Description

      Consider the following select statement:

      select tB.bField, count(tA.aField) ct
      from tableA tA
      join tableB tB using (id)
      where (...)
      group by tB.bField
      order by ct
      

      If tableB has a large number of rows (but still fewer than tableA), this query can be orders of magnitude slower than the equivalent query:

      select tB.bField, count(tA.aField) ct
      from tableA tA
      join (select distinct bField, id[, ...] from tableB) tB using (id)
      where (...)
      group by tB.bField
      order by ct
      

      It appears to me that the slower query gets bogged down with shuttling unnecessary data between nodes.

      Would it be possible, and beneficial, for Impala's query optimizer to apply this rewrite automatically?
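
For anyone evaluating the rewrite, the following is a minimal sketch of when the two query shapes return the same counts. It is not Impala code: it runs both queries on SQLite through Python's standard sqlite3 module, the toy rows and the extra column are hypothetical, and the elided where (...) clause is left out. Under those assumptions, the results match when each (id, bField) pair occurs once in tableB and diverge when tableB holds duplicate pairs, so an automatic rewrite would presumably need to establish that condition first.

```python
import sqlite3

# Hypothetical toy rows; table and column names follow the queries in the description.
TABLE_A = [(1, "a1"), (1, "a2"), (2, "a3"), (3, "a4")]
UNIQUE_B = [(1, "x", "p"), (2, "y", "q"), (3, "x", "r")]
DUPLICATED_B = UNIQUE_B + [(1, "x", "p2")]  # repeats the (id, bField) pair (1, "x")

ORIGINAL_SQL = """
    select tB.bField, count(tA.aField) ct
    from tableA tA
    join tableB tB using (id)
    group by tB.bField
    order by ct, tB.bField
"""

REWRITTEN_SQL = """
    select tB.bField, count(tA.aField) ct
    from tableA tA
    join (select distinct bField, id from tableB) tB using (id)
    group by tB.bField
    order by ct, tB.bField
"""


def run_both(table_b_rows):
    """Load the toy tables into an in-memory SQLite database and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table tableA (id integer, aField text);
        create table tableB (id integer, bField text, extra text);
    """)
    conn.executemany("insert into tableA values (?, ?)", TABLE_A)
    conn.executemany("insert into tableB values (?, ?, ?)", table_b_rows)
    original = conn.execute(ORIGINAL_SQL).fetchall()
    rewritten = conn.execute(REWRITTEN_SQL).fetchall()
    conn.close()
    return original, rewritten


if __name__ == "__main__":
    for label, rows in (("unique pairs", UNIQUE_B), ("duplicate pairs", DUPLICATED_B)):
        original, rewritten = run_both(rows)
        print(label, "same result:", original == rewritten)
```

The secondary sort key on tB.bField is added only to make the output order deterministic for comparison; it is not part of the queries in the description.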


          People

            Assignee: Unassigned
            Reporter: Michael Sokalski (ski309)
