Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-838

Optimize "join to most recent" queries

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      A common class of queries joins a fact table to an "current" record in a dimension table. What appears to be a many-to-many join becomes many-to-one because there is only one current record.

      Here the query on orders (fact) and customers (dimension):

      SELECT *
      FROM Orders
      JOIN Customers AS c USING (customerId)
      WHERE NOT EXISTS (
        SELECT 1
        FROM Customers
        WHERE customerId = c.CustomerId 
        AND effectiveDate > c.effectiveDate)

      Same query pushing WHERE into JOIN:

      SELECT *
      FROM Orders AS o
      JOIN Customers AS c
      ON o.customerId = c.customerId
      AND NOT EXISTS (
        SELECT 1
        FROM Customers
        WHERE customerId = c.CustomerId 
        AND effectiveDate > c.effectiveDate)

      The same query re-formulated using LAST_VALUE:

      SELECT *
      FROM Orders AS o
      JOIN (
        SELECT last_value(customerId) OVER w AS customerId,
          last_value(customerName) OVER w AS customerName
        FROM Customers AS c
        WINDOW w AS (PARTITION BY customerId
          ORDER BY effectiveDate))

      This formulation avoids the self-join, and is uncorrelated. Both good things.

      It could probably also be written using correlated NOT IN. (Not good, because any plan chosen for NOT IN has to watch out for null values.)

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              julianhyde Julian Hyde
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: