Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-768

Hopping windows

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • stream
    • None

    Description

      Recall that a hopping window emits a sub-total every X seconds of
      records that have arrived over the last Y seconds. A tumbling window
      is a hopping window where X and Y are equal.

      In https://calcite.incubator.apache.org/docs/stream.html#hopping-windows
      I give an example, "emit, every hour, the number of each product
      ordered over the past three hours".

      That example gives a query in terms of a GROUP BY (in the HourlyTotals
      view) followed by a moving sum. I didn't think that it was possible to
      express using just one GROUP BY, because that would violate one of the
      principles of SQL: that each record entering a GROUP BY contributes to
      precisely one output record.

      But I've just realized that the CUBE, ROLLUP and GROUPING SETS
      operators (already in SQL) violate that principle. And if they can do
      it, we can do the same. So we could add another grouping function,
      HOP(t, emit, retain).

      The query would look like this:

      SELECT STREAM START(rowtime) AS rowtime,
        productId,
        SUM(units) AS sumUnits,
        COUNT(*) AS c
      FROM Orders
      GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR),
        productId
      

      Much nicer than the one in stream.html!

      The "trick" is that the HOP function is returning a list of rowtime
      values. For example, for row 1

      {rowtime: '09:33', ...}

      it will return
      ['09:00', '10:00', '11:00']; for row 2

      {rowtime: '10:05', ...}

      it will
      return ['10:00', '11:00', '12:00']. The system adds each row to
      several sub-totals, and emits each sub-total when it is complete. The
      sub-total for '09:00' will contain only row 1, and will be emitted at
      '10:00'; the sub-total for '10:00' will contain row 1 and row 2, and
      will be emitted at '11:00', and so forth.

      Returning multiple values is related to the flatMap function in Spark
      (and earlier selectMany in LINQ) and makes HOP's semantics similar to
      GROUPING SETS and therefore sound.

      START is a new aggregate function that returns the lower bound of the
      current sub-total; END similarly.

      Note that the "retain" argument does not need to be a whole multiple
      of the "emit" argument. This was a major limitation in the previous
      proposal.

      There are some straightforward extensions:

      • Define a TUMBLE function;
      • Add an "align" argument to HOP, to allow windows to start at, say, 5
        minutes past each hour;
      • Apply HOP to windows based on row-counts;
      • Allow user-defined windowing functions that similarly return a list
        of interval start-end points.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              julianhyde Julian Hyde
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: