Calcite / CALCITE-237

Allow user-defined aggregate functions (UDAs) to be defined in a model

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Allow user-defined aggregate functions (UDAs) to be defined in a model, using a simple class to implement the UDA.

      A class can implement a UDA if it has `init` and `add` methods, and optionally `initAdd`, `merge` and `result`. For example:

      ```java
      /** Example UDA that sums int values into a long accumulator. */
      public class MySumAggFun {
        public MySumAggFun() {}
        /** Returns an empty accumulator. */
        public long[] init() { return new long[] {0}; }
        /** Returns an accumulator initialized with the first value. */
        public long[] initAdd(int v) { return new long[] {v}; }
        /** Folds one value into the accumulator. */
        public void add(long[] accumulator, int v) { accumulator[0] += v; }
        /** Combines two partial accumulators. */
        public long[] merge(long[] accumulator0, long[] accumulator1) {
          return new long[] {accumulator0[0] + accumulator1[0]};
        }
        /** Extracts the final result from the accumulator. */
        public long result(long[] accumulator) { return accumulator[0]; }
      }
      ```

      Every UDA has a list of argument types (zero or more elements, usually one), an accumulator type, and a result type. These are inferred from the method signatures.

      In this example, the accumulator type is `long[]`, the argument type is `[int]`, and the result type is `long`.

      If there is no `result` method, the accumulator and result types are the same, and the result is just the accumulator.

      If there is no `initAdd` method, Optiq calls `init` followed by `add`.
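
      To make the calling convention concrete, here is a rough sketch of how an engine might drive these methods over a stream of values. This is illustrative only, not Optiq's actual code; the `UdaDriverSketch` class and `computeSum` helper are invented for the example:

      ```java
      /** Illustrative driver only; not Optiq's actual implementation. */
      public class UdaDriverSketch {
        static long computeSum(int[] values) {
          MySumAggFun f = new MySumAggFun();
          long[] acc = null;
          for (int v : values) {
            if (acc == null) {
              acc = f.initAdd(v);  // first value: use initAdd, when the class provides it
              // without initAdd, the engine would call: acc = f.init(); f.add(acc, v);
            } else {
              f.add(acc, v);
            }
          }
          if (acc == null) {
            acc = f.init();        // empty input: accumulator comes from init alone
          }
          return f.result(acc);    // with no result method, acc itself would be the result
        }

        public static void main(String[] args) {
          System.out.println(computeSum(new int[] {1, 2, 3}));  // prints 6
        }
      }
      ```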

      The methods can all be static, in which case the class does not need a public zero-argument constructor.
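
      For example, a hypothetical all-static variant of the same function (the class name `MySumAggFunStatic` is invented here) could look like this:

      ```java
      /** Hypothetical all-static UDA: no public zero-argument constructor needed. */
      public class MySumAggFunStatic {
        private MySumAggFunStatic() {}  // never instantiated; all methods are static

        public static long[] init() { return new long[] {0}; }
        public static void add(long[] accumulator, int v) { accumulator[0] += v; }
        public static long result(long[] accumulator) { return accumulator[0]; }
      }
      ```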

      Example model:

      ```json
      {
        schemas: [
          {
            name: 'HR',
            functions: [
              { name: 'MY_SUM', className: 'com.example.MySumAggFun' }
            ]
          }
        ]
      }
      ```
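
      Once the model is loaded, `MY_SUM` can be used from SQL like a built-in aggregate, e.g. `select deptno, my_sum(empid) from hr.emps group by deptno` (the `emps` table and exact identifier casing are assumptions for illustration; they are not part of this ticket).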

      ---------------- Imported from GitHub ----------------
      Url: https://github.com/julianhyde/optiq/issues/237
      Created by: julianhyde
      Labels:
      Created at: Sun Apr 06 23:31:50 CEST 2014
      State: closed

        Activity

        github-import GitHub Import added a comment -

        [Date: Mon Apr 07 07:39:26 CEST 2014, Author: vlsi]

        Do you have a plan for how to cope with null behaviour (i.e. strict vs. non-strict aggregates)?

        When defining the `merge` API, you need to define whether the aggregate is commutative or not. For instance, `sum(int)` is commutative, while `sum(float)` is not.

        Finally, for incremental calculation of window functions, some `tryUnadd` method will be required.

        github-import GitHub Import added a comment -

        [Date: Wed Apr 09 03:28:47 CEST 2014, Author: julianhyde]

        > Do you have a plan how to cope with null behaviour (i.e. strict vs non-strict aggregate)?

        Not part of this change. By all means log a new case. I think an annotation `@Strict(boolean)` against the UDA class would suffice.

        > When defining `merge` api, you need to define if the aggregate is commutative or not. For instance, `sum(int)` is commutative, while `sum(float)` is not.

        If the aggregate does not commute, don't implement the `merge` method.

        In SQL, if you want deterministic results, don't use floating point numbers (Java `float` and `double`, SQL `REAL` or `DOUBLE`) – use the `DECIMAL` type. Given those expectations, `sum(float)` is close enough.

        Remember SQL implementations can (and often do) re-order rows, partition them, and roll them up from previous results.

        > Finally, for incremental calculation of window functions some `tryUnadd` will be required.

        Not sure what you mean by `tryUnadd`, but yes, we could add support for incremental calculation of window functions, and we'd use something like a `drop` method. (This was [in eigenbase](https://github.com/LucidDB/luciddb/blob/d322a8b657b243a85e965487dcf95bbcbd36550c/farrago/src/net/sf/farrago/fennel/calc/RexToCalcTranslator.java#L1204).) Log an issue if this is important to you.

        github-import GitHub Import added a comment -

        [Date: Wed Apr 09 09:32:30 CEST 2014, Author: vlsi]

        > if you want deterministic results

        If I want deterministic results, I use `order by`. IEEE 754 is deterministic as well.

        > Remember SQL implementations can (and often do) re-order rows, partition them, and roll them up from previous results.

        It can; however, there are hard limits: `order by` is one of them.

        Window function calculation is defined as performing an aggregate over a window.

        No roundoff errors are allowed to pass from "previous" windows. In other words, you cannot safely `drop` values from precomputed `sum(float)` aggregate.

        > and we'd use something like a `drop` method. (This was in eigenbase.) Log an issue if this is important to you.

        It is not yet important for me, so I'll postpone creating an issue.

        github-import GitHub Import added a comment -

        [Date: Wed Apr 09 11:21:20 CEST 2014, Author: julianhyde]

        Since you're querying Java collections, I can see how you hope and expect to get deterministic results. And with Optiq you will probably get them.

        Just bear in mind that in standard SQL, tables are not ordered. Even if there is an `order by`, it might only be applied after all other calculations have been performed.

        Complex queries, say with 2 or 3 joins, are almost certainly not going to be performed in 'nested loops' order.

        So I'm not going to promise that `select sum(sal) from emp join dept using (deptno)` will always produce exactly the same result if `sal` is a `float` column. If I did, I'd be committing to always using the same join algorithm.

        github-import GitHub Import added a comment -

        [Date: Wed Apr 09 11:40:33 CEST 2014, Author: vlsi]

        >So I'm not going to promise that `select sum(sal) from emp join dept using (deptno)`

        I do not suggest giving such a promise.

        What do you think of the following query?
        `select empid, sum(empid*0.5f) over (order by empid rows between 10 preceding and current row) from emp`

        I am not 100% sure whether the specification requires feeding the aggregate function the elements in window order (there is an explicit order, though).
        Even though an implementation is free to choose how it feeds the aggregates, any sensible implementation must obey the `10 preceding and current row` window specification.

        That means, when advancing to the next row, you cannot just "subtract 1 empid*0.5f and add 1 empid*0.5f". You have to recompute the whole thing from scratch, adding 11 rows.

        github-import GitHub Import added a comment -

        [Date: Thu Apr 10 21:25:08 CEST 2014, Author: julianhyde]

        My view is that if you want exact results you should use exact arithmetic (e.g. decimal). So I'd rewrite your query as

        ```sql
        select empid, sum(empid) over (order by empid rows between 10 preceding and current row) * 0.5f from emp
        ```

        If you are prepared to accept inexact results, step up to double and be prepared to throw away the 2 or 3 least significant digits.

        I'm not minimizing the problem. We could certainly use some algorithms that are well-behaved in a numerical analysis sense.

        But SQL's goal is to process large amounts of data efficiently and so it cannot guarantee to always use a particular algorithm – especially not an inefficient algorithm like the quadratic algorithm you suggest for running sum.


          People

          • Assignee: Unassigned
          • Reporter: github-import GitHub Import
          • Votes: 0
          • Watchers: 0
