Apache Drill
DRILL-556

Implement aggregate functions to compute standard deviation, variance

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None

      Description

      The following aggregate functions are to be added as part of this JIRA:

      stddev()
      stddev_samp()
      stddev_pop()

      variance()
      var_samp()
      var_pop()
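For reference, a minimal sketch of what the six functions compute. It assumes the usual SQL convention: the _pop variants divide by n, the _samp variants divide by n - 1, and stddev() and variance() behave like the sample versions. That aliasing is an assumption here and is not stated in this issue. The class and method names below are hypothetical and are not part of the patch.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the code from DRILL-556.patch.
final class VarianceDefinitions {

    // Sum of squared deviations from the mean (two-pass, for clarity).
    static double sumSquaredDeviations(double[] xs) {
        double mean = Arrays.stream(xs).average().orElse(Double.NaN);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - mean) * (x - mean);
        }
        return sumSq;
    }

    // var_pop: divide by n.
    static double varPop(double[] xs) {
        return sumSquaredDeviations(xs) / xs.length;
    }

    // var_samp: divide by n - 1 (Bessel's correction); NaN for fewer than 2 values.
    static double varSamp(double[] xs) {
        return sumSquaredDeviations(xs) / (xs.length - 1);
    }

    // stddev_pop and stddev_samp are the square roots of the variances.
    static double stddevPop(double[] xs) {
        return Math.sqrt(varPop(xs));
    }

    static double stddevSamp(double[] xs) {
        return Math.sqrt(varSamp(xs));
    }

    public static void main(String[] args) {
        double[] xs = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(varPop(xs));     // prints 4.0
        System.out.println(stddevPop(xs));  // prints 2.0
        System.out.println(varSamp(xs));    // 32/7, slightly larger than var_pop
        System.out.println(stddevSamp(xs));
    }
}
```

The sample versions are always at least as large as the population versions for the same input, because the divisor is smaller.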

      1. DRILL-556.patch
        61 kB
        Mehant Baid

        Activity

        Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
        Open -> Patch Available | 15m 55s | 1 | Mehant Baid | 23/Apr/14 00:52
        Patch Available -> Resolved | 11d 17h 6m | 1 | Jacques Nadeau | 04/May/14 17:59
        Tony Stevenson made changes -
        Workflow no-reopen-closed, patch-avail, testing [ 12860075 ] Drill workflow [ 12933676 ]
        Jacques Nadeau made changes -
        Fix Version/s 0.4.0 [ 12324963 ]
        ASF GitHub Bot added a comment -

        Github user mehant closed the pull request at:

        https://github.com/apache/incubator-drill/pull/56

        ASF GitHub Bot added a comment -

        Github user mehant commented on the pull request:

        https://github.com/apache/incubator-drill/pull/56#issuecomment-42269971

        merged as eedb4d7c47c0cc021f8c434e6910a8574104531e

        Jake Farrell made changes -
        Workflow no-reopen-closed, patch-avail [ 12857330 ] no-reopen-closed, patch-avail, testing [ 12860075 ]
        Jacques Nadeau made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Jacques Nadeau added a comment -

        merged in eedb4d7

        Mehant Baid made changes -
        Attachment DRILL-556.patch [ 12643131 ]
        Mehant Baid added a comment -

        Depends on DRILL-549.
        The GitHub branch https://github.com/mehant/incubator-drill/commits/function_implementation_phase_2 contains the sequence of patches.

        Mehant Baid made changes -
        Attachment DRILL-556.patch [ 12641375 ]
        ASF GitHub Bot added a comment -

        Github user amansinha100 commented on a diff in the pull request:

        https://github.com/apache/incubator-drill/pull/56#discussion_r12080233

        — Diff: exec/java-exec/src/main/codegen/templates/AggrTypeFunctions3.java —
        @@ -0,0 +1,128 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +<@pp.dropOutputFile />
        +
        +
        +
        +<#list aggrtypes2.aggrtypes as aggrtype>
        +<#if aggrtype.className != "Avg">
        +<@pp.changeOutputFile name="/org/apache/drill/exec/expr/fn/impl/gaggr/${aggrtype.className}Functions.java" />
        +
        +<#include "/@includes/license.ftl" />
        +
        +<#-- A utility class that is used to generate java code for aggr functions such as stddev, variance -->
        +
        +/*
        + * This class is automatically generated from AggrTypeFunctions2.tdd using FreeMarker.
        + */
        +
        +package org.apache.drill.exec.expr.fn.impl.gaggr;
        +
        +import org.apache.drill.exec.expr.DrillAggFunc;
        +import org.apache.drill.exec.expr.annotations.FunctionTemplate;
        +import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
        +import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
        +import org.apache.drill.exec.expr.annotations.Output;
        +import org.apache.drill.exec.expr.annotations.Param;
        +import org.apache.drill.exec.expr.annotations.Workspace;
        +import org.apache.drill.exec.expr.holders.*;
        +import org.apache.drill.exec.record.RecordBatch;
        +
        +@SuppressWarnings("unused")
        +
        +public class ${aggrtype.className}Functions {
        + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${aggrtype.className}Functions.class);
        +
        +<#list aggrtype.types as type>
        +
        +<#if aggrtype.aliasName == "">
        +@FunctionTemplate(name = "${aggrtype.funcName}", scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
        +<#else>
        +@FunctionTemplate(names = {"${aggrtype.funcName}", "${aggrtype.aliasName}"}, scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
        +</#if>
        +
        +public static class ${type.inputType}${aggrtype.className} implements DrillAggFunc{
        +
        + @Param ${type.inputType}Holder in;
        + @Workspace ${type.movingAverageType}Holder avg;
        + @Workspace ${type.movingDeviationType}Holder dev;
        + @Workspace ${type.countRunningType}Holder count;
        + @Output ${type.outputType}Holder out;
        +
        + public void setup(RecordBatch b) {
        + avg = new ${type.movingAverageType}Holder();
        + dev = new ${type.movingDeviationType}Holder();
        + count = new ${type.countRunningType}Holder();
        +
        + // Initialize the workspace variables
        + avg.value = 0;
        + dev.value = 0;
        + count.value = 1;
        + }
        +
        + @Override
        + public void add() {
        + <#if type.inputType?starts_with("Nullable")>
        + sout: {
        + if (in.isSet == 0) {
        + // processing nullable input and the value is null, so don't do anything...
        + break sout;
        + }
        + </#if>
        +
        + // Welford's approach to compute standard deviation
        — End diff –

        Welford's method does the computation online (streaming) and it looks simple, so I am wondering whether there is a catch. It computes the average each time a row is processed rather than once at the end, so we would have to see how it performs.
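A minimal sketch of Welford's online update, to make the trade-off raised in this comment concrete. This is plain Java rather than the FreeMarker template from the patch, and the class WelfordSketch and its field names are hypothetical. The sketch starts its counter at 0 and increments before dividing, which may differ from how the patch arranges its workspace variables. Each row costs one division to update the running mean. In exchange, the update is generally less prone to the cancellation error that a running sum and sum-of-squares approach can show when values are large relative to their spread.

```java
// Hypothetical illustration of Welford's method; not the code from the pull request.
final class WelfordSketch {
    private long count = 0;    // rows seen so far
    private double mean = 0.0; // running mean
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    // Fold one value into the running state (one division per row).
    void add(double x) {
        count++;
        double delta = x - mean;   // deviation from the old mean
        mean += delta / count;     // update the running mean
        m2 += delta * (x - mean);  // old deviation times new deviation
    }

    // Population variance: divide by n; NaN when no rows were seen.
    double varPop() {
        return count > 0 ? m2 / count : Double.NaN;
    }

    // Sample variance: divide by n - 1; NaN for fewer than 2 rows.
    double varSamp() {
        return count > 1 ? m2 / (count - 1) : Double.NaN;
    }

    double stddevPop() {
        return Math.sqrt(varPop());
    }

    double stddevSamp() {
        return Math.sqrt(varSamp());
    }

    public static void main(String[] args) {
        WelfordSketch w = new WelfordSketch();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            w.add(x);
        }
        System.out.println("var_pop    = " + w.varPop());
        System.out.println("var_samp   = " + w.varSamp());
        System.out.println("stddev_pop = " + w.stddevPop());
    }
}
```

The three fields correspond to the three running workspace variables mentioned in the review (average, deviation, count), which is why a single pass over the rows is enough.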

        ASF GitHub Bot added a comment -

        Github user amansinha100 commented on a diff in the pull request:

        https://github.com/apache/incubator-drill/pull/56#discussion_r12080141

        — Diff: exec/java-exec/src/main/codegen/templates/AggrTypeFunctions3.java —
        @@ -0,0 +1,128 @@
        +/**
        + * Licensed to the Apache Software Foundation (ASF) under one
        + * or more contributor license agreements. See the NOTICE file
        + * distributed with this work for additional information
        + * regarding copyright ownership. The ASF licenses this file
        + * to you under the Apache License, Version 2.0 (the
        + * "License"); you may not use this file except in compliance
        + * with the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +<@pp.dropOutputFile />
        +
        +
        +
        +<#list aggrtypes2.aggrtypes as aggrtype>
        +<#if aggrtype.className != "Avg">
        — End diff –

        Instead of checking for exclusion of Avg, it would be better to have a separate AggrTypes3.tdd consisting of the new functions. The general idea was that AggrTypes1 contains aggr functions that have 1 running workspace variable, and AggrTypes2 contains aggr functions that have 2 running workspace variables. Since stddev and variance have 3 running workspace variables, why not put them in their own tdd file?

        Mehant Baid made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Mehant Baid made changes -
        Field Original Value New Value
        Attachment DRILL-556.patch [ 12641375 ]
        ASF GitHub Bot added a comment -

        GitHub user mehant opened a pull request:

        https://github.com/apache/incubator-drill/pull/56

        DRILL-556: Implement aggregate functions.

        The following aggregate functions are added:
        stddev()
        stddev_samp()
        stddev_pop()
        variance()
        var_samp()
        var_pop()

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/mehant/incubator-drill aggregate_functions

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/incubator-drill/pull/56.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #56


        commit e4665ec40a047253077540d07d2610b98b41f4fc
        Author: Mehant Baid <mehantr@gmail.com>
        Date: 2014-04-22T23:42:06Z

        DRILL-556: Implement the following aggregate functions.
        stddev()
        stddev_samp()
        stddev_pop()
        variance()
        var_samp()
        var_pop()


        Mehant Baid created issue -

          People

          • Assignee: Mehant Baid
          • Reporter: Mehant Baid
          • Votes: 0
          • Watchers: 3
