Hive
  1. Hive
  2. HIVE-287

support count(*) and count distinct on multiple columns

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.6.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This change deprecates the GenericUDAFResolver interface in favor of the new GenericUDAFResolver2.

      Description

      The following query does not work:

      select count(distinct col1, col2) from Tbl

      1. HIVE-287-1.patch
        34 kB
        Arvind Prabhakar
      2. HIVE-287-2.patch
        36 kB
        Arvind Prabhakar
      3. HIVE-287-3.patch
        34 kB
        Arvind Prabhakar
      4. HIVE-287-4.patch
        41 kB
        Arvind Prabhakar
      5. HIVE-287-5-branch-0.6.patch
        53 kB
        Arvind Prabhakar
      6. HIVE-287-5-trunk.patch
        53 kB
        Arvind Prabhakar
      7. HIVE-287-6-branch-0.6.patch
        55 kB
        Arvind Prabhakar
      8. HIVE-287-6-trunk.patch
        55 kB
        Arvind Prabhakar

        Issue Links

          Activity

          Hide
          Jeff Hammerbacher added a comment -

          Okay I spoke too soon. Turns out the Apache wiki just takes about 5 minutes to save an edit; the disclaimers are there now.

          Show
          Jeff Hammerbacher added a comment - Okay I spoke too soon. Turns out the Apache wiki just takes about 5 minutes to save an edit; the disclaimers are there now.
          Hide
          Jeff Hammerbacher added a comment -

          We ran into some users who were using older versions of Hive (including the version of Hive distributed with CDH3b2) and were confused by the COUNT in the documentation. I've tried to change the wiki but it does not appear to be saving my changes. If someone gets the chance, I'd love to see a disclaimer like "Note that for versions of Hive which don't include [https://issues.apache.org/jira/browse/HIVE-287], you'll need to use COUNT(1) in place of COUNT." added to the wiki pages that reference COUNT.

          Show
          Jeff Hammerbacher added a comment - We ran into some users who were using older versions of Hive (including the version of Hive distributed with CDH3b2) and were confused by the COUNT in the documentation. I've tried to change the wiki but it does not appear to be saving my changes. If someone gets the chance, I'd love to see a disclaimer like "Note that for versions of Hive which don't include [ https://issues.apache.org/jira/browse/HIVE-287 ], you'll need to use COUNT(1) in place of COUNT ." added to the wiki pages that reference COUNT .
          Hide
          Arvind Prabhakar added a comment -

          Updated the wiki in the all the above places.

          Show
          Arvind Prabhakar added a comment - Updated the wiki in the all the above places.
          Hide
          John Sichi added a comment -
          Show
          John Sichi added a comment - Searching for "count" in wiki, I think examples in these pages should change to count : http://wiki.apache.org/hadoop/Hive/LanguageManual/GroupBy http://wiki.apache.org/hadoop/Hive/GettingStarted http://wiki.apache.org/hadoop/Hive/Tutorial
          Hide
          Arvind Prabhakar added a comment -

          @John: I updated the UDAF documentation on the wiki at http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF and also added a short blurb regarding the various interface changes on the UDAF tutorial page http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy#Writing_the_source. Please let me know if there are other places that need to be updated as well.

          Show
          Arvind Prabhakar added a comment - @John: I updated the UDAF documentation on the wiki at http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF and also added a short blurb regarding the various interface changes on the UDAF tutorial page http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy#Writing_the_source . Please let me know if there are other places that need to be updated as well.
          Hide
          John Sichi added a comment -

          Followup created in HIVE-1459. Arvind, this is a good one to work on next.

          Show
          John Sichi added a comment - Followup created in HIVE-1459 . Arvind, this is a good one to work on next.
          Hide
          John Sichi added a comment -

          Arvind, could you update relevant documentation in wiki?

          Besides the SQL reference, Mayank Lahiri is working on some documentation for writing new UDAF's, so the resolver interface can probably go there:

          http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy

          Show
          John Sichi added a comment - Arvind, could you update relevant documentation in wiki? Besides the SQL reference, Mayank Lahiri is working on some documentation for writing new UDAF's, so the resolver interface can probably go there: http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy
          Hide
          John Sichi added a comment -

          Committed to trunk and 0.6. Thanks Arvind!

          Show
          John Sichi added a comment - Committed to trunk and 0.6. Thanks Arvind!
          Hide
          John Sichi added a comment -

          +1. Will commit if tests pass.

          Show
          John Sichi added a comment - +1. Will commit if tests pass.
          Hide
          HBase Review Board added a comment -

          Message from: "John Sichi" <jsichi@facebook.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/290/
          -----------------------------------------------------------

          Review request for Hive Developers.

          Summary
          -------

          (hopefully) final review by JVS

          This addresses bug HIVE-287.
          http://issues.apache.org/jira/browse/HIVE-287

          Diffs


          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/AbstractGenericUDAFResolver.java PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFAverage.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFBridge.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMax.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMin.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFParameterInfo.java PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver2.java PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVariance.java 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/SimpleGenericUDAFParameterInfo.java PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_avg_syntax.q PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_max_syntax.q PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_min_syntax.q PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_std_syntax.q PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_stddev_samp_syntax.q PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_sum_syntax.q PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_var_samp_syntax.q PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_variance_syntax.q PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientpositive/udf_count.q 961987
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_avg_syntax.q.out PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_max_syntax.q.out PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_min_syntax.q.out PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_std_syntax.q.out PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_stddev_samp_syntax.q.out PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_sum_syntax.q.out PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_var_samp_syntax.q.out PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_variance_syntax.q.out PRE-CREATION
          http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientpositive/udf_count.q.out 961987

          Diff: http://review.hbase.org/r/290/diff

          Testing
          -------

          Thanks,

          John

          Show
          HBase Review Board added a comment - Message from: "John Sichi" <jsichi@facebook.com> ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: http://review.hbase.org/r/290/ ----------------------------------------------------------- Review request for Hive Developers. Summary ------- (hopefully) final review by JVS This addresses bug HIVE-287 . http://issues.apache.org/jira/browse/HIVE-287 Diffs http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/AbstractGenericUDAFResolver.java PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFAverage.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFBridge.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMax.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMin.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFParameterInfo.java PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver2.java PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVariance.java 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/SimpleGenericUDAFParameterInfo.java PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_avg_syntax.q PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_max_syntax.q PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_min_syntax.q PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_std_syntax.q PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_stddev_samp_syntax.q PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_sum_syntax.q PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_var_samp_syntax.q PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientnegative/invalid_variance_syntax.q PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/queries/clientpositive/udf_count.q 961987 http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_avg_syntax.q.out PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_max_syntax.q.out PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_min_syntax.q.out PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_std_syntax.q.out PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_stddev_samp_syntax.q.out PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_sum_syntax.q.out PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_var_samp_syntax.q.out PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientnegative/invalid_variance_syntax.q.out PRE-CREATION http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/src/test/results/clientpositive/udf_count.q.out 961987 Diff: http://review.hbase.org/r/290/diff Testing ------- Thanks, John
          Hide
          Arvind Prabhakar added a comment -

          1. Change the comments for the 2 new fields. It's easy for UDAF writers to assume that the UDAF itself needs to handle whether it's distinct or whether it's all columns.

          I updated the javadocs in various places to make this clear.

          2. Deprecate the old interface, and move all existing GenericUDAF to inherit from the new one.

          No changes necessary for this - the previously submitted patch also did it. All existing generic UDAFs now extend from the abstract class that implements the new interface. If you see any problems with that let me know.

          Show
          Arvind Prabhakar added a comment - 1. Change the comments for the 2 new fields. It's easy for UDAF writers to assume that the UDAF itself needs to handle whether it's distinct or whether it's all columns. I updated the javadocs in various places to make this clear. 2. Deprecate the old interface, and move all existing GenericUDAF to inherit from the new one. No changes necessary for this - the previously submitted patch also did it. All existing generic UDAFs now extend from the abstract class that implements the new interface. If you see any problems with that let me know.
          Hide
          Zheng Shao added a comment -

          Talked with John offline also.

          I agree that we can use the new interface going forward. Can you do these also in this patch:
          1. Change the comments for the 2 new fields. It's easy for UDAF writers to assume that the UDAF itself needs to handle whether it's distinct or whether it's all columns.
          2. Deprecate the old interface, and move all existing GenericUDAF to inherit from the new one.

          +  /**
          +   * @return true if the UDAF invocation was qualified with <tt>DISTINCT</tt>
          +   * keyword, false otherwise.
          +   */
          +  boolean isDistinct();
          +
          +  /**
          +   * @return true if the UDAF invocation was done with a wildcard instead of
          +   * explicit parameter list.
          +   */
          +  boolean isAllColumns();
          

          After this patch is in, here is a list of follow-ups. Can you open JIRA for these:

          1. Let UDAF and UDF support * and regex-based column specification
          2. Special-case COUNT because that does not require reading any columns, while MY_UDAF needs all columns.

          Show
          Zheng Shao added a comment - Talked with John offline also. I agree that we can use the new interface going forward. Can you do these also in this patch: 1. Change the comments for the 2 new fields. It's easy for UDAF writers to assume that the UDAF itself needs to handle whether it's distinct or whether it's all columns. 2. Deprecate the old interface, and move all existing GenericUDAF to inherit from the new one. + /** + * @ return true if the UDAF invocation was qualified with <tt>DISTINCT</tt> + * keyword, false otherwise. + */ + boolean isDistinct(); + + /** + * @ return true if the UDAF invocation was done with a wildcard instead of + * explicit parameter list. + */ + boolean isAllColumns(); After this patch is in, here is a list of follow-ups. Can you open JIRA for these: 1. Let UDAF and UDF support * and regex-based column specification 2. Special-case COUNT because that does not require reading any columns, while MY_UDAF needs all columns.
          Hide
          Arvind Prabhakar added a comment -

          I think keeping two different interfaces for UDAFs will lead to confusion in the long run. Thats why the current patch deprecates the old interface in favor of the new one. But if all agree that it is a good idea, then I will go with that.

          Also - can you suggest an alternate name for the new interface?

          Show
          Arvind Prabhakar added a comment - I think keeping two different interfaces for UDAFs will lead to confusion in the long run. Thats why the current patch deprecates the old interface in favor of the new one. But if all agree that it is a good idea, then I will go with that. Also - can you suggest an alternate name for the new interface?
          Hide
          Zheng Shao added a comment -

          The plan looks good to me.

          Just one comment: I think we should change the comment/class name for GenericUDAFResolver2. Let's explicitly say GenericUDAFResolver2 is for UDAFs that want to have control over whether DISTINCT or * should be treated differently. For normal UDAFs, they should still inherite from GenericUDAFResolver.

          Does that sound OK?

          Show
          Zheng Shao added a comment - The plan looks good to me. Just one comment: I think we should change the comment/class name for GenericUDAFResolver2. Let's explicitly say GenericUDAFResolver2 is for UDAFs that want to have control over whether DISTINCT or * should be treated differently. For normal UDAFs, they should still inherite from GenericUDAFResolver. Does that sound OK?
          Hide
          John Sichi added a comment -

          I'll see if I can clear it up with Zheng today, and if so, we can just commit as is; otherwise let's meet.

          Show
          John Sichi added a comment - I'll see if I can clear it up with Zheng today, and if so, we can just commit as is; otherwise let's meet.
          Hide
          Arvind Prabhakar added a comment -

          I vote for a meeting to hash this out face-to-face. I am willing to modify the patch provided we all are in agreement as to how it should be changed. It will be much better use of everyone's time to avoid the numerous deltas to the patch before settling in on the final solution. Please let me know what you think.

          Show
          Arvind Prabhakar added a comment - I vote for a meeting to hash this out face-to-face. I am willing to modify the patch provided we all are in agreement as to how it should be changed. It will be much better use of everyone's time to avoid the numerous deltas to the patch before settling in on the final solution. Please let me know what you think.
          Hide
          John Sichi added a comment -

          Regarding DISTINCT: I agree with Arvind; this information should be provided to the UDAF so that it can reject invocations that don't make sense. Once this validation is passed, the distinct elimination is still implemented generically inside of Hive (upstream of the UDAF).

          Regarding F: let's discriminate three cases.

          COUNT: this really means COUNT(), not COUNT(x,y,z). This is a very important distinction to make from an optimizer perspective, because we want to be able to push down projection to avoid I/O and other processing for columns whose values we will never look at.

          SUM and similar ones: these we should disallow.

          MY_UDAF, or MY_UDAF(t.*): this is similar to Pradeep's case that came up recently on the mailing list, and it needs to expand to MY_UDAF(x,y,z), not MY_UDAF(). I think the patch is currently doing MY_UDAF(), which isn't what he wants.

          My recommendation is that we commit Arvind's patch as is, then create a followup JIRA issue to do what Pradeep is looking for (the expansion of * in the semantic analyzer) for both UDF and UDAF, but with a special case for COUNT. UDAF authors will be able to decide whether or not to reject the star syntax, since in the common case of a UDAF expecting a limited number of parameters, the star won't make sense.

          Show
          John Sichi added a comment - Regarding DISTINCT: I agree with Arvind; this information should be provided to the UDAF so that it can reject invocations that don't make sense. Once this validation is passed, the distinct elimination is still implemented generically inside of Hive (upstream of the UDAF). Regarding F : let's discriminate three cases. COUNT : this really means COUNT(), not COUNT(x,y,z). This is a very important distinction to make from an optimizer perspective, because we want to be able to push down projection to avoid I/O and other processing for columns whose values we will never look at. SUM and similar ones: these we should disallow. MY_UDAF , or MY_UDAF(t.*): this is similar to Pradeep's case that came up recently on the mailing list, and it needs to expand to MY_UDAF(x,y,z), not MY_UDAF(). I think the patch is currently doing MY_UDAF(), which isn't what he wants. My recommendation is that we commit Arvind's patch as is, then create a followup JIRA issue to do what Pradeep is looking for (the expansion of * in the semantic analyzer) for both UDF and UDAF, but with a special case for COUNT. UDAF authors will be able to decide whether or not to reject the star syntax, since in the common case of a UDAF expecting a limited number of parameters, the star won't make sense.
          Hide
          Arvind Prabhakar added a comment -

          @Zheng: Welcome to the party.

          Why do we put the DISTINCT in the information? DISTINCT is currently done by the framework, instead of individual UDAF. This is good because the logic of removing duplicates are common for all UDAFs. We do support SUM(DISTINCT val).

          Providing the information in the parameter specification is not the same as enforcing its interpretation. This is provided primarily to ensure that UDAFs that rely on this information can make appropriate decisions. For example, we wanted to disallow the invocation COUNT( EXPR1, EXPR2 ...) in favor of COUNT(DISTINCT EXPR1, EXPR2 ...). Without this information, the count UDAF will not be able to enforce the later syntax.

          Why do we special-case ""? It seems to me that "" is just a short-cut. Hive already supports regex-based multi-column specification, so that we can say `abc.*` for all columns with name starting with abc. The compiler should just expand * and give all the columns to the UDAF.

          If you wish to use * as a regular expression, you would have to quote it as a string - COUNT('*'). This is different from the invocation as specified in SQL which treats * as a terminal symbol. So if it is OK to deviate from the standard representation, the user can easily use the quoted string representation to achieve the effect similar to COUNT(col1, col2 ..). The semantics of this should be more like COUNT(DISTINCT EXPR1, EXPR2 ...) as opposed to COUNT(*).

          Since COUNT(*) is a special-case in the SQL standard (COUNT(*) is different from COUNT(col) even if the table has a single column col), I think we should just special-case that and replace that with count(1) at some place.

          Are you suggesting that we allow the grammar to express COUNT(*) syntax, but in the lexical analysis stage turn it into a COUNT(1)? I can see how that may work - but personally I am not a fan of such an approach.

          Show
          Arvind Prabhakar added a comment - @Zheng: Welcome to the party. Why do we put the DISTINCT in the information? DISTINCT is currently done by the framework, instead of individual UDAF. This is good because the logic of removing duplicates are common for all UDAFs. We do support SUM(DISTINCT val). Providing the information in the parameter specification is not the same as enforcing its interpretation. This is provided primarily to ensure that UDAFs that rely on this information can make appropriate decisions. For example, we wanted to disallow the invocation COUNT( EXPR1, EXPR2 ...) in favor of COUNT( DISTINCT EXPR1, EXPR2 ...) . Without this information, the count UDAF will not be able to enforce the later syntax. Why do we special-case ""? It seems to me that "" is just a short-cut. Hive already supports regex-based multi-column specification, so that we can say `abc.*` for all columns with name starting with abc. The compiler should just expand * and give all the columns to the UDAF. If you wish to use * as a regular expression, you would have to quote it as a string - COUNT('*') . This is different from the invocation as specified in SQL which treats * as a terminal symbol. So if it is OK to deviate from the standard representation, the user can easily use the quoted string representation to achieve the effect similar to COUNT(col1, col2 ..) . The semantics of this should be more like COUNT(DISTINCT EXPR1, EXPR2 ...) as opposed to COUNT(*) . Since COUNT(*) is a special-case in the SQL standard (COUNT(*) is different from COUNT(col) even if the table has a single column col), I think we should just special-case that and replace that with count(1) at some place. Are you suggesting that we allow the grammar to express COUNT(*) syntax, but in the lexical analysis stage turn it into a COUNT(1) ? I can see how that may work - but personally I am not a fan of such an approach.
          Hide
          Zheng Shao added a comment -

          Hi Arvind, sorry for coming late for the party. I have 2 questions on the new UDAF2 interface:

          1. Why do we put the DISTINCT in the information? DISTINCT is currently done by the framework, instead of individual UDAF.
          This is good because the logic of removing duplicates are common for all UDAFs. We do support SUM(DISTINCT val).

          2. Why do we special-case ""? It seems to me that "" is just a short-cut. Hive already supports regex-based multi-column specification, so that we can say `abc.*` for all columns with name starting with abc. The compiler should just expand * and give all the columns to the UDAF.

          Since COUNT is a special-case in the SQL standard (COUNT is different from COUNT(col) even if the table has a single column col), I think we should just special-case that and replace that with count(1) at some place.

          What do you think?

          Show
          Zheng Shao added a comment - Hi Arvind, sorry for coming late for the party. I have 2 questions on the new UDAF2 interface: 1. Why do we put the DISTINCT in the information? DISTINCT is currently done by the framework, instead of individual UDAF. This is good because the logic of removing duplicates are common for all UDAFs. We do support SUM(DISTINCT val). 2. Why do we special-case " "? It seems to me that " " is just a short-cut. Hive already supports regex-based multi-column specification, so that we can say `abc.*` for all columns with name starting with abc. The compiler should just expand * and give all the columns to the UDAF. Since COUNT is a special-case in the SQL standard (COUNT is different from COUNT(col) even if the table has a single column col), I think we should just special-case that and replace that with count(1) at some place. What do you think?
          Hide
          HBase Review Board added a comment -

          Message from: "Carl Steinbach" <carl@cloudera.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.hbase.org/r/275/
          -----------------------------------------------------------

          (Updated 2010-07-06 19:24:53.267440)

          Review request for Hive Developers and John Sichi.

          Changes
          -------

          Updated 'Bugs' field in review request.

          Summary
          -------

          This is the patch revision 5 for HIVE-287. Please see the Jira comments for full description.

          This addresses bug HIVE-287.
          http://issues.apache.org/jira/browse/HIVE-287

          Diffs


          ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 4109103
          ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 7e6e63e
          ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 252b89d
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/AbstractGenericUDAFResolver.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFAverage.java 0ef4734
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFBridge.java 916eb33
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java 0054664
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java 8d22ef1
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMax.java 6785687
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMin.java 051f3a1
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFParameterInfo.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver.java 9888b52
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver2.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java ce97afd
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVariance.java 26dc84c
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/SimpleGenericUDAFParameterInfo.java PRE-CREATION
          ql/src/test/queries/clientnegative/invalid_avg_syntax.q PRE-CREATION
          ql/src/test/queries/clientnegative/invalid_max_syntax.q PRE-CREATION
          ql/src/test/queries/clientnegative/invalid_min_syntax.q PRE-CREATION
          ql/src/test/queries/clientnegative/invalid_std_syntax.q PRE-CREATION
          ql/src/test/queries/clientnegative/invalid_stddev_samp_syntax.q PRE-CREATION
          ql/src/test/queries/clientnegative/invalid_sum_syntax.q PRE-CREATION
          ql/src/test/queries/clientnegative/invalid_var_samp_syntax.q PRE-CREATION
          ql/src/test/queries/clientnegative/invalid_variance_syntax.q PRE-CREATION
          ql/src/test/queries/clientpositive/udf_count.q 2d1510f
          ql/src/test/results/clientnegative/invalid_avg_syntax.q.out PRE-CREATION
          ql/src/test/results/clientnegative/invalid_max_syntax.q.out PRE-CREATION
          ql/src/test/results/clientnegative/invalid_min_syntax.q.out PRE-CREATION
          ql/src/test/results/clientnegative/invalid_std_syntax.q.out PRE-CREATION
          ql/src/test/results/clientnegative/invalid_stddev_samp_syntax.q.out PRE-CREATION
          ql/src/test/results/clientnegative/invalid_sum_syntax.q.out PRE-CREATION
          ql/src/test/results/clientnegative/invalid_var_samp_syntax.q.out PRE-CREATION
          ql/src/test/results/clientnegative/invalid_variance_syntax.q.out PRE-CREATION
          ql/src/test/results/clientpositive/udf_count.q.out 2adffd8

          Diff: http://review.hbase.org/r/275/diff

          Testing
          -------

          Ran all tests on trunk.

          Thanks,

          Arvind

          Show
          HBase Review Board added a comment - Message from: "Carl Steinbach" <carl@cloudera.com> ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: http://review.hbase.org/r/275/ ----------------------------------------------------------- (Updated 2010-07-06 19:24:53.267440) Review request for Hive Developers and John Sichi. Changes ------- Updated 'Bugs' field in review request. Summary ------- This is the patch revision 5 for HIVE-287 . Please see the Jira comments for full description. This addresses bug HIVE-287 . http://issues.apache.org/jira/browse/HIVE-287 Diffs ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 4109103 ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 7e6e63e ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 252b89d ql/src/java/org/apache/hadoop/hive/ql/udf/generic/AbstractGenericUDAFResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFAverage.java 0ef4734 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFBridge.java 916eb33 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java 0054664 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java 8d22ef1 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMax.java 6785687 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMin.java 051f3a1 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFParameterInfo.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver.java 9888b52 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFResolver2.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java ce97afd ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVariance.java 26dc84c ql/src/java/org/apache/hadoop/hive/ql/udf/generic/SimpleGenericUDAFParameterInfo.java PRE-CREATION ql/src/test/queries/clientnegative/invalid_avg_syntax.q PRE-CREATION ql/src/test/queries/clientnegative/invalid_max_syntax.q PRE-CREATION ql/src/test/queries/clientnegative/invalid_min_syntax.q PRE-CREATION ql/src/test/queries/clientnegative/invalid_std_syntax.q PRE-CREATION ql/src/test/queries/clientnegative/invalid_stddev_samp_syntax.q PRE-CREATION ql/src/test/queries/clientnegative/invalid_sum_syntax.q PRE-CREATION ql/src/test/queries/clientnegative/invalid_var_samp_syntax.q PRE-CREATION ql/src/test/queries/clientnegative/invalid_variance_syntax.q PRE-CREATION ql/src/test/queries/clientpositive/udf_count.q 2d1510f ql/src/test/results/clientnegative/invalid_avg_syntax.q.out PRE-CREATION ql/src/test/results/clientnegative/invalid_max_syntax.q.out PRE-CREATION ql/src/test/results/clientnegative/invalid_min_syntax.q.out PRE-CREATION ql/src/test/results/clientnegative/invalid_std_syntax.q.out PRE-CREATION ql/src/test/results/clientnegative/invalid_stddev_samp_syntax.q.out PRE-CREATION ql/src/test/results/clientnegative/invalid_sum_syntax.q.out PRE-CREATION ql/src/test/results/clientnegative/invalid_var_samp_syntax.q.out PRE-CREATION ql/src/test/results/clientnegative/invalid_variance_syntax.q.out PRE-CREATION ql/src/test/results/clientpositive/udf_count.q.out 2adffd8 Diff: http://review.hbase.org/r/275/diff Testing ------- Ran all tests on trunk. Thanks, Arvind
          Hide
          Arvind Prabhakar added a comment -

          Review board review posted:

          http://review.hbase.org/r/275/

          Show
          Arvind Prabhakar added a comment - Review board review posted: http://review.hbase.org/r/275/
          Hide
          Arvind Prabhakar added a comment -

          Uploaded patch for trunk and branch-0.6. Ran all the tests on trunk and did spot testing on branch-0.6.

          Changes from Previous patch:

          • Modified the implementation of AbstractGenericUDAFResolver to raise an exception when invoked with the UDAF(STAR) syntax.
          • Added negative test cases to assert that the current UDAFs present in the code other than COUNT do not accept the UDAF(STAR) syntax.
          • Added EXPLAIN directives for the queries run in udf_count.q test file.

          Will attempt to post the patch on review board as well.

          Show
          Arvind Prabhakar added a comment - Uploaded patch for trunk and branch-0.6. Ran all the tests on trunk and did spot testing on branch-0.6. Changes from Previous patch: Modified the implementation of AbstractGenericUDAFResolver to raise an exception when invoked with the UDAF(STAR) syntax. Added negative test cases to assert that the current UDAFs present in the code other than COUNT do not accept the UDAF(STAR) syntax. Added EXPLAIN directives for the queries run in udf_count.q test file. Will attempt to post the patch on review board as well.
          Hide
          John Sichi added a comment -

          @Ashish: given where we are currently, the keyword concern is valid.

          But, I think we might be doing our users a disservice in the long run letting them use standard SQL reserved keywords as unquoted identifiers. This just causes interoperability confusion since most other systems don't support this, and it makes data portability harder.

          (I'm not proposing any immediate action on this, just pointing it out.)

          Show
          John Sichi added a comment - @Ashish: given where we are currently, the keyword concern is valid. But, I think we might be doing our users a disservice in the long run letting them use standard SQL reserved keywords as unquoted identifiers. This just causes interoperability confusion since most other systems don't support this, and it makes data portability harder. (I'm not proposing any immediate action on this, just pointing it out.)
          Hide
          John Sichi added a comment -

          A couple of DBMS's I just tried actually do use something equivalent to that grammar snippet I quoted.

          mysql> select sum(*) from TBLS;
          ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '*) from TBLS' at line 1
          
          0: jdbc:luciddb:> select sum(*) from sys_root.dba_tables;  
          Error: At line 1, column 12: Unknown identifier '*' (state=,code=0)
          

          But I'm fine with leaving the grammar as B and doing the check inside of SemanticAnalyzer instead as long as your negative tests demonstrate that the diagnostics are good if a user tries something like sum.

          However, for this part:
          "Since you feel strongly about not supporting this syntax by default for any function, I can perhaps modify the AbstractGenericUDAFResolver class to raise an exception if invoked with this syntax. That way, only the functions that choose to overwrite that behavior will be able to support it."

          It's not just a matter of not supporting it by default. I don't think we should support it at all, so let's prevent inside of SemanticAnalyzer and then not expose it at the resolver level at all. If in early days, the SQL people had just done the smart thing and made the "count all" expression COUNT() instead of COUNT, we wouldn't even be having this discussion.

          Show
          John Sichi added a comment - A couple of DBMS's I just tried actually do use something equivalent to that grammar snippet I quoted. mysql> select sum(*) from TBLS; ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '*) from TBLS' at line 1 0: jdbc:luciddb:> select sum(*) from sys_root.dba_tables; Error: At line 1, column 12: Unknown identifier '*' (state=,code=0) But I'm fine with leaving the grammar as B and doing the check inside of SemanticAnalyzer instead as long as your negative tests demonstrate that the diagnostics are good if a user tries something like sum . However, for this part: "Since you feel strongly about not supporting this syntax by default for any function, I can perhaps modify the AbstractGenericUDAFResolver class to raise an exception if invoked with this syntax. That way, only the functions that choose to overwrite that behavior will be able to support it." It's not just a matter of not supporting it by default. I don't think we should support it at all, so let's prevent inside of SemanticAnalyzer and then not expose it at the resolver level at all. If in early days, the SQL people had just done the smart thing and made the "count all" expression COUNT() instead of COUNT , we wouldn't even be having this discussion.
          Hide
          Ashish Thusoo added a comment -

          @John

          Another disadvantage for doing C that I can think of is the fact that count would become a keyword and then any column names would have to be quoted. Not a big deal but just something that would be a side effect of going with C.

          Show
          Ashish Thusoo added a comment - @John Another disadvantage for doing C that I can think of is the fact that count would become a keyword and then any column names would have to be quoted. Not a big deal but just something that would be a side effect of going with C.
          Hide
          Arvind Prabhakar added a comment -

          Thanks for the explanation John. The SQL BNF that you pointed out is the normative SQL specification. I do not think any SQL implementations use this grammar though. The parallel is that of an interface and its implementation. While the interface can be short and precise, the implementations may choose to delegate interface methods to other implementation specific methods. Similarly, most databases deal with their own SQL grammar that is compliant with the SQL standard at specific levels.

          More to the point in Hive - my key concern is that by modifying the grammar to make an exception for COUNT, we will be introducing a brittle coupling between the the parser and semantic analyzer. Right now the count aggregate function is treated like any other function and is thus part of the general framework. By making this change, we will be modifying it to be specifically associated from with the grammar directives.

          This is the current function definition in Hive QL grammar (A):

          -->[ functionName ]-->[ LPAREN ]--+-->[ KW_DISTINCT ]--+--+--+-->[ expression ]--+--+-->[ RPAREN ]-->
                                            |                    |  |  |                   |  |
                                            +----------->--------+  |  +------[ COMMA ]<---+  |
                                                                    |                         |
                                                                    +------------->-----------+
          

          The patch that I have supplied already on this Jira modifies this definition as follows (B):

          -->[ functionName ]-->[ LPAREN ]--+------------------------>[ STAR ]----------------------+-->[ RPAREN ]-->
                                            |                                                       |
                                            +--+-->[ KW_DISTINCT ]--+--+--+-->[ expression ]--+--+--+
                                               |                    |  |  |                   |  |
                                               +----------->--------+  |  +------[ COMMA ]<---+  |
                                                                       |                         |
                                                                       +------------->-----------+
          

          If I were to modify the grammar to make an exception for COUNT it will likely be changed to something like this (C):

          --+---------------------------------->[ KW_COUNT ]-->[ LPAREN ]-->[ STAR ]-->[ RPAREN ]----------------+-->
            |                                                                                                    |
            +-->[ functionName ]-->[ LPAREN ]--+--+-->[ KW_DISTINCT ]--+--+-->[ expression ]--+--+-->[ RPAREN ]--+
                                               |  |                    |  |                   |  |
                                               |  +----------->--------+  +----[ COMMA ]<-----+  |
                                               |                                                 |
                                               +------------------------->-----------------------+
          
          

          Consider the C approach closely: The production that matches a COUNT invocation can be directly matched via the top branch using KW_COUNT token, or it could follow the branch below where functionName could match COUNT. On the semantic analyzer side, it makes the matching logic more complex and less intuitive since now the COUNT can be invoked via two branch conditions. For example - there would be one invocation that would directly delegate to the COUNT aggregate function, whereas another that will use the current resolver mechanism to invoke it.

          Instead, the approach B keeps the grammar consistent with the regular function invocation. It does not favor any one function over the other and simply establishes matching rules for function production. That way, the call is then delegated to the semantic analyzer which in turn matches the appropriate handling function based on the name and parameter type using the generic resolver mechanism without regard to what function is being invoked.

          The changes supplied in the current patch also allow individual function handlers to decide if they would like to support the functionName(STAR) syntax. Since you feel strongly about not supporting this syntax by default for any function, I can perhaps modify the AbstractGenericUDAFResolver class to raise an exception if invoked with this syntax. That way, only the functions that choose to overwrite that behavior will be able to support it. Also as you can see in the syntax diagram for (B), there is no production that will match things like functionName(DISTINCT STAR) or functionName(STAR, EXPR) etc.

          That said, I am open to going either way. I just wanted to put my point across so that you could consider it and make an informed decision. Let me know what you think.

          Show
          Arvind Prabhakar added a comment - Thanks for the explanation John. The SQL BNF that you pointed out is the normative SQL specification. I do not think any SQL implementations use this grammar though. The parallel is that of an interface and its implementation. While the interface can be short and precise, the implementations may choose to delegate interface methods to other implementation specific methods. Similarly, most databases deal with their own SQL grammar that is compliant with the SQL standard at specific levels. More to the point in Hive - my key concern is that by modifying the grammar to make an exception for COUNT , we will be introducing a brittle coupling between the the parser and semantic analyzer. Right now the count aggregate function is treated like any other function and is thus part of the general framework. By making this change, we will be modifying it to be specifically associated from with the grammar directives. This is the current function definition in Hive QL grammar ( A ): -->[ functionName ]-->[ LPAREN ]--+-->[ KW_DISTINCT ]--+--+--+-->[ expression ]--+--+-->[ RPAREN ]--> | | | | | | +----------->--------+ | +------[ COMMA ]<---+ | | | +------------->-----------+ The patch that I have supplied already on this Jira modifies this definition as follows ( B ): -->[ functionName ]-->[ LPAREN ]--+------------------------>[ STAR ]----------------------+-->[ RPAREN ]--> | | +--+-->[ KW_DISTINCT ]--+--+--+-->[ expression ]--+--+--+ | | | | | | +----------->--------+ | +------[ COMMA ]<---+ | | | +------------->-----------+ If I were to modify the grammar to make an exception for COUNT it will likely be changed to something like this ( C ): --+---------------------------------->[ KW_COUNT ]-->[ LPAREN ]-->[ STAR ]-->[ RPAREN ]----------------+--> | | +-->[ functionName ]-->[ LPAREN ]--+--+-->[ KW_DISTINCT ]--+--+-->[ expression ]--+--+-->[ RPAREN ]--+ | | | | | | | +----------->--------+ +----[ COMMA ]<-----+ | | | +------------------------->-----------------------+ Consider the C approach closely: The production that matches a COUNT invocation can be directly matched via the top branch using KW_COUNT token, or it could follow the branch below where functionName could match COUNT . On the semantic analyzer side, it makes the matching logic more complex and less intuitive since now the COUNT can be invoked via two branch conditions. For example - there would be one invocation that would directly delegate to the COUNT aggregate function, whereas another that will use the current resolver mechanism to invoke it. Instead, the approach B keeps the grammar consistent with the regular function invocation. It does not favor any one function over the other and simply establishes matching rules for function production. That way, the call is then delegated to the semantic analyzer which in turn matches the appropriate handling function based on the name and parameter type using the generic resolver mechanism without regard to what function is being invoked. The changes supplied in the current patch also allow individual function handlers to decide if they would like to support the functionName(STAR) syntax. Since you feel strongly about not supporting this syntax by default for any function, I can perhaps modify the AbstractGenericUDAFResolver class to raise an exception if invoked with this syntax. That way, only the functions that choose to overwrite that behavior will be able to support it. Also as you can see in the syntax diagram for (B), there is no production that will match things like functionName(DISTINCT STAR) or functionName(STAR, EXPR) etc. That said, I am open to going either way. I just wanted to put my point across so that you could consider it and make an informed decision. Let me know what you think.
          Hide
          John Sichi added a comment - - edited

          Yes, if you can do get it to work in the grammar itself (via a choice in the existing "function" rule), that would be best. If it it's not possible to do it there for some reason, then semantic analyzer.

          Could you explain what you mean regarding impact? Since Hive doesn't support COUNT before your patch, the only impact is on the rest of your patch, right?

          BTW, here's the relevant BNF from my copy of the SQL:2003 standard (ISO/IEC 9075-2:2003 part 2 section 10.9), omitting some SQL/OLAP stuff such as <filter clause>:

          <aggregate function > ::=
          COUNT <left paren> <asterisk> <right paren>
          | <general set function>
          
          <general set function> ::=
          <set function type> <left paren> [ <set quantifier> ] <value expression> <right paren>
          
          <set function type> ::=
          AVG |
          MAX |
          MIN |
          SUM |
          ...
          
          <set quantifier> ::= DISTINCT | ALL
          

          As you can see, they make a special case for COUNT to allow for star there alone, and they don't allow COUNT(DISTINCT *).

          Show
          John Sichi added a comment - - edited Yes, if you can do get it to work in the grammar itself (via a choice in the existing "function" rule), that would be best. If it it's not possible to do it there for some reason, then semantic analyzer. Could you explain what you mean regarding impact? Since Hive doesn't support COUNT before your patch, the only impact is on the rest of your patch, right? BTW, here's the relevant BNF from my copy of the SQL:2003 standard (ISO/IEC 9075-2:2003 part 2 section 10.9), omitting some SQL/OLAP stuff such as <filter clause>: <aggregate function > ::= COUNT <left paren> <asterisk> <right paren> | <general set function> <general set function> ::= <set function type> <left paren> [ <set quantifier> ] <value expression> <right paren> <set function type> ::= AVG | MAX | MIN | SUM | ... <set quantifier> ::= DISTINCT | ALL As you can see, they make a special case for COUNT to allow for star there alone, and they don't allow COUNT(DISTINCT *).
          Hide
          Arvind Prabhakar added a comment -

          @John - are you suggesting that the grammar be updated to restrict single star argument with the specific function COUNT? If not in the grammar where else do you think these restrictions should be coded.

          In either case, what other subsystems you think will be impacted by this change and what do you suggest should be the downstream changes to accomodate this?

          p.s. I ask these questions to best utilize both our time and reduce the number of back/froths to the extent possible.

          Show
          Arvind Prabhakar added a comment - @John - are you suggesting that the grammar be updated to restrict single star argument with the specific function COUNT ? If not in the grammar where else do you think these restrictions should be coded. In either case, what other subsystems you think will be impacted by this change and what do you suggest should be the downstream changes to accomodate this? p.s. I ask these questions to best utilize both our time and reduce the number of back/froths to the extent possible.
          Hide
          John Sichi added a comment -

          Sorry, my commit for HIVE-1387 just introduced a new conflict, so one more rebase is required.

          I still don't think that we should allow anything other than COUNT in the parser. SUM should be disallowed, as should COUNT(DISTINCT *), as should MY_UDF. They are non-standard, and I think they would just cause confusion since COUNT ignores the column values rather than doing anything with them. Together with preventing these, add negative tests to verify that they are rejected.

          Given that, we can get rid of the isAllColumns info.

          Also, can you add an EXPLAIN for the COUNT query in the test?

          After that, I think we'll be good to go.

          Show
          John Sichi added a comment - Sorry, my commit for HIVE-1387 just introduced a new conflict, so one more rebase is required. I still don't think that we should allow anything other than COUNT in the parser. SUM should be disallowed, as should COUNT(DISTINCT *), as should MY_UDF . They are non-standard, and I think they would just cause confusion since COUNT ignores the column values rather than doing anything with them. Together with preventing these, add negative tests to verify that they are rejected. Given that, we can get rid of the isAllColumns info. Also, can you add an EXPLAIN for the COUNT query in the test? After that, I think we'll be good to go.
          Hide
          Arvind Prabhakar added a comment -

          @John: Can you please take a look at the updated patch? Let me know if you have any feedback for further tweaking this change as necessary.

          Show
          Arvind Prabhakar added a comment - @John: Can you please take a look at the updated patch? Let me know if you have any feedback for further tweaking this change as necessary.
          Hide
          Arvind Prabhakar added a comment -

          applies cleanly on trunk and branch-0.6

          Show
          Arvind Prabhakar added a comment - applies cleanly on trunk and branch-0.6
          Hide
          John Sichi added a comment -

          The object creation and method call here are negligible since this is in the semantic analyzer context (not an execution codepath).

          Show
          John Sichi added a comment - The object creation and method call here are negligible since this is in the semantic analyzer context (not an execution codepath).
          Hide
          Arvind Prabhakar added a comment -

          @John: I agree with your assessment above. Regarding the count, my earlier comment was not to imply that there exists a UDAF today, but that it might exist in the future. More importantly though, using an empty parameter list as an indicator for * would blur the distinction between UDAF vs UDAF() invocation. This is one way of many perhaps where parameter overloading could lead to confusion and hard to understand code.

          I think introducing GenericUDAFResolver2 interface is a great idea. I also like the idea of using a call back for decoupling the invocation from parameter list but am concerned that this could lead to perhaps redundant method call and object creation. I am not sure if that would add to any significant performance penalty in the long run or not.

          I would love to know what the opinion of others interested in this issue is regarding this route. If all agree that adding a new interface with callback for parameter discovery is acceptable, I can start working on that patch.

          Show
          Arvind Prabhakar added a comment - @John: I agree with your assessment above. Regarding the count , my earlier comment was not to imply that there exists a UDAF today, but that it might exist in the future. More importantly though, using an empty parameter list as an indicator for * would blur the distinction between UDAF vs UDAF() invocation. This is one way of many perhaps where parameter overloading could lead to confusion and hard to understand code. I think introducing GenericUDAFResolver2 interface is a great idea. I also like the idea of using a call back for decoupling the invocation from parameter list but am concerned that this could lead to perhaps redundant method call and object creation. I am not sure if that would add to any significant performance penalty in the long run or not. I would love to know what the opinion of others interested in this issue is regarding this route. If all agree that adding a new interface with callback for parameter discovery is acceptable, I can start working on that patch.
          Hide
          John Sichi added a comment -

          For DISTINCT: we can check the function invocation itself (during semantic analysis) by calling supportsDistinct() immediately after instantiating the GenericUDAFEvaluator in SemanticAnalyzer. This allows strict validation to be performed. Or make the method name checkDistinct and allow the UDAF to throw the exception itself. But I agree that in this case it would be cleaner to extend the interface, so I'm fine if we go ahead with that in a non-breaking fashion.

          For COUNT: if you think about it, COUNT really means "ignore all columns" not "count all columns". So I think an empty array actually makes a lot of sense here. Can you think of a case where UDAF even makes sense, where UDAF != COUNT? If you don't have access to any per-row data, what can you do other than count it? I'd say we should actually disallow * for anything but COUNT, per the SQL standard.

          I like your approach to keeping compatibility via instanceof, so if the decision ends up being to add the extra parameters, then we should definitely use that approach. However, extension points should always be interfaces (not abstract classes) to allow for stuff like dynamic proxies. So we would need to add a new interface GenericUDAFResolver2 (extends GenericUDAFResolver) with the new method, and make AbstractGenericUDAFResolver implement both.

          Interface evolution is never pretty, but there is an interface design pattern which avoids this particular problem. Imagine if originally we had defined a GenericUDAFResolverInput class inside of Hive itself, with a method getParameters() returning TypeInfo []. HIve would instantiate this and pass an input object into getEvaluator, and the evaluator would call input.getParameters(). This would have allowed us to add a boolean isDistinct() method to GenericUDAFResolverInput without breaking anything (source or binary) and without needing to add a new interface; old plugins would not know about isDistinct() so they wouldn't call it, and new ones could.

          I would argue that if we're going to go to the trouble of adding GenericUDAFResolver2, then we should build the pattern above into it as well in case we need further evolution later on.

          p.s. I'm really glad you're working on this one...every few days I try a count against Hive accidentally and then kick myself.

          Show
          John Sichi added a comment - For DISTINCT: we can check the function invocation itself (during semantic analysis) by calling supportsDistinct() immediately after instantiating the GenericUDAFEvaluator in SemanticAnalyzer. This allows strict validation to be performed. Or make the method name checkDistinct and allow the UDAF to throw the exception itself. But I agree that in this case it would be cleaner to extend the interface, so I'm fine if we go ahead with that in a non-breaking fashion. For COUNT : if you think about it, COUNT really means "ignore all columns" not "count all columns". So I think an empty array actually makes a lot of sense here. Can you think of a case where UDAF even makes sense, where UDAF != COUNT? If you don't have access to any per-row data, what can you do other than count it? I'd say we should actually disallow * for anything but COUNT, per the SQL standard. I like your approach to keeping compatibility via instanceof, so if the decision ends up being to add the extra parameters, then we should definitely use that approach. However, extension points should always be interfaces (not abstract classes) to allow for stuff like dynamic proxies. So we would need to add a new interface GenericUDAFResolver2 (extends GenericUDAFResolver) with the new method, and make AbstractGenericUDAFResolver implement both. Interface evolution is never pretty, but there is an interface design pattern which avoids this particular problem. Imagine if originally we had defined a GenericUDAFResolverInput class inside of Hive itself, with a method getParameters() returning TypeInfo []. HIve would instantiate this and pass an input object into getEvaluator, and the evaluator would call input.getParameters(). This would have allowed us to add a boolean isDistinct() method to GenericUDAFResolverInput without breaking anything (source or binary) and without needing to add a new interface; old plugins would not know about isDistinct() so they wouldn't call it, and new ones could. I would argue that if we're going to go to the trouble of adding GenericUDAFResolver2, then we should build the pattern above into it as well in case we need further evolution later on. p.s. I'm really glad you're working on this one...every few days I try a count against Hive accidentally and then kick myself.
          Hide
          Arvind Prabhakar added a comment -

          @John: Thanks for reviewing this change. I have some follow-up comments and suggestions:

          isDistinct: this doesn't actually modify the choice of evaluator implementation at all, since the actual duplicate elimination takes place upstream of the UDAF invocation. So instead of adding this parameter, can we instead add a new method supportsDistinct() on GenericUDAFEvaluator?

          While the evaluation may be happening upstream, I was concerned that it does not exclude the cases where this information is relevant to the function invocation itself. For example, the implementation of count requires that if there is a valid argument list, it must be qualified with DISTINCT.

          isAllColumns: COUNT is probably the only function which is ever even going to care about this one. Couldn't we just use an empty array of TypeInfo to indicate all columns?

          I had a similar idea, but after some consideration opted for a simpler design. I felt that overloading arguments to indicate special cases might lead to confusion and eventual problem when a use-case emerges that invalidates this assumption.

          I do agree with your point that it will be good to stay compatible if possible. One way to do it would be as follows:

          1. Revert the GenericUDAFResolver to its previous state but make the interface deprecated in favor of the abstract base class.
          2. Push the newly introduced method into AbstractGenericUDAFResolver implementation.
          3. Modify FunctionRegistry.getGenericUDAFEvaluator() method to test the resolver instance to be type compatible with AbstractGenericUDAFResolver and if so, invoke the new method. Otherwise revert to the old mechanism.

          What do you think about this approach?

          Show
          Arvind Prabhakar added a comment - @John: Thanks for reviewing this change. I have some follow-up comments and suggestions: isDistinct: this doesn't actually modify the choice of evaluator implementation at all, since the actual duplicate elimination takes place upstream of the UDAF invocation. So instead of adding this parameter, can we instead add a new method supportsDistinct() on GenericUDAFEvaluator? While the evaluation may be happening upstream, I was concerned that it does not exclude the cases where this information is relevant to the function invocation itself. For example, the implementation of count requires that if there is a valid argument list, it must be qualified with DISTINCT . isAllColumns: COUNT is probably the only function which is ever even going to care about this one. Couldn't we just use an empty array of TypeInfo to indicate all columns? I had a similar idea, but after some consideration opted for a simpler design. I felt that overloading arguments to indicate special cases might lead to confusion and eventual problem when a use-case emerges that invalidates this assumption. I do agree with your point that it will be good to stay compatible if possible. One way to do it would be as follows: Revert the GenericUDAFResolver to its previous state but make the interface deprecated in favor of the abstract base class. Push the newly introduced method into AbstractGenericUDAFResolver implementation. Modify FunctionRegistry.getGenericUDAFEvaluator() method to test the resolver instance to be type compatible with AbstractGenericUDAFResolver and if so, invoke the new method. Otherwise revert to the old mechanism. What do you think about this approach?
          Hide
          John Sichi added a comment -

          Sorry to chime in late on this one, but I have one big question on this one: can we instead do it in a way which does not break the UDAF interface?

          The existing patch adds a new method to the GenericUDAFResolver interface, meaning all existing plugin implementations outside of the Hive codebase will fail to compile (due to the fact that we did not already have the insulating abstract base class available). We already have some of these within Facebook.

          Let's analyze the two new parameters one by one.

          isDistinct: this doesn't actually modify the choice of evaluator implementation at all, since the actual duplicate elimination takes place upstream of the UDAF invocation. So instead of adding this parameter, can we instead add a new method supportsDistinct() on GenericUDAFEvaluator? Then call this after instantiating the new evaluator in order to carry out the additional validation.

          isAllColumns: COUNT is probably the only function which is ever even going to care about this one. Couldn't we just use an empty array of TypeInfo to indicate all columns?

          Independent of the above, I think adding the insulating abstract base should still be done now to make future transitions smoother when interface-breaking is absolutely required. So keep that part of the patch.

          Show
          John Sichi added a comment - Sorry to chime in late on this one, but I have one big question on this one: can we instead do it in a way which does not break the UDAF interface? The existing patch adds a new method to the GenericUDAFResolver interface, meaning all existing plugin implementations outside of the Hive codebase will fail to compile (due to the fact that we did not already have the insulating abstract base class available). We already have some of these within Facebook. Let's analyze the two new parameters one by one. isDistinct: this doesn't actually modify the choice of evaluator implementation at all, since the actual duplicate elimination takes place upstream of the UDAF invocation. So instead of adding this parameter, can we instead add a new method supportsDistinct() on GenericUDAFEvaluator? Then call this after instantiating the new evaluator in order to carry out the additional validation. isAllColumns: COUNT is probably the only function which is ever even going to care about this one. Couldn't we just use an empty array of TypeInfo to indicate all columns? Independent of the above, I think adding the insulating abstract base should still be done now to make future transitions smoother when interface-breaking is absolutely required. So keep that part of the patch.
          Hide
          Arvind Prabhakar added a comment -

          Submitting the regenerated patch with lastest trunk image. Patch file is HIVE-287-3.patch.

          Show
          Arvind Prabhakar added a comment - Submitting the regenerated patch with lastest trunk image. Patch file is HIVE-287 -3.patch.
          Hide
          Namit Jain added a comment -

          Arvind, I am sorry about the delay in getting back to this.
          Can you regenerate the patch - it is not applying cleanly.
          I will look at it right away

          Show
          Namit Jain added a comment - Arvind, I am sorry about the delay in getting back to this. Can you regenerate the patch - it is not applying cleanly. I will look at it right away
          Hide
          Arvind Prabhakar added a comment -

          Modified the implementation as per review feedback - Introduced an abstract base class and reverted the resolvers to extend that instead of directly implementing the new functionality.

          Show
          Arvind Prabhakar added a comment - Modified the implementation as per review feedback - Introduced an abstract base class and reverted the resolvers to extend that instead of directly implementing the new functionality.
          Hide
          Namit Jain added a comment -

          I am sorry about 1 and 3 - you can ignore those.

          2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which has the new API - and a common class which invokes the default implementation, that would be better.

          Here is what I understand your suggestion as: Add a new method to GenericUDAFResolver interface maintaining the old method. Create an abstract base class that implements the new interface method and invokes the old method by dropping isDistinct/isAllColumn arguments. Extend the current resolvers to override this method. Will this address your concern? If not, can you provide a concrete example.

          – yes

          Show
          Namit Jain added a comment - I am sorry about 1 and 3 - you can ignore those. 2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which has the new API - and a common class which invokes the default implementation, that would be better. Here is what I understand your suggestion as: Add a new method to GenericUDAFResolver interface maintaining the old method. Create an abstract base class that implements the new interface method and invokes the old method by dropping isDistinct/isAllColumn arguments. Extend the current resolvers to override this method. Will this address your concern? If not, can you provide a concrete example. – yes
          Hide
          Arvind Prabhakar added a comment -

          Thanks for taking a look at this patch Namit. I have some questions and clarifcations regarding your feedback:

          1. This should be independent of COUNT - so, all basically all aggregation functions should be supported with DISTINCT.

          For eg: select avg(distinct c1,c2) from T

          Not sure how this relates to the change I made. Even before making this change, the DISTINCT qualifier was allowed for any function invocation. Can you elaborate what you mean by this? Specifically, which part of the patch needs to be changed in order to accomodate this request.

          2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which has the new API - and a common class which invokes the default implementation, that would be better.

          Here is what I understand your suggestion as: Add a new method to GenericUDAFResolver interface maintaining the old method. Create an abstract base class that implements the new interface method and invokes the old method by dropping isDistinct/isAllColumn arguments. Extend the current resolvers to override this method. Will this address your concern? If not, can you provide a concrete example.

          3. Follows from 1 - more tests are needed

          Are you suggesting more tests for array_contains UDF or to add more tests for other UDFs? Please clarify with examples if possible.

          Show
          Arvind Prabhakar added a comment - Thanks for taking a look at this patch Namit. I have some questions and clarifcations regarding your feedback: 1. This should be independent of COUNT - so, all basically all aggregation functions should be supported with DISTINCT. For eg: select avg(distinct c1,c2) from T Not sure how this relates to the change I made. Even before making this change, the DISTINCT qualifier was allowed for any function invocation. Can you elaborate what you mean by this? Specifically, which part of the patch needs to be changed in order to accomodate this request. 2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which has the new API - and a common class which invokes the default implementation, that would be better. Here is what I understand your suggestion as: Add a new method to GenericUDAFResolver interface maintaining the old method. Create an abstract base class that implements the new interface method and invokes the old method by dropping isDistinct/isAllColumn arguments. Extend the current resolvers to override this method. Will this address your concern? If not, can you provide a concrete example. 3. Follows from 1 - more tests are needed Are you suggesting more tests for array_contains UDF or to add more tests for other UDFs? Please clarify with examples if possible.
          Hide
          Namit Jain added a comment -

          Overall looks good, some minor comments:

          1. This should be independent of COUNT - so, all basically all aggregation functions should be supported with DISTINCT.
          For eg: select avg(distinct c1,c2) from T

          and so on.

          2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which
          has the new API - and a common class which invokes the default implementation, that would be better.

          3. Follows from 1 - more tests are needed

          Show
          Namit Jain added a comment - Overall looks good, some minor comments: 1. This should be independent of COUNT - so, all basically all aggregation functions should be supported with DISTINCT. For eg: select avg(distinct c1,c2) from T and so on. 2. It would be a good idea to maintain some compatibility for the existing interface - so, can we add another method to UDAFResolver, which has the new API - and a common class which invokes the default implementation, that would be better. 3. Follows from 1 - more tests are needed
          Hide
          Arvind Prabhakar added a comment -

          Summary
          This patch fixes the count() aggregate function to be consistent with SQL. Specifically:

          • Provides support for SELECT count FROM table queries, where it returns the total number of rows of the table.
          • Also extended the support for count() to include multiple expression list. count(DISTINCT expr1, exp2,...) returns the number of non-NULL and different valued rows from the evaluated expressions.

          Details

          • Modified HiveQL grammar to allow function invocation with a single * in place of parameter list.
          • Propagated the presence of * as parameter or specification of DISTINCT keyword in the UDF resolver framework so that it can be used by UDFs that behave differently when these are applicable.
          • Modified the count() UDAF to support the same semantics of handling NULL values as regular SQL.
          • Added test case to specifically exercise the newly introduced semantics of the count UDAF.

          Testing
          Ran all tests. Noted only two failures (input20.q, input33.q) which were found to be failing on the local trunk image as well.

          If and when this patch is committed to the trunk, I will go ahead and update the Hive Wiki with details and examples regarding the use of count() UDAF in various forms.

          Show
          Arvind Prabhakar added a comment - Summary This patch fixes the count() aggregate function to be consistent with SQL. Specifically: Provides support for SELECT count FROM table queries, where it returns the total number of rows of the table. Also extended the support for count() to include multiple expression list. count(DISTINCT expr1, exp2,...) returns the number of non-NULL and different valued rows from the evaluated expressions. Details Modified HiveQL grammar to allow function invocation with a single * in place of parameter list. Propagated the presence of * as parameter or specification of DISTINCT keyword in the UDF resolver framework so that it can be used by UDFs that behave differently when these are applicable. Modified the count() UDAF to support the same semantics of handling NULL values as regular SQL. Added test case to specifically exercise the newly introduced semantics of the count UDAF. Testing Ran all tests. Noted only two failures (input20.q, input33.q) which were found to be failing on the local trunk image as well. If and when this patch is committed to the trunk, I will go ahead and update the Hive Wiki with details and examples regarding the use of count() UDAF in various forms.
          Hide
          Namit Jain added a comment -

          You need to use an alias for the sub-query

          select count(1) from (select product_url, name, count(1) from raw_product group by product_url, name)x

          Show
          Namit Jain added a comment - You need to use an alias for the sub-query select count(1) from (select product_url, name, count(1) from raw_product group by product_url, name)x
          Hide
          Patrick Salami added a comment -

          Thanks for the reply! However, I get a parse error when I try this query:

          hive> select count(1) from (select product_url, name, count(1) from
          raw_product group by product_url, name);
          FAILED: Parse Error: line 0:-1 mismatched input '<EOF>' expecting
          Identifier in group by expression

          here is my schema:
          hive> describe raw_product;
          OK
          id int
          merchant_product_id int
          network_id int
          network_merchant_id int
          name string
          brand string
          category_name string
          sku string
          product_url string
          [...]

          Thanks!

          --Patrick


          ___________________________________
          Patrick Salami | Senior Software Engineer
          Extrabux, Inc | http://www.extrabux.com
          patrick@extrabux.com | (858) 449-2241

          Show
          Patrick Salami added a comment - Thanks for the reply! However, I get a parse error when I try this query: hive> select count(1) from (select product_url, name, count(1) from raw_product group by product_url, name); FAILED: Parse Error: line 0:-1 mismatched input '<EOF>' expecting Identifier in group by expression here is my schema: hive> describe raw_product; OK id int merchant_product_id int network_id int network_merchant_id int name string brand string category_name string sku string product_url string [...] Thanks! --Patrick – ___________________________________ Patrick Salami | Senior Software Engineer Extrabux, Inc | http://www.extrabux.com patrick@extrabux.com | (858) 449-2241
          Hide
          Namit Jain added a comment -

          can you try:

          select count(1) from (select col1, col2, count(1) from tbl group by col1, col2);

          Show
          Namit Jain added a comment - can you try: select count(1) from (select col1, col2, count(1) from tbl group by col1, col2);
          Hide
          Patrick Salami added a comment -

          confirmed: happened to me as well; anyone have any workarounds?

          Show
          Patrick Salami added a comment - confirmed: happened to me as well; anyone have any workarounds?
          Hide
          Raghotham Murthy added a comment -

          results in a compiler error:

          FAILED: Error in semantic analysis: line 2:7 Function Argument Type Mismatch count: Looking for UDAF Evaluator"count" with parameters [class java.lang.String, class java.lang.String]

          Show
          Raghotham Murthy added a comment - results in a compiler error: FAILED: Error in semantic analysis: line 2:7 Function Argument Type Mismatch count: Looking for UDAF Evaluator"count" with parameters [class java.lang.String, class java.lang.String]

            People

            • Assignee:
              Arvind Prabhakar
              Reporter:
              Namit Jain
            • Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development