Hive
  1. Hive
  2. HIVE-362

avg() ignores null values; consider variant that doesn't

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None

      Description

      Some of the current aggregates (sum, avg) have a fairly standard behavior: If any item in the list is NULL, the sum, average, etc., cannot be computed. And so, NULL is returned.

      1) If this is the case, the query should return much faster--see a null, return NULL, exit(0).

      2) It would be nice to have versions or ways to use these functions with NULL data--specifically, to treat the NULL as zero or to ignore the NULL and return the results for non-NULL data.

      This also would apply to the variance functions referenced in https://issues.apache.org/jira/browse/HIVE-165

        Activity

        Hide
        Adam Kramer added a comment -

        This is not a problem anymore.

        Just to be a bit opinionated for a few moments, though, I do believe the standards to be wrong on this issue; NULL values are an excellent way to force scientists to really think about the query they're running, and implicitly removing them will generally lead to harder-to-debug errors and more wasted time than having to call the "remove nulls" version, call it avg_rn, explicitly.

        Show
        Adam Kramer added a comment - This is not a problem anymore. Just to be a bit opinionated for a few moments, though, I do believe the standards to be wrong on this issue; NULL values are an excellent way to force scientists to really think about the query they're running, and implicitly removing them will generally lead to harder-to-debug errors and more wasted time than having to call the "remove nulls" version, call it avg_rn, explicitly.
        Hide
        Zheng Shao added a comment -

        Hi Adam, it seems this is not a problem any more. Can I close the issue?

        Show
        Zheng Shao added a comment - Hi Adam, it seems this is not a problem any more. Can I close the issue?
        Hide
        Raghotham Murthy added a comment - - edited

        In your example, even with NULLs, SUM(a) / COUNT(a) will be the same as AVG(a). This is because COUNT(a) returns the number of rows which have non-null 'a'. This is different from COUNT(1) which returns the count of all rows.

        Show
        Raghotham Murthy added a comment - - edited In your example, even with NULLs, SUM(a) / COUNT(a) will be the same as AVG(a). This is because COUNT(a) returns the number of rows which have non-null 'a'. This is different from COUNT(1) which returns the count of all rows.
        Hide
        Adam Kramer added a comment -

        Very sorry! After further testing, the actual problem I was having was that SUM returns a double, and I was putting it into an INT container without an explicit cast.

        So, maybe this bug is a request to change the wiki to make this clear.

        Also...let's actually say the opposite. We really should have AVG or some version of it return NULL if any NULL values are found. This is the default behavior in R, and for good reason...I will link to the description of why once I find it.

        Basically, the argument is that when a sum, average, or variance depends on n in some sense, comparisons among these groups cannot be trusted. For example, if there are 5 nulls in a set of 20 variables, SUM / COUNT returns a different value than AVG...that is very unintuitive.

        Show
        Adam Kramer added a comment - Very sorry! After further testing, the actual problem I was having was that SUM returns a double, and I was putting it into an INT container without an explicit cast. So, maybe this bug is a request to change the wiki to make this clear. Also...let's actually say the opposite. We really should have AVG or some version of it return NULL if any NULL values are found. This is the default behavior in R, and for good reason...I will link to the description of why once I find it. Basically, the argument is that when a sum, average, or variance depends on n in some sense, comparisons among these groups cannot be trusted. For example, if there are 5 nulls in a set of 20 variables, SUM / COUNT returns a different value than AVG ...that is very unintuitive.
        Hide
        Raghotham Murthy added a comment -

        Marking this as a bug.

        Show
        Raghotham Murthy added a comment - Marking this as a bug.
        Hide
        Raghotham Murthy added a comment -

        Aggregates are supposed to ignore nulls according to the standard. The code for SUM and AVG seems to indicate that they do ignore NULLs. Are you getting NULLs?

        Show
        Raghotham Murthy added a comment - Aggregates are supposed to ignore nulls according to the standard. The code for SUM and AVG seems to indicate that they do ignore NULLs. Are you getting NULLs?

          People

          • Assignee:
            Unassigned
            Reporter:
            Adam Kramer
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development