Pig
  1. Pig
  2. PIG-1014

Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.4.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Activity

      Pradeep Kamath created issue -
      Hide
      Santhosh Srinivasan added a comment -

      When the semantics of COUNT was changed, I thought this was communicated with the users. What is the intention of this jira?

      Show
      Santhosh Srinivasan added a comment - When the semantics of COUNT was changed, I thought this was communicated with the users. What is the intention of this jira?
      Hide
      Pradeep Kamath added a comment -

      The jira is to track if it is possible to automatically convert a COUNT(relation) in the script to COUNT_STAR(relation) in the plan so that nullness of the fields in the records is not considered while returning the count. For example if a relation (A) has two fields and there is the following script snippet:

      B = group A by $0;
      C = foreach B generate group, COUNT(A);
      

      This is equivalent to a count after grouping on the first column in SQL. Per SQL semantics, COUNT counts all records for the group without regard to the nullness of the individual fields. This behavior is achieved through COUNT_STAR built -in in pig. However COUNT built-in in pig is meant for counting a bag with a single column (for example COUNT(A.$0) above). So the implementation in COUNT checks if the first field in the bag is null or not and only counts non null values. In the above script if the first column in the bag is null for any record, it does not get counted which would not be the same as the expected result for COUNT in SQL. So if the compilation phase in pig can detect that the COUNT is being performed on a whole relation (rather than an individual column), it can replace the COUNT with COUNT_STAR and achieve the desired result.

      Show
      Pradeep Kamath added a comment - The jira is to track if it is possible to automatically convert a COUNT(relation) in the script to COUNT_STAR(relation) in the plan so that nullness of the fields in the records is not considered while returning the count. For example if a relation (A) has two fields and there is the following script snippet: B = group A by $0; C = foreach B generate group, COUNT(A); This is equivalent to a count after grouping on the first column in SQL. Per SQL semantics, COUNT counts all records for the group without regard to the nullness of the individual fields. This behavior is achieved through COUNT_STAR built -in in pig. However COUNT built-in in pig is meant for counting a bag with a single column (for example COUNT(A.$0) above). So the implementation in COUNT checks if the first field in the bag is null or not and only counts non null values. In the above script if the first column in the bag is null for any record, it does not get counted which would not be the same as the expected result for COUNT in SQL. So if the compilation phase in pig can detect that the COUNT is being performed on a whole relation (rather than an individual column), it can replace the COUNT with COUNT_STAR and achieve the desired result.
      Hide
      Santhosh Srinivasan added a comment -

      Is Pig trying to guess the user's intent? What if the user wanted to do count without nulls ?

      Show
      Santhosh Srinivasan added a comment - Is Pig trying to guess the user's intent? What if the user wanted to do count without nulls ?
      Hide
      Dmitriy V. Ryaboy added a comment -

      Santosh – if the user wanted to do a "count without nulls in the first field" then she should COUNT(A.$0). I think Pradeep's suggestion causes least surprise to the end user (at least, that's the behaviour I would have expected had I not seen this ticket).

      Show
      Dmitriy V. Ryaboy added a comment - Santosh – if the user wanted to do a "count without nulls in the first field" then she should COUNT(A.$0). I think Pradeep's suggestion causes least surprise to the end user (at least, that's the behaviour I would have expected had I not seen this ticket).
      Hide
      Santhosh Srinivasan added a comment -

      If the user wants to count without nulls then the user should use COUNT_STAR. One of the philosophies of Pig has been to allow users to do exactly what they want. Here, we are violating that philosophy and secondly we are second guessing the user's intention.

      Show
      Santhosh Srinivasan added a comment - If the user wants to count without nulls then the user should use COUNT_STAR. One of the philosophies of Pig has been to allow users to do exactly what they want. Here, we are violating that philosophy and secondly we are second guessing the user's intention.
      Hide
      Alan Gates added a comment -

      I think I agree with Santhosh here. While it may be unfortunate that our syntax makes it difficult to match the rather strange semantics of COUNT vs COUNT in SQL, I'm not sure trying to make a distinct between COUNT(A) and COUNT(A.$0) is the right solution. This will not be obvious at all to users. If anything, the right way to do this would be COUNT(A.*), but I'm not sure even about that.

      Show
      Alan Gates added a comment - I think I agree with Santhosh here. While it may be unfortunate that our syntax makes it difficult to match the rather strange semantics of COUNT vs COUNT in SQL, I'm not sure trying to make a distinct between COUNT(A) and COUNT(A.$0) is the right solution. This will not be obvious at all to users. If anything, the right way to do this would be COUNT(A.*), but I'm not sure even about that.
      Hide
      Pradeep Kamath added a comment -

      The issue I see is with the implementation of COUNT today. It looks at only the first field in the bag and counts only non null values towards the result. This can lead to mysterious results. Consider a relation (A) with two fields with the following contents:

      1 2
      3 4
      null 6
      7 null
      null null
      

      If we have the following snippet:

      B = group A all;
      C = foreach B generate COUNT(A);
      

      The answer is 3 which was arrived at only by considering record 1, record 2 and record 4 since the other records have null in the first position. Ironically though record 4 has null in the second position that does not prevent it from being not counted. So the result being based on the null-ness of just the first field seems somewhat arbitrary. My concern is that most users would not know that the result was arrived at after dropping records which had null in the first field even though they did not specify COUNT(A.$0). Status Quo means we equate COUNT(A) to COUNT(A.$0) which is also not apparent to users.

      Show
      Pradeep Kamath added a comment - The issue I see is with the implementation of COUNT today. It looks at only the first field in the bag and counts only non null values towards the result. This can lead to mysterious results. Consider a relation (A) with two fields with the following contents: 1 2 3 4 null 6 7 null null null If we have the following snippet: B = group A all; C = foreach B generate COUNT(A); The answer is 3 which was arrived at only by considering record 1, record 2 and record 4 since the other records have null in the first position. Ironically though record 4 has null in the second position that does not prevent it from being not counted. So the result being based on the null-ness of just the first field seems somewhat arbitrary. My concern is that most users would not know that the result was arrived at after dropping records which had null in the first field even though they did not specify COUNT(A.$0). Status Quo means we equate COUNT(A) to COUNT(A.$0) which is also not apparent to users.
      Hide
      Santhosh Srinivasan added a comment -

      Essentially, Pradeep is pointing out an issue in the implementation of COUNT. If that is the case then COUNT has to be fixed or the semantics of COUNT has to be documented to explain the current implementation. I would vote for fixing COUNT to have the correct semantics.

      Show
      Santhosh Srinivasan added a comment - Essentially, Pradeep is pointing out an issue in the implementation of COUNT. If that is the case then COUNT has to be fixed or the semantics of COUNT has to be documented to explain the current implementation. I would vote for fixing COUNT to have the correct semantics.
      Hide
      Pradeep Kamath added a comment -

      Here are the options we have in my mind:
      1) status Quo which is confusing - so not quite an option
      2) We disallow anything other than counting a single column with COUNT - So COUNT(A), COUNT(A.), COUNT(A.($0,$1)) all would throw an error saying COUNT does not support this - I this case would COUNT_STAR be the suggestion to users? It would seem so for COUNT(A) and COUNT(A.) not so sure for COUNT(A.($0,$1))
      3) Same as 2) but instead of throwing an error, the parser replaces COUNT with COUNT_STAR - here again we need to decide what we should do with COUNT(A.($0,$1))
      4) Change COUNT to look at all fields in the bag and only if all fields are null not count that towards the result - In this case, COUNT(A) and COUNT_STAR(A) will not give same results but we could document this.

      Show
      Pradeep Kamath added a comment - Here are the options we have in my mind: 1) status Quo which is confusing - so not quite an option 2) We disallow anything other than counting a single column with COUNT - So COUNT(A), COUNT(A. ), COUNT(A.($0,$1)) all would throw an error saying COUNT does not support this - I this case would COUNT_STAR be the suggestion to users? It would seem so for COUNT(A) and COUNT(A. ) not so sure for COUNT(A.($0,$1)) 3) Same as 2) but instead of throwing an error, the parser replaces COUNT with COUNT_STAR - here again we need to decide what we should do with COUNT(A.($0,$1)) 4) Change COUNT to look at all fields in the bag and only if all fields are null not count that towards the result - In this case, COUNT(A) and COUNT_STAR(A) will not give same results but we could document this.
      Hide
      Santhosh Srinivasan added a comment -

      After a discussion with Pradeep who also graciously ran SQL queries to verify semantics, we have the following proposal:

      The semantics of COUNT could be defined as:

      1. COUNT( A ) is equivalent to COUNT( A.* ) and the result of COUNT( A ) will count null tuples in the relation
      2. COUNT( A.$0) will not count null tuples in the relation

      3. COUNT(A.($0, $1)) is equivalent to COUNT( A1.* ) where A1 is the relation containing tuples with two columns and will exhibit the behavior of statement 1

      OR

      3. COUNT(A.($0, $1)) is equivalent to COUNT( A1.* ) where A1 is the relation containing tuples with two columns and will exhibit the behavior of statement 2

      Point 3 needs more discussion.

      Comments/thoughts/suggestions/anything else welcome.

      Show
      Santhosh Srinivasan added a comment - After a discussion with Pradeep who also graciously ran SQL queries to verify semantics, we have the following proposal: The semantics of COUNT could be defined as: 1. COUNT( A ) is equivalent to COUNT( A.* ) and the result of COUNT( A ) will count null tuples in the relation 2. COUNT( A.$0) will not count null tuples in the relation 3. COUNT(A.($0, $1)) is equivalent to COUNT( A1.* ) where A1 is the relation containing tuples with two columns and will exhibit the behavior of statement 1 OR 3. COUNT(A.($0, $1)) is equivalent to COUNT( A1.* ) where A1 is the relation containing tuples with two columns and will exhibit the behavior of statement 2 Point 3 needs more discussion. Comments/thoughts/suggestions/anything else welcome.
      Hide
      Pradeep Kamath added a comment -

      To achieve 1. above, we would translate COUNT( A ) to COUNT_STAR( A ) during job compilation. Since 3. above has multiple options and does not seem to be a prevalent use case (SQL does not support it), another option is to disable it - thoughts?

      Show
      Pradeep Kamath added a comment - To achieve 1. above, we would translate COUNT( A ) to COUNT_STAR( A ) during job compilation. Since 3. above has multiple options and does not seem to be a prevalent use case (SQL does not support it), another option is to disable it - thoughts?
      Hide
      Santhosh Srinivasan added a comment -

      Another option is to change the implementation of COUNT to reflect the proposed semantics. If the underlying UDF is changed then the user should be notified via an information message. If the user checks the explain output then (s)he will notice COUNT_STAR and will be confused.

      Show
      Santhosh Srinivasan added a comment - Another option is to change the implementation of COUNT to reflect the proposed semantics. If the underlying UDF is changed then the user should be notified via an information message. If the user checks the explain output then (s)he will notice COUNT_STAR and will be confused.
      Hide
      Dmitriy V. Ryaboy added a comment -

      A link that talks about some of the more "interesting" behaviors of NULL in SQL: http://thoughts.j-davis.com/2009/08/02/what-is-the-deal-with-nulls/

      The difference between COUNT and COUNT_STAR is that COUNT_STAR counts nulls. I think this ticket boils down to the question, "what do we consider a null tuple?". At the moment, we consider A.$0 to determine whether the tuple is null; that doesn't seem right, and surprises users. We have two options that both make sense – a null tuple is a tuple in which all fields are null, or a null tuple is a tuple which is completely null (ie, doesn't even have any fields). I am in favor of the first definition, which is a superset of the second.

      Show
      Dmitriy V. Ryaboy added a comment - A link that talks about some of the more "interesting" behaviors of NULL in SQL: http://thoughts.j-davis.com/2009/08/02/what-is-the-deal-with-nulls/ The difference between COUNT and COUNT_STAR is that COUNT_STAR counts nulls. I think this ticket boils down to the question, "what do we consider a null tuple?". At the moment, we consider A.$0 to determine whether the tuple is null; that doesn't seem right, and surprises users. We have two options that both make sense – a null tuple is a tuple in which all fields are null, or a null tuple is a tuple which is completely null (ie, doesn't even have any fields). I am in favor of the first definition, which is a superset of the second.

        People

        • Assignee:
          Unassigned
          Reporter:
          Pradeep Kamath
        • Votes:
          1 Vote for this issue
          Watchers:
          0 Start watching this issue

          Dates

          • Created:
            Updated:

            Development