PIG-1188

Padding nulls to the input tuple according to input schema

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.9.0
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      If the load statement specifies a schema, Pig will truncate the tuple or pad it with nulls so that the loaded data has exactly the number of fields specified in the load statement.

      Description

      Currently, the number of fields in the input tuple is determined by the data. When we have a schema, we should generate the input data according to the schema, padding nulls where necessary. Here is one example:

      Pig script:

      a = load '1.txt' as (a0, a1);
      dump a;
      

      Input file:

      1       2
      1       2       3
      1
      

      Current result:

      (1,2)
      (1,2,3)
      (1)
      

      Desired result:

      (1,2)
      (1,2)
      (1, null)
      
      Attachments

      1. PIG-1188-2.patch
        20 kB
        Daniel Dai
      2. PIG-1188-1.patch
        3 kB
        Daniel Dai

          Activity

          Jie Li made changes -
          Link: This issue relates to PIG-2661
          Olga Natkovich made changes -
          Status: Resolved → Closed
          Daniel Dai made changes -
          Status: Open → Resolved
          Hadoop Flags: Reviewed
          Release Note: If the load statement specifies a schema, Pig will truncate the tuple or pad it with nulls so that the loaded data has exactly the number of fields specified in the load statement.
          Resolution: Fixed
          Daniel Dai added a comment -

          Patch committed to trunk
          Richard Ding added a comment -

          +1
          Daniel Dai made changes -
          Attachment: PIG-1188-2.patch
          Daniel Dai added a comment -

          PIG-1188-2.patch fixes the unit test failures.
          Daniel Dai made changes -
          Attachment: PIG-1188-1.patch
          Olga Natkovich added a comment -

          The remaining work for Pig 0.9 on this issue is to make sure that the same behavior is implemented for the case where a schema is provided during load, regardless of whether type information is provided.

          We want to implement the current behavior of the typed data for the untyped case.

          I believe the way we agreed to do this is by adding a foreach regardless of whether type information is available.
          Alan Gates made changes -
          Assignee: Alan Gates → Daniel Dai
          Alan Gates made changes -
          Assignee: Richard Ding → Alan Gates
          Olga Natkovich made changes -
          Fix Version/s: 0.9.0
          Olga Natkovich made changes -
          Fix Version/s: 0.7.0
          Olga Natkovich added a comment -

          Looks like most common cases are already working. Unlinking from the 0.7.0 release.
          Richard Ding added a comment -

          To summarize where we are:

          Right now the Pig project operator pads a null if the value to be projected doesn't exist. As a consequence, the desired result is achieved if PigStorage is used and a schema with data types is specified, since in this case Pig inserts a project+cast operator for each field in the schema.

          In the case where no schema is specified in the load statement, Pig does a good job of adhering to Pig's philosophy and lets the program run without throwing a runtime exception.

          That leaves the case where a schema is specified without data types. There are several options:

          • Pig automatically inserts a project operator for each field in the schema to ensure the input data matches the schema. The trade-off is a performance penalty. Is it worthwhile if most user data is well-behaved?
          • Users can explicitly add a foreach statement after the load statement which projects all the fields in the schema. This is similar to the practice of running a map job first to clean up the data.
          • Pig can also delegate the padding work to the loaders. The problem is that currently the schema isn't passed to the loaders.
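          The second of these options can be sketched in Pig Latin (a hypothetical user-side workaround, not part of any patch attached here; `1.txt` is the sample input from the description):

```pig
-- assumed workaround: an explicit foreach that projects every field
-- declared in the schema; projecting a field that is missing from a
-- short row yields null, so the row is effectively padded
a = load '1.txt' as (a0, a1);
b = foreach a generate a0, a1;
dump b;
```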
          Richard Ding added a comment -

          Actually, Pig is already padding nulls to the input tuple according to the input schema (with data types):

          For example, given Pig script:

          a = load '1.txt' as (a0:int, a1:int);
          dump a;
          

          and input file:

          1       2
          1       2       3
          1
          

          The result is

          (1,2)
          (1,2)
          (1, null)
          
          Richard Ding added a comment -

          I suggest we don't change the current behavior of Pig regarding non-conforming input data. Pig already handles invalid access (projection) of a non-existent field and returns a null as a substitute. Pig does this optimistically, without checking every tuple up front.

          With PIG-1131, the runtime exception the user encountered is also fixed.
          Alan Gates added a comment -

          A few thoughts:

          In a job that is going to process a billion rows and run for 3 hours, 1 bad row should not cause the whole job to fail.

          This invalid access should certainly cause a warning. Users can look at the warnings at the end of the query and decide they do not want to keep the output because of the warnings. But failure should not be the default case (see previous point). Perhaps we should have a warnings = error option like compilers do so users who are very worried about the warnings can make sure they fail. But that's a different proposal for a different JIRA.

          Third, doing further operations on these columns down the pipeline may result in non-predictable results in other operators.

          I don't follow. Nulls in the pipeline shouldn't cause a problem. UDFs and operators need to be able to handle null values whether they come from processing or from the data itself.

          Second, it can't be assumed that the user wants those non-existent fields to be treated as null. If he wants it that way, he should implement a LoadFunc interface which treats them that way.

          One could argue that it can't be assumed the user wants his query to fail when a field is missing. We have to assume one way or another. Null is a better assumption than failure, since it is possible for a user who doesn't want that behavior to detect it and deal with it. As it is now, the user has to modify his data or write a new load function to deal with padding his data.

          I agree with you that in the schema case, it would be ideal if a missing field were an error. However, given the architecture this is difficult. And stipulating that load functions test every record to assure it matches the schema is too much of a performance penalty. But for the non-schema case I don't agree. Pig's philosophy of "Pigs eat anything" doesn't mean much if Pig gags as soon as it gets a record that doesn't match its expectations.
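          As a concrete illustration of "detect it and deal with it" (a hypothetical user-side snippet, not from this issue's patch), a user who does not want null-padded rows can filter them out right after the load:

```pig
-- assumed handling: discard rows whose trailing field was padded with null
a = load '1.txt' as (a0:int, a1:int);
b = filter a by a1 is not null;
dump b;
```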

          Ashutosh Chauhan made changes -
          Link: This issue relates to PIG-1131
          Ashutosh Chauhan added a comment -

          I have a different take on this. Referring to the original description of this JIRA, I would expect Pig's behavior to be the one given in "Current result" and not the one given in "Desired result". Pig should not try to do anything behind the scenes with the data, which is what "Desired result" proposes. In cases where columns are not consistent, there are two scenarios: with or without a schema. If the user did supply a schema, then I would consider that the user is telling Pig the data is consistent with the schema he is providing, and if that's not the case, it's perfectly fine to throw an exception at runtime. The tricky case is when a schema is not provided and the user tries to access a non-existent field. I think even in such cases it's valid to throw an exception at runtime instead of returning null. First, if the user is trying to access a non-existent field, that's an error condition in any case. Second, it can't be assumed that the user wants those non-existent fields to be treated as null. If he wants it that way, he should implement a LoadFunc interface which treats them that way. Third, doing further operations on these columns down the pipeline may result in non-predictable results in other operators. Fourth, returning null will obscure bugs in Pig where Pig (instead of the user himself) tries to access non-existent fields to construct new tuples at run time, e.g. to do joins (see PIG-1131).

          In short, I am suggesting that Pig should continue to have the behavior it has today. That is, it can load a variable number of columns in a tuple. But if the user accesses a non-existent field, throw the exception and let the user deal with such scenarios himself by implementing his own LoadFunc interface.

          Thoughts?
          Olga Natkovich made changes -
          Assignee: Richard Ding
          Alan Gates added a comment -

          After further thought I want to change my position on this.

          There are two cases to consider: when a schema is present and when it isn't. The problem is that by the time Pig is trying to access the missing field (in the backend), it has no idea whether the schema exists or not. So at runtime, Pig should just return a null if it gets an ArrayIndexOutOfBoundsException.

          How to pad missing data should be left up to the load function. Perhaps certain load functions do know how to pad missing data, or are OK with the pad-at-the-end scheme proposed here. If the load function does not check, then Pig would effectively pad at the end, given the proposal above. If the load function implementer does not want this to happen, s/he can check each tuple being read from the input to assure it matches the schema, and then decide to pad the tuple with nulls, reject the tuple, or return a tuple full of nulls.

          In the case of PigStorage, checking each tuple for a match against the schema is too expensive. Ideally I would like it to, because I think that when the user gives a schema it's an error if the data doesn't match. But I don't want to pay the performance penalty in this case.
          Daniel Dai added a comment -

          I am fine with throwing this record away and issuing a warning to the user. The key issue is not to introduce a tuple with fewer items in it. The follow-up operations depend on the consistency of the tuple size; otherwise we will see strange errors which are very hard to diagnose.
          Alan Gates added a comment -

          I don't think padding is a good idea. We don't know which field in the record is missing. We're just guessing that the last field is missing, when in fact it might be the first. Then we've made the situation worse by inserting invalid data in all the fields.

          I think the loader should either throw the record out, or make all fields in the record null. This guarantees that we are not further propagating the error. Then a warning can be issued that the record was invalid. (I'm assuming that even in the above proposal the loader would issue a warning.)
          Daniel Dai created issue -

            People

            • Assignee:
              Daniel Dai
            • Reporter:
              Daniel Dai
            • Votes:
              0
            • Watchers:
              2
