Pig
  1. Pig
  2. PIG-1341

BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.9.0
    • Component/s: impl
    • Labels:
      None

      Description

      Script reads in BinStorage data and tries to convert a column which is in DataByteArray to Chararray.

      raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
      --filter out null columns
      A = filter raw by col1#'bcookie' is not null;
      
      B = foreach A generate col1#'bcookie'  as reqcolumn;
      describe B;
      --B: {regcolumn: bytearray}
      X = limit B 5;
      dump X;
      
      B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
      describe B;
      --B: {convertedcol: chararray}
      X = limit B 5;
      dump X;
      
      

      The first dump produces:

      (36co9b55onr8s)
      (36co9b55onr8s)
      (36hilul5oo1q1)
      (36hilul5oo1q1)
      (36l4cj15ooa8a)

      The second dump produces:
      ()
      ()
      ()
      ()
      ()

      It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s).
      Viraj

      1. PIG-1341.patch
        7 kB
        Richard Ding

        Issue Links

          Activity

          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12440415/PIG-1341.patch
          against trunk revision 929737.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/266/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/266/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/266/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12440415/PIG-1341.patch against trunk revision 929737. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/266/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/266/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/266/console This message is automatically generated.
          Hide
          Daniel Dai added a comment -

          Have a discussion with Alan and Richard, we felt that caster for BinStorage does not make sense. We don't know how to cast bytearray datatype for BinStorage. In the intermediate storage case, we will find the original loader, and use lineage for that loader to convert bytearray. But if user use the BinStorage directly, we have no idea what bytearray means. So the suggestion is we don't give caster to BinStorage. The implication is that if user want to use BinStorage as a temporary store, in some cases, it will fail.

          Here is a sample script which will be broken if we make this change:

          script 1:

          a = load '1.txt';
          b = order a by $0;
          store b into 'temp.out' using BinStorage(); -- store in BinStorage format with the datatype bytearray
          

          script 2:

          a = load 'temp.out' using BinStorage();
          b = foreach a generate $0+$1;   -- here we will need a caster, but BinStorage does not have it, we will fail
          
          Show
          Daniel Dai added a comment - Have a discussion with Alan and Richard, we felt that caster for BinStorage does not make sense. We don't know how to cast bytearray datatype for BinStorage. In the intermediate storage case, we will find the original loader, and use lineage for that loader to convert bytearray. But if user use the BinStorage directly, we have no idea what bytearray means. So the suggestion is we don't give caster to BinStorage. The implication is that if user want to use BinStorage as a temporary store, in some cases, it will fail. Here is a sample script which will be broken if we make this change: script 1: a = load '1.txt'; b = order a by $0; store b into 'temp.out' using BinStorage(); -- store in BinStorage format with the datatype bytearray script 2: a = load 'temp.out' using BinStorage(); b = foreach a generate $0+$1; -- here we will need a caster, but BinStorage does not have it, we will fail
          Hide
          Daniel Dai added a comment -

          Also if we doing that, BinStorage is not "a reliable way to dump data and load it, without having to explicitly list all the fields and figure out their parts" (though it is already not). I think we shall provide some way for this use case.

          Show
          Daniel Dai added a comment - Also if we doing that, BinStorage is not "a reliable way to dump data and load it, without having to explicitly list all the fields and figure out their parts" (though it is already not). I think we shall provide some way for this use case.
          Hide
          Daniel Dai added a comment -

          I am thinking about encoding lineage info into the BinStorage file header, so even after we "dump and load", we still maintain the lineage information.

          Show
          Daniel Dai added a comment - I am thinking about encoding lineage info into the BinStorage file header, so even after we "dump and load", we still maintain the lineage information.
          Hide
          Ashutosh Chauhan added a comment -

          I think BinStorage is an internal way of moving data around in Pig and it should be treated that way. I think we should discourage its usage to user. Otherwise, we need to add capabilities as the one requested here. Important impact of making such a change is that we can't then swap out BinStorage with other storage mechanisms. If Avro (or protobuf or whatever) proved to be a better replacement for BinStorage, then we cant just swap them in place of BinStorage, unless we add to them all the capabilities that BinStorage has. Therefore, I suggest to keep capabilities of BinStorage to minimal.

          Show
          Ashutosh Chauhan added a comment - I think BinStorage is an internal way of moving data around in Pig and it should be treated that way. I think we should discourage its usage to user. Otherwise, we need to add capabilities as the one requested here. Important impact of making such a change is that we can't then swap out BinStorage with other storage mechanisms. If Avro (or protobuf or whatever) proved to be a better replacement for BinStorage, then we cant just swap them in place of BinStorage, unless we add to them all the capabilities that BinStorage has. Therefore, I suggest to keep capabilities of BinStorage to minimal.
          Hide
          Alan Gates added a comment -

          I agree with Ashutosh. We do not want BinStorage tracking data lineage. In the case where Pig is using BinStorage (or whatever) for moving data between MR jobs then Pig can figure out the correct cast function to use and apply it. For cases such as the one here where users are storing data using BinStorage and then in a separate Pig Latin script reading it (and thus loosing the type information) it is the users responsibility to correctly cast the data before storing it in BinStorage. As a general case I do not think we can expect load and store functions to track data lineage across Pig Latin scripts.

          I propose we close this as won't fix.

          Show
          Alan Gates added a comment - I agree with Ashutosh. We do not want BinStorage tracking data lineage. In the case where Pig is using BinStorage (or whatever) for moving data between MR jobs then Pig can figure out the correct cast function to use and apply it. For cases such as the one here where users are storing data using BinStorage and then in a separate Pig Latin script reading it (and thus loosing the type information) it is the users responsibility to correctly cast the data before storing it in BinStorage. As a general case I do not think we can expect load and store functions to track data lineage across Pig Latin scripts. I propose we close this as won't fix.
          Hide
          Viraj Bhat added a comment -

          Hi Alan,
          I think we should stop exposing BinStorage to users otherwise this will keep happening.
          We should document this behavior of BinStorage if we are allowing users to use this load-store class.
          Viraj

          Show
          Viraj Bhat added a comment - Hi Alan, I think we should stop exposing BinStorage to users otherwise this will keep happening. We should document this behavior of BinStorage if we are allowing users to use this load-store class. Viraj
          Hide
          Alan Gates added a comment -

          We can mark BinStorage as deprecated if we want, though I don't think we should without a decent sequence file load/store function. But we can't remove it at the moment, as people use it.

          Show
          Alan Gates added a comment - We can mark BinStorage as deprecated if we want, though I don't think we should without a decent sequence file load/store function. But we can't remove it at the moment, as people use it.

            People

            • Assignee:
              Alan Gates
              Reporter:
              Viraj Bhat
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development