Pig
  1. Pig
  2. PIG-1745

Disable converting bytes loading from BinStorage

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.9.0
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      1. Converting bytes loading from BinStorage() will now result an error.
      2. If user clearly knows the caster for the BinStorage, he/she can pass the caster class name to the construct of BinStorage. For example, if data is load from PigStorage (or other LoadFunc using Utf8StorageConverter) and store using BinStorage, user can pass "org.apache.pig.builtin.Utf8StorageConverter" to the construct of BinStorage (or pass "Utf8StorageConverter", since we will search for "org.apache.pig.builtin" package by default). By doing this, converting bytes to other type will still work.
      Show
      1. Converting bytes loading from BinStorage() will now result an error. 2. If user clearly knows the caster for the BinStorage, he/she can pass the caster class name to the construct of BinStorage. For example, if data is load from PigStorage (or other LoadFunc using Utf8StorageConverter) and store using BinStorage, user can pass "org.apache.pig.builtin.Utf8StorageConverter" to the construct of BinStorage (or pass "Utf8StorageConverter", since we will search for "org.apache.pig.builtin" package by default). By doing this, converting bytes to other type will still work.

      Description

      If we load bytes from BinStorage, we don't actually know how we get these bytes originally, and we will not have a way to cast those bytes. Ideally we shall encode caster into BinStorage data file, but we are not there yet. Currrently bytesToXXX methods for BinStorage is wrong and it results unexpected errors. Eg.

      a = load '1.txt' as (a0, a1, a2);
      store a into '1.bin' as BinStorage();
      
      a = load '1.bin' using BinStorage as (a0, a1, a2);
      b = foreach a generate (long)a0;
      dump b;
      

      The code will run but produce wrong data. It's less confusing if we throw an exception in this case.

      Release Notes:
      Pig will throw exception in the case we want to convert bytes loading from BinStorage

      1. PIG-1745-3.patch
        13 kB
        Daniel Dai
      2. PIG-1745-2.patch
        13 kB
        Daniel Dai
      3. PIG-1745-1.patch
        9 kB
        Daniel Dai

        Issue Links

          Activity

          Hide
          Dmitriy V. Ryaboy added a comment -

          Daniel,
          check out how I addressed a very similar problem in the HBase loader – I have a default caster, and allow a user to specify one using a constructor if necessary. I think that's cleaner than adding and extra storage class.

          https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java

          Show
          Dmitriy V. Ryaboy added a comment - Daniel, check out how I addressed a very similar problem in the HBase loader – I have a default caster, and allow a user to specify one using a constructor if necessary. I think that's cleaner than adding and extra storage class. https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java
          Hide
          Daniel Dai added a comment -

          Yes, I think it is good, in that we can also allow other caster. But different from HBaseStorage, I don't want to make UTF8StorageConverter special, we require full qualified caster class name in the constructor parameter, unless this class is in org.apache.pig.builtin.

          Show
          Daniel Dai added a comment - Yes, I think it is good, in that we can also allow other caster. But different from HBaseStorage, I don't want to make UTF8StorageConverter special, we require full qualified caster class name in the constructor parameter, unless this class is in org.apache.pig.builtin.
          Hide
          Daniel Dai added a comment -

          Attach another patch addressing comments from Dmitriy and Thejas.

          Show
          Daniel Dai added a comment - Attach another patch addressing comments from Dmitriy and Thejas.
          Hide
          Thejas M Nair added a comment -

          +1
          please commit after test-patch and unit tests succeed.

          Show
          Thejas M Nair added a comment - +1 please commit after test-patch and unit tests succeed.
          Hide
          Daniel Dai added a comment -

          test-patch:
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 2 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Unit test:
          pass

          end-to-end test:
          pass

          Show
          Daniel Dai added a comment - test-patch: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 2 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Unit test: pass end-to-end test: pass
          Hide
          Mridul Muralidharan added a comment -

          Since this is a backwardly incompatible change, some release notes indicating how users can continue to rely on earlier behavior (when the data being casted is indeed a long in the example in desc) would be good - assuming it is still possible to do so.
          If it is totally removed as an option, it is very unfortunate and drastically diminishes the value of BinStorage.

          Show
          Mridul Muralidharan added a comment - Since this is a backwardly incompatible change, some release notes indicating how users can continue to rely on earlier behavior (when the data being casted is indeed a long in the example in desc) would be good - assuming it is still possible to do so. If it is totally removed as an option, it is very unfortunate and drastically diminishes the value of BinStorage.
          Hide
          Daniel Dai added a comment -

          Since this is a backwardly incompatible change, some release notes indicating how users can continue to rely on earlier behavior (when the data being casted is indeed a long in the example in desc) would be good - assuming it is still possible to do so.

          Yes, I moved these Jiras to the right section in CHANGES.txt

          If it is totally removed as an option, it is very unfortunate and drastically diminishes the value of BinStorage.

          What do you mean "removed as an option"? This option will not be removed.

          Show
          Daniel Dai added a comment - Since this is a backwardly incompatible change, some release notes indicating how users can continue to rely on earlier behavior (when the data being casted is indeed a long in the example in desc) would be good - assuming it is still possible to do so. Yes, I moved these Jiras to the right section in CHANGES.txt If it is totally removed as an option, it is very unfortunate and drastically diminishes the value of BinStorage. What do you mean "removed as an option"? This option will not be removed.
          Hide
          Olga Natkovich added a comment -

          Just to clarify what is happenning (and we are adding information to the 0.9 documentation)

          The way to get previous functionality is to specify a converter for BinStorage to use to do the casts:

          a = load 'g/part*' using BinStorage('Utf8StorageConverter') as (id, d:bag

          {t:(v, s)}

          );
          b = foreach a generate (double)id, flatten(d);
          dump b;

          The UTF8StorageConverter is provided by default in Pig.

          We are require users to specify the converter explicitely to make sure that wrong results are not returned in case the data is not in the format UTF8Converter can understand.

          Show
          Olga Natkovich added a comment - Just to clarify what is happenning (and we are adding information to the 0.9 documentation) The way to get previous functionality is to specify a converter for BinStorage to use to do the casts: a = load 'g/part*' using BinStorage('Utf8StorageConverter') as (id, d:bag {t:(v, s)} ); b = foreach a generate (double)id, flatten(d); dump b; The UTF8StorageConverter is provided by default in Pig. We are require users to specify the converter explicitely to make sure that wrong results are not returned in case the data is not in the format UTF8Converter can understand.
          Hide
          Mridul Muralidharan added a comment -

          @Daniel Dai: By "removed as an option", I meant whether the ability to cast to arbitrary schema was removed or not. Olga clarified that.

          @Olga : Glad to see there is a way to achieve the functionality I was looking for.
          But passing classname as a parameter seems fragile behavior if the intention is to get an instance of LoadCaster.
          How would we handle cases for other loaders which require similar functionality (just as with BinStorage, it is not practical to assume all loaders can implement LoadCaster 'well') ? Mirror similar idiom ?
          Exposing implementation dependent constructor arguments for things like this is not extensible in long run if the deficiency is due to pig interface requirements. A more general way to plug it in would be better (and more easily verifiable by pig).

          Show
          Mridul Muralidharan added a comment - @Daniel Dai: By "removed as an option", I meant whether the ability to cast to arbitrary schema was removed or not. Olga clarified that. @Olga : Glad to see there is a way to achieve the functionality I was looking for. But passing classname as a parameter seems fragile behavior if the intention is to get an instance of LoadCaster. How would we handle cases for other loaders which require similar functionality (just as with BinStorage, it is not practical to assume all loaders can implement LoadCaster 'well') ? Mirror similar idiom ? Exposing implementation dependent constructor arguments for things like this is not extensible in long run if the deficiency is due to pig interface requirements. A more general way to plug it in would be better (and more easily verifiable by pig).
          Hide
          Dmitriy V. Ryaboy added a comment -

          Mridul: the same idiom is used in HBaseStorage.

          Can you suggest a better idiom or interface?

          Show
          Dmitriy V. Ryaboy added a comment - Mridul: the same idiom is used in HBaseStorage. Can you suggest a better idiom or interface?
          Hide
          Mridul Muralidharan added a comment -

          As of now, as we have more loaders - sherpa, fish, hive, etc - which require similar functionality which is already duplicated by BinStorage, hbase, etc : all of them will end up exposing details of this nature directly to the user : that too in ways which are validated differently by each (not to mention duplication of code).

          load and store func's are getting overloaded with functionalitywhich is not directly (but cosmetically) related to their operation : like in this case, where we tie casting to the loader - but configure it "externally" with the loader acting as a delegate for it.
          Instead of each loader reinventing and exposing similar (or different) idioms, would be better to abstract this away into pig : either as a grammar construct or as something else with reasonable smart default provided by the loader (getDefaultLoadCaster instead of getLoadCaster ?).

          Show
          Mridul Muralidharan added a comment - As of now, as we have more loaders - sherpa, fish, hive, etc - which require similar functionality which is already duplicated by BinStorage, hbase, etc : all of them will end up exposing details of this nature directly to the user : that too in ways which are validated differently by each (not to mention duplication of code). load and store func's are getting overloaded with functionalitywhich is not directly (but cosmetically) related to their operation : like in this case, where we tie casting to the loader - but configure it "externally" with the loader acting as a delegate for it. Instead of each loader reinventing and exposing similar (or different) idioms, would be better to abstract this away into pig : either as a grammar construct or as something else with reasonable smart default provided by the loader (getDefaultLoadCaster instead of getLoadCaster ?).

            People

            • Assignee:
              Daniel Dai
              Reporter:
              Daniel Dai
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development