Pig
  1. Pig
  2. PIG-2290

TOBAG wraps tuple parameters in another tuple

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.10.0
    • Component/s: internal-udfs
    • Labels:
      None
    • Patch Info:
      Patch Available
    • Release Note:
      If the argument to TOBAG is a tuple, it does not get inserted into another tuple before being added to the bag.

      Description

      The TOBAG function indiscriminately wraps all parameters in a tuple. When I pass a list of tuples to the function, I would expect it to return a bag containing those tuples. Instead, it returns a bag containing single element tuples, where each tuple contains one of the tuples passed in.

      Example:

      tuples.txt
      (mike,608)
      (ryan,11624)
      (justin,2317)
      
      Demonstration using pig 0.9.0
      grunt> TUPLE_DATA = LOAD 'tuples.txt' AS (T:tuple(name:chararray,street_number:int));
      grunt> BAGGED = FOREACH TUPLE_DATA GENERATE TOBAG(T);
      grunt> DESCRIBE BAGGED;
      BAGGED: {{(name: chararray,street_number: int)}}
      grunt> DUMP BAGGED;
      ({((mike,608))})
      ({((ryan,11624))})
      ({((justin,2317))})
      
      1. pig-2290.patch
        3 kB
        Ryan Hoegg
      2. pig-2290.1.patch
        4 kB
        Thejas M Nair

        Activity

        Hide
        Thejas M Nair added a comment -

        +1 . Patch committed to trunk.
        Thanks for the contribution Ryan!

        Show
        Thejas M Nair added a comment - +1 . Patch committed to trunk. Thanks for the contribution Ryan!
        Hide
        Thejas M Nair added a comment -

        pig-2290.1.patch - fix patch format issue, updated javadoc.

        Show
        Thejas M Nair added a comment - pig-2290.1.patch - fix patch format issue, updated javadoc.
        Hide
        Thejas M Nair added a comment -

        I didn't notice that the output does not match schema. That means one can't use the output of the TOBAG later in the query. So fixing this is not likely to have an impact on most of the existing queries. It will affect only some cases where the output of TOBAG is stored without further processing on it into a storage that does not store schema.
        But this use case is something that users would need to check for, when the upgrade. I would vote for making this change only in 0.10, and not for 0.9.1 .

        Show
        Thejas M Nair added a comment - I didn't notice that the output does not match schema. That means one can't use the output of the TOBAG later in the query. So fixing this is not likely to have an impact on most of the existing queries. It will affect only some cases where the output of TOBAG is stored without further processing on it into a storage that does not store schema. But this use case is something that users would need to check for, when the upgrade. I would vote for making this change only in 0.10, and not for 0.9.1 .
        Hide
        Ryan Hoegg added a comment -

        This is probably true, but the current implementation behaves inconsistently (output does not match schema). Plus, it makes it impossible to generate a bag containing tuples with more than one element.

        Is 0.9 -> 0.10 considered a "major version upgrade"?

        Show
        Ryan Hoegg added a comment - This is probably true, but the current implementation behaves inconsistently (output does not match schema). Plus, it makes it impossible to generate a bag containing tuples with more than one element. Is 0.9 -> 0.10 considered a "major version upgrade"?
        Hide
        Thejas M Nair added a comment -

        This change is not backward compatible. It can break existing pig queries.
        It can be argued that the TOBAG current implementation has a correct/consistent behavior - it puts each argument into a tuple and adds it to a bag. The bag always contains tuple with single element.

        I think there has to be a very strong reason to break backward compatibility. This could go into a new UDF (say TOBAG_2 ?). Or this could go into some major version upgrade of pig where we make bunch of non backward compatible changes.

        Show
        Thejas M Nair added a comment - This change is not backward compatible. It can break existing pig queries. It can be argued that the TOBAG current implementation has a correct/consistent behavior - it puts each argument into a tuple and adds it to a bag. The bag always contains tuple with single element. I think there has to be a very strong reason to break backward compatibility. This could go into a new UDF (say TOBAG_2 ?). Or this could go into some major version upgrade of pig where we make bunch of non backward compatible changes.
        Hide
        Ryan Hoegg added a comment -

        Fixed formatting in TOBAG.java

        Show
        Ryan Hoegg added a comment - Fixed formatting in TOBAG.java
        Hide
        Dmitriy V. Ryaboy added a comment -

        Looks good, though the formatting is a little off:

        +                } else {
                         Tuple tp2 = TupleFactory.getInstance().newTuple(1);
                         tp2.set(0, object);
                         bag.add(tp2);
                     }
        +            }
        

        The else block is not indented properly.

        Ok to put in trunk or do you feel this must go into 0.9.1?

        Show
        Dmitriy V. Ryaboy added a comment - Looks good, though the formatting is a little off: + } else { Tuple tp2 = TupleFactory.getInstance().newTuple(1); tp2.set(0, object); bag.add(tp2); } + } The else block is not indented properly. Ok to put in trunk or do you feel this must go into 0.9.1?
        Hide
        Ryan Hoegg added a comment -

        Patch to TOBAG function to check if the parameter is a tuple before creating one, including unit test

        Show
        Ryan Hoegg added a comment - Patch to TOBAG function to check if the parameter is a tuple before creating one, including unit test

          People

          • Assignee:
            Ryan Hoegg
            Reporter:
            Ryan Hoegg
          • Votes:
            2 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development