Pig / PIG-2259

Black hole of multiple level dereference on "bag in bag" structure: cannot reach deeper levels

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
    • Environment:

      Pig 0.9.0 local version, on Linux x86 and Mac OS X 10.7.1

    • Tags:
      dereference, bag

      Description

      I noticed that dereference cannot reach the second level of a "bag in bag" structure. Here is an example:

      For the following scripts:

      a = load 'grade.dat' as (name, age, gpa);
      b = load 'rate.dat' as (state, age, rate);
      ag = group a by (name, age);
      c = cogroup ag by group.age, b by age;
      cf = foreach c generate $1.$0;

      The relation c has the schema as:

      bytearray, bag{tuple(tuple(bytearray, bytearray), bag{tuple(bytearray, bytearray, bytearray)})}, bag{tuple(bytearray, bytearray, bytearray)}

      so for c, $1.$0 is the first field of each tuple in the bag "ag", which is the tuple group(name, age). However, $1.$0.$0 and $1.$0.$0.$0 return that same tuple instead of dereferencing any deeper. In fact, we can append an arbitrary number of ".$0" after $1.$0 and still stay at the same position.

      The reason for this interesting "black hole" is that dereferencing a bag automatically creates another bag structure. After we obtain the "group(name, age)" tuple from the bag "ag", a bag wrapper is added around the tuple, so it becomes

      bag{tuple(tuple(bytearray, bytearray))}

      Then no matter how many dereferences are appended, this structure cannot be changed since every dereference just "takes off" the outer bag wrapper and "puts on" the same bag wrapper.

      For the same reason, the following script can also produce the same "black hole":

      cf = foreach c generate $1.$1.$0. ... (arbitrary number of ".$0")
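
The behaviour described above can be modelled outside Pig. The following is a minimal plain-Python sketch, not Pig code: it assumes a bag can be represented as a list of tuples, and that bag dereference projects one field from every tuple and wraps each value in a new one-field tuple inside a new bag, as the description says. The helper name `deref_bag` and the sample values are hypothetical.

```python
def deref_bag(bag, i):
    """Model of bag.$i: project field i from every tuple and
    re-wrap each projected value as a one-field tuple in a new bag."""
    return [(t[i],) for t in bag]


# Hypothetical contents of the bag "ag" in one row of c:
# each tuple is (group tuple (name, age), inner bag of a's rows).
ag = [
    (("alice", 20), [("alice", 20, 3.5)]),
    (("bob", 21), [("bob", 21, 3.1)]),
]

once = deref_bag(ag, 0)        # models $1.$0
twice = deref_bag(once, 0)     # models $1.$0.$0
thrice = deref_bag(twice, 0)   # models $1.$0.$0.$0

# The second variant from the report: $1.$1 followed by any number of .$0
inner = deref_bag(ag, 1)       # models $1.$1
inner_again = deref_bag(inner, 0)

print(once)
print(once == twice == thrice)   # the dereference is a fixed point
print(inner == inner_again)
```

In this model, once the projected bag has a single field, projecting `$0` again rebuilds exactly the same bag, which matches the reported "black hole".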

        Activity

        Daniel Dai added a comment -

        The major problem here is that we don't have a way to slice a bag horizontally. Consider the following bag:
        bag: {(1,{(a),(b)}),(2,{(c),(d)})}
        We can only slice the bag vertically:
        bag.$0 gives all first elements of the bag, and the resulting data structure can only be a bag: {(1),(2)}
        Similarly, bag.$1 gives:
        {({(a),(b)}),({(c),(d)})}

        I guess what you want is the ability to access an individual cell inside a bag. This requires the ability to slice the bag horizontally, such as bag[0]=(1,{(a),(b)}), after which you could refer to bag[0].$1={(a),(b)}. However, this is not what a bag is designed to be. A bag is a collection of tuples in which no order is defined, so you can only iterate through a bag. You can access tuples inside a bag with a custom UDF. I don't know how to provide something at the semantic level to access a specific tuple inside a bag. I would suggest providing more built-in UDFs for bag processing, such as:
        1. GetTupleInBag(bag, i): get the ith tuple
        2. GetFirstTupleWithValue(bag, j, value): get the first tuple that carries value as its jth column
        Both UDFs need to iterate through the bag to get the specific element, the time complexity is O
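
As a rough illustration of the two suggested UDFs, here is a plain-Python sketch, not an actual Pig UDF. It assumes a bag is modelled as a list of tuples; the function names mirror the names proposed in the comment, but their exact behaviour (for example returning `None` when nothing matches) is an assumption.

```python
from itertools import islice


def get_tuple_in_bag(bag, i):
    """Model of the proposed GetTupleInBag(bag, i): return the i-th tuple
    in iteration order, or None if the bag has fewer than i + 1 tuples."""
    return next(islice(iter(bag), i, None), None)


def get_first_tuple_with_value(bag, j, value):
    """Model of the proposed GetFirstTupleWithValue(bag, j, value):
    return the first tuple whose j-th field equals value, else None."""
    for t in bag:
        if t[j] == value:
            return t
    return None


# The bag from the comment: {(1,{(a),(b)}),(2,{(c),(d)})}
bag = [(1, [("a",), ("b",)]), (2, [("c",), ("d",)])]

vertical = [(t[0],) for t in bag]          # models bag.$0
first = get_tuple_in_bag(bag, 0)           # models bag[0]
match = get_first_tuple_with_value(bag, 0, 2)

print(vertical)
print(first[1])   # models bag[0].$1
print(match)
```

Both helpers only ever walk the bag in iteration order, so "the i-th tuple" is meaningful only for whatever order the iteration happens to produce.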

        JArod Wen added a comment -

        Thanks Daniel for your comments. Besides slicing a bag horizontally, another way of thinking about "dereferencing a bag within a bag" leads to the logic of flattening a nested bag. Since a bag is an unordered set of tuples, when all tuples inside have the same schema and one of the fields is a bag, it should be doable to extract the fields of the inner bag.

        For example, using an example extending the one you have provided:

        bag: {(1,{(a, 0.3), (b, 0.4)}), (2,{(c, 0.5), (d, 0.6)})}.

        The dereference of bag.$1.$0 may have the output of

        new_bag: {({(a), (b)}), ({(c), (d)})}.

        So here the order still does not matter. This should be different from a horizontal slice, where the order really matters. What do you think?
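
The proposed semantics can be sketched in plain Python (a model only, not Pig code). It assumes a bag is a list of tuples and that bag.$1.$0 would apply the inner projection to every inner bag independently; `deref_nested` is a hypothetical helper name.

```python
def deref_nested(bag, outer, inner):
    """Model of the proposed bag.$outer.$inner where field `outer` is a bag:
    for every outer tuple, project field `inner` from each inner tuple."""
    return [([(t[inner],) for t in row[outer]],) for row in bag]


# The example bag from this comment.
bag = [
    (1, [("a", 0.3), ("b", 0.4)]),
    (2, [("c", 0.5), ("d", 0.6)]),
]

new_bag = deref_nested(bag, 1, 0)
print(new_bag)
```

The projection is applied per tuple, so the result does not depend on the order in which either the outer or the inner bag is iterated.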

        Jonathan Coveney added a comment -

        I actually think I get what Jarod means, and agree. Let's say you have a bag

        b:bag{t:tuple(x:int, b:bag{t:tuple(a:int,b:int,c:int)})}

        It'd be nice to be able to do
        b.$0.$1 in order to grab that inner bag. You could, alternately, do b.$0, flatten it, then access the $0 field, but that is way more clunky.

        I'll look around and see how hard this would be to do (probably not terribly difficult). The question is more whether we should support this (and I would say we should).

        Daniel Dai added a comment -

        It is semantically right if this involves a flatten. We would then need to limit the usage to foreach, since it is the only operator that has the notion of flatten. I am a little worried that people may misuse it, but I am open to it.
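
One way to read "involves a flatten" is sketched below in plain Python (a model, not Pig code). It assumes rows are tuples, a bag is a list of tuples, and flattening a bag field produces one output row per inner tuple; `flatten_field` and the sample values are hypothetical.

```python
def flatten_field(rows, i):
    """Model of flattening bag field i inside a foreach: emit one output
    row per inner tuple, splicing the inner tuple's fields into the row."""
    out = []
    for row in rows:
        for inner in row[i]:
            out.append(row[:i] + inner + row[i + 1:])
    return out


# Hypothetical rows with schema (x:int, b:bag{t:tuple(a:int, b:int, c:int)})
rows = [
    (1, [(10, 11, 12), (20, 21, 22)]),
    (2, [(30, 31, 32)]),
]

flat = flatten_field(rows, 1)
print(flat)

# After the flatten, the inner field "a" is an ordinary column.
a_values = [(r[1],) for r in flat]
print(a_values)
```

In this model the flatten changes the number of rows, which is one reason the behaviour would be tied to an operator that already has flatten semantics.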

        JArod Wen added a comment -

        Actually, now that I am rethinking this problem, I prefer Daniel's opinion.

        This may be a question of whether we can assume that the bag is typed. In the general case, no assumption can be made about the schema within the bag, so flatten() is necessary in order to get inside a bag of bags.

        However, if the parser knows that it is a typed bag, b.$0.$1 should be preferred.


          People

          • Assignee: Unassigned
          • Reporter: JArod Wen
          • Votes: 1
          • Watchers: 2
