Pig / PIG-847

Setting twoLevelAccessRequired field in a bag schema should not be required to access fields in the tuples of the bag

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      In the outputSchema method of a UDF, users no longer need to set the two-level access flag for a bag schema. Pig only takes vertical slices of a bag: a user can refer to a particular field of a bag, but there is no way to refer to a particular row of a bag.

      Description

      Currently Pig interprets the result type of a relation as a bag. However, the schema of the relation directly contains the schema describing the fields in the tuples of the relation. In contrast, when a UDF returns a bag, when there is a bag in the input data, or when the user creates a bag constant, the schema of the bag has a single field schema, which is that of the tuple; the tuple's schema holds the types of the fields. To access the fields from the bag directly in such a case, using something like <bagname>.<fieldname> or <bag>.<fieldposition>, the schema of the bag must have twoLevelAccess set to true so that Pig's type system can traverse the tuple schema and get to the field in question. This is confusing; we should see whether we can avoid needing this extra flag. One possible solution is to treat all bags the same way, whether they represent relations or real bags. Another is to introduce a special "relation" datatype for the result type of a relation, so that the bag type is used only for true bags. In that case, a bag schema would always need to have a tuple schema describing the fields.
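
The bag-to-tuple-to-field traversal described above can be sketched in plain Java. This is only an illustration under stated assumptions: `BagSchemaSketch`, `Field`, `Type`, `resolve`, and `sampleBag` are hypothetical names for a simplified schema tree, not Pig's `Schema` or `FieldSchema` classes. It shows that a lookup such as `<bag>.<fieldname>` can always step through the bag's single tuple schema without consulting any flag on the bag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a schema tree; not Pig's Schema class.
public class BagSchemaSketch {
    enum Type { INT, TUPLE, BAG }

    static final class Field {
        final String alias;
        final Type type;
        final List<Field> children = new ArrayList<>();

        Field(String alias, Type type) {
            this.alias = alias;
            this.type = type;
        }

        Field add(Field child) {
            children.add(child);
            return this;
        }
    }

    // Resolve <bag>.<fieldname>: always step through the bag's single tuple
    // schema to reach the named field. No flag on the bag is consulted.
    static Field resolve(Field bag, String fieldName) {
        if (bag.type != Type.BAG || bag.children.size() != 1
                || bag.children.get(0).type != Type.TUPLE) {
            throw new IllegalArgumentException("expected a bag wrapping one tuple schema");
        }
        Field tuple = bag.children.get(0);
        for (Field f : tuple.children) {
            if (fieldName.equals(f.alias)) {
                return f;
            }
        }
        throw new IllegalArgumentException("no field named " + fieldName);
    }

    // A bag "bg" whose tuples have two int fields, a0 and a1.
    static Field sampleBag() {
        Field tuple = new Field(null, Type.TUPLE)
                .add(new Field("a0", Type.INT))
                .add(new Field("a1", Type.INT));
        return new Field("bg", Type.BAG).add(tuple);
    }

    public static void main(String[] args) {
        Field f = resolve(sampleBag(), "a1");
        System.out.println(f.alias + ": " + f.type);
    }
}
```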

          Activity

          Daniel Dai added a comment -

          After code review and prototyping, I don't think we need twoLevelAccess (Schema.twoLevelAccessRequired). The reasons are:
          1. twoLevelAccess only exists in the logical layer. The physical layer has no notion of twoLevelAccess. No matter what the value of twoLevelAccess is, we generate the same physical plan.
          2. We do two-level access for all bag access. I have not found any case where we want to access the enclosing tuple of the bag directly.

          Here is one example. Suppose we have a UDF which generates a bag:

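          import java.io.IOException;

          import org.apache.pig.EvalFunc;
          import org.apache.pig.data.DataBag;
          import org.apache.pig.data.DataType;
          import org.apache.pig.data.DefaultBagFactory;
          import org.apache.pig.data.DefaultTupleFactory;
          import org.apache.pig.data.Tuple;
          import org.apache.pig.impl.logicalLayer.FrontendException;
          import org.apache.pig.impl.logicalLayer.schema.Schema;
          import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

          class GenBag extends EvalFunc<DataBag> {
              @Override
              public DataBag exec(Tuple input) throws IOException {
                  DataBag result = DefaultBagFactory.getInstance().newDefaultBag();
                  // Build one tuple: (x, x*x), where x is the first input field
                  Tuple t = DefaultTupleFactory.getInstance().newTuple();
                  t.append(input.get(0));
                  t.append(((Integer)input.get(0))*((Integer)input.get(0)));
                  result.add(t);
                  return result;
              }
              @Override
              public Schema outputSchema(Schema input) {
                  try {
                      // Tuple schema: two int fields, both reusing the alias of the first input field
                      Schema tupleSchema = new Schema();
                      for (int i=0;i<2;i++)
                          tupleSchema.add(new FieldSchema(input.getField(0).alias, null, DataType.INTEGER));
                      // Bag schema: a single field, which is the tuple schema
                      Schema bagSchema = new Schema();
                      bagSchema.add(new FieldSchema(null, tupleSchema, DataType.TUPLE));
                      bagSchema.setTwoLevelAccessRequired(false); // Play with twoLevelAccess
                      return new Schema(new FieldSchema(this.getClass().getSimpleName(), bagSchema, DataType.BAG));
                  } catch (FrontendException e) {
                      return null;
                  }
              }
          }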
          

          If we have a script:

          a = load '1.txt' as (a0:int, a1:int);
          b = foreach a generate GenBag(a0, a1) as bg;
          c = foreach b generate bg.$0;
          dump c;
          

          The goal of twoLevelAccess seems to be to control the meaning of bg.$0: whether it means the tuple or the first field of the tuple. In practice, however, we only see users project the items inside the tuple. In fact, in the current code, even if we set twoLevelAccess to false, we still cannot project the tuple. So keeping twoLevelAccess is meaningless and confusing. I propose to remove twoLevelAccess: every bag implicitly contains tuples, and a bag projection implicitly goes to the items inside the tuple.
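
The proposed meaning of bg.$0 can be modeled without Pig at all. The sketch below is an illustration only: it assumes a bag is represented as a list of tuples and a tuple as a list of fields, and `BagProjectionSketch` and `project` are hypothetical names. Projecting a position takes a vertical slice, producing a bag of single-field tuples; there is no operation here that selects the enclosing tuple or a particular row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: a bag is a list of tuples, a tuple is a list of fields.
public class BagProjectionSketch {
    // Models bg.$index: keep only the field at the given position in every tuple.
    static List<List<Object>> project(List<List<Object>> bag, int index) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> tuple : bag) {
            List<Object> projected = new ArrayList<>();
            projected.add(tuple.get(index));
            out.add(projected);
        }
        return out;
    }

    public static void main(String[] args) {
        // A bag holding the tuples (2, 4) and (3, 9)
        List<List<Object>> bg = Arrays.asList(
                Arrays.<Object>asList(2, 4),
                Arrays.<Object>asList(3, 9));
        System.out.println(project(bg, 0)); // prints [[2], [3]]
    }
}
```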

          Ashutosh Chauhan added a comment -

          I propose to remove twoLevelAccess, all bag implicitly contain tuple, and bag projection implicitly goes to the item inside tuple.

          +1 for removing twoLevelAccess and all the confusion it causes. Will this decision have any bearing on bags holding other types? People have suggested a datatype for a collection of objects (like integer, long, etc.). If we mandate that bags necessarily contain tuples, are we eliminating the possibility of implementing bags that contain other types?

          Alan Gates added a comment -

          I'm 100% behind removing twoLevelAccess, but I don't want to break compatibility. Rather than removing calls like Schema.isTwoLevelAccessRequired, we should mark them as deprecated and make them do nothing.
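
A minimal sketch of that compatibility approach, assuming a hypothetical `SchemaLike` class rather than Pig's actual `Schema`: the accessor names are kept so existing UDF code still compiles, but the setter ignores its argument. Returning a constant `false` from the getter is an assumption made for this illustration, not something the thread specifies.

```java
public class DeprecatedFlagSketch {
    // Hypothetical stand-in for a schema class; not Pig's Schema.
    static class SchemaLike {
        /** @deprecated the flag is ignored; bag access always goes through the tuple. */
        @Deprecated
        public void setTwoLevelAccessRequired(boolean required) {
            // Intentionally empty: kept only so existing callers keep compiling.
        }

        /** @deprecated always returns false in this sketch (an assumption). */
        @Deprecated
        public boolean isTwoLevelAccessRequired() {
            return false;
        }
    }

    public static void main(String[] args) {
        SchemaLike schema = new SchemaLike();
        schema.setTwoLevelAccessRequired(true); // has no effect
        System.out.println(schema.isTwoLevelAccessRequired());
    }
}
```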

          Daniel Dai added a comment -

          Yes, I will mark Schema.isTwoLevelAccessRequired as deprecated rather than remove it. This is exactly what Santhosh suggested as well.

          Daniel Dai added a comment -

          PIG-847-1.patch removes the twoLevelAccess flag in the new logical plan.

          Daniel Dai added a comment -

          Review request: https://reviews.apache.org/r/361/

          Richard Ding added a comment -

          +1.

          Note the schema display change between 0.8 and 0.9 for group/cogroup:

          0.8:

          b: {group: int,a: {a0: int,a1: int,a2: int}}
          

          0.9:

          b: {group: int,a: {(a0: int,a1: int,a2: int)}}
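
The difference between the two displays is only whether the tuple inside the bag is shown with parentheses. The sketch below reproduces both strings with a hypothetical formatter; `SchemaDisplaySketch`, `fields`, `bag`, and `grouped` are illustrative names, and this is not how Pig itself renders schemas.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical formatter that mimics the two display styles shown above.
public class SchemaDisplaySketch {
    // Renders int fields as "a0: int,a1: int,..."
    static String fields(List<String> names) {
        return names.stream()
                .map(n -> n + ": int")
                .collect(Collectors.joining(","));
    }

    // Renders a bag; showTuple wraps the fields in "(...)" to make the tuple explicit.
    static String bag(String alias, List<String> names, boolean showTuple) {
        String inner = fields(names);
        return alias + ": {" + (showTuple ? "(" + inner + ")" : inner) + "}";
    }

    // Renders the grouped relation b from the example above.
    static String grouped(boolean showTuple) {
        return "b: {group: int,"
                + bag("a", Arrays.asList("a0", "a1", "a2"), showTuple) + "}";
    }

    public static void main(String[] args) {
        System.out.println(grouped(false)); // 0.8-style display
        System.out.println(grouped(true));  // 0.9-style display
    }
}
```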
          
          Daniel Dai added a comment -

          Patch committed to trunk.


            People

            • Assignee: Daniel Dai
            • Reporter: Pradeep Kamath