Pig / PIG-496

project of bags from complex data causes failures

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      A = load 'complex data' as (x: bag{});
      B = foreach A generate x.($1, $2);

      produces stack trace:

      2008-10-14 15:11:07,639 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (reduce) task_200809241441_9923_r_000000
      java.lang.NullPointerException
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:215)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:166)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:252)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:222)
      at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:134)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

      Pradeep suspects that the problem is in src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POProject.java; line 374

        Activity

        Alan Gates added a comment -

        If you run a script like the above now (on version 0.7), it does not fail, but instead gives the error message "ERROR 1026: Attempt to fetch field 0 from schema of size 0". This is at least a decent error message. The problem now is that we allow positional notation to work in cases where the schema is undefined, which it is when you say bag{}. So $0 should work.

        Daniel Dai added a comment -

        We need to decide how to load a bag with an empty schema, e.g.:

        A = load 'data.txt' as (x: bag{});
        

        Currently, we load x as a bag, but we don't do any interpretation inside x. So what we load is a bag of bytearrays.

        This, however, causes problems when we do further processing on the bag. Assume that in data.txt the bag actually contains three-item tuples:

        B = foreach A generate x.($1, $2); 
        

        We expect this to project the 2nd and 3rd fields of each tuple. But in the current code, x is a bag with a single bytearray field, so this results in an error.

        B = foreach A generate flatten(x);
        

        We expect this to flatten x into 3 fields. But in the current code, we cannot even flatten x, since x does not contain tuples.

        The problem stems from two sources:
        1. Currently a bag is required to contain tuples in some cases but not in others. This is inconsistent. We should make it a rule, so that loading a bag always means loading a bag of tuples.

        2. When we load a tuple with an unknown number of fields (the tuple's inner schema is unknown), we assume it contains only one bytearray field. However, it is not possible to cast one bytearray field to multiple fields later. Recall what happens when we load a file with an unknown schema:

        A = load 'data.txt';
        

        We actually load multiple fields separated by a delimiter, each of type bytearray. When we load a bag with an empty schema, we can mimic this behavior.

        So I propose two changes:
        1. Loading a bag implies loading a bag of tuples, even when the bag's inner schema is empty.
        2. When we convert a bytearray to a tuple with no inner schema, we no longer assume one field. We will take the comma as the delimiter (in the case of UTF8StorageConverter) and produce a tuple of multiple bytearray fields.

        Assume data.txt is:

        {(1,2,3),(4,5,6)}

        After this change:

        A = load 'data.txt' as (x: bag{});

        describe A:
        We get: bag{}

        dump A:
        We get: {(1,2,3),(4,5,6)}

        This is not a bag of bytearrays, but a bag of three-item tuples.
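
The proposed behavior can be illustrated with a small sketch. This is not Pig's converter code; `parse_bag` and `project` are hypothetical helper names, written in Python only to show the idea. It assumes flat tuples (no nested bags, tuples, or maps) and no escaping of commas or parentheses inside a field. Each tuple is split on commas into multiple bytearray-like fields, so positional projection such as x.($1, $2) has something to select from.

```python
from typing import List, Tuple

Row = Tuple[bytes, ...]


def parse_bag(text: str) -> List[Row]:
    """Parse a flat bag literal such as '{(1,2,3),(4,5,6)}'.

    Every tuple is split on commas and each field is kept as raw bytes,
    mimicking "a tuple of multiple bytearray fields".
    """
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("not a bag literal: %r" % text)
    body = text[1:-1].strip()

    rows: List[Row] = []
    pos = 0
    while pos < len(body):
        # Skip whitespace between tuples.
        if body[pos].isspace():
            pos += 1
            continue
        if body[pos] != "(":
            raise ValueError("expected '(' at offset %d" % pos)
        end = body.find(")", pos)
        if end == -1:
            raise ValueError("unterminated tuple")
        inner = body[pos + 1:end]
        # An empty tuple '()' has no fields; otherwise split on the delimiter.
        fields = tuple(f.strip().encode("utf-8") for f in inner.split(",")) if inner else ()
        rows.append(fields)
        pos = end + 1
        # Consume the comma that separates this tuple from the next one.
        if pos < len(body):
            if body[pos] != ",":
                raise ValueError("expected ',' at offset %d" % pos)
            pos += 1
    return rows


def project(bag: List[Row], *positions: int) -> List[Row]:
    """Positional projection over every tuple in the bag, like x.($1, $2)."""
    return [tuple(row[p] for p in positions) for row in bag]


bag = parse_bag("{(1,2,3),(4,5,6)}")
print(bag)
print(project(bag, 1, 2))
```

Under the old behavior described above, each tuple would effectively hold a single bytearray field, so projecting position 1 or 2 would have nothing to fetch. With comma splitting, the same projection yields the 2nd and 3rd fields of each tuple.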

        Olga Natkovich added a comment -

        This looks good with one modification: the fields don't have to be bytearrays; they can be of any type.

        Daniel Dai added a comment -

        PIG-496-1.patch depends on PIG-730. Without it, there will be a frontend exception.

        Richard Ding added a comment -

        +1

        Daniel Dai added a comment -

        Review notes:
        https://reviews.apache.org/r/272/

        Patch committed to trunk.


          People

          • Assignee:
            Daniel Dai
            Reporter:
            Olga Natkovich