Pig / PIG-153

Incorrect results when there is a dump in between statements.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Pig + Hadoop

    • Patch Info:
      Patch Available

      Description

      The following scenario occurs with Pig + Hadoop.

      A similar run with Local Pig showed correct results.

      Here is test file data/test/test2.txt:

      a1 1 5700
      b1 2 2001
      c2 2

      I ran the following script step by step:

      grunt> a = load 'data/test/test2.txt';
      grunt> dump a;
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job -----
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: [/user/amiry/data/test/test2.txt:org.apache.pig.builtin.PigStorage()]
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: /tmp/temp1359677959/tmp-246846292:org.apache.pig.builtin.BinStorage
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
      2008-03-18 06:41:55,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: -1
      2008-03-18 06:41:57,472 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 0%
      2008-03-18 06:41:58,477 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 50%
      2008-03-18 06:42:04,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 100%
      (a1, 1, 5700)
      (b1, 2, 2001)
      (c2, 2, )
      grunt> b = filter a by $0 eq 'a1';
      grunt> dump b;
      2008-03-18 06:42:23,881 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job -----
      2008-03-18 06:42:23,881 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: [/tmp/temp1359677959/tmp-246846292:org.apache.pig.builtin.BinStorage]
      2008-03-18 06:42:23,881 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]->[FILTER BY ( [PROJECT $0] eq ['a1'])]]
      2008-03-18 06:42:23,882 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
      2008-03-18 06:42:23,882 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
      2008-03-18 06:42:23,882 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
      2008-03-18 06:42:23,882 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: /tmp/temp1359677959/tmp1851797397:org.apache.pig.builtin.BinStorage
      2008-03-18 06:42:23,882 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
      2008-03-18 06:42:23,882 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
      2008-03-18 06:42:23,882 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: -1
      2008-03-18 06:42:25,938 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 0%
      2008-03-18 06:42:28,946 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 50%
      2008-03-18 06:42:34,963 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 100%
      (a1, 1, 5700)
      grunt> c = filter a by $0 eq 'b1';
      grunt> dump c;
      2008-03-18 06:42:59,884 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job -----
      2008-03-18 06:42:59,884 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: [/tmp/temp1359677959/tmp1851797397:org.apache.pig.builtin.BinStorage]
      2008-03-18 06:42:59,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]->[FILTER BY ( [PROJECT $0] eq ['b1'])]]
      2008-03-18 06:42:59,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
      2008-03-18 06:42:59,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
      2008-03-18 06:42:59,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
      2008-03-18 06:42:59,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: /tmp/temp1359677959/tmp-1157182212:org.apache.pig.builtin.BinStorage
      2008-03-18 06:42:59,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
      2008-03-18 06:42:59,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
      2008-03-18 06:42:59,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: -1
      2008-03-18 06:43:01,964 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 0%
      2008-03-18 06:43:04,974 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 50%
      2008-03-18 06:43:06,980 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 100%
      grunt>

      So c comes back empty, although it should contain (b1, 2, 2001).

        Activity

        Amir Youssefi added a comment -

        This may be a clue:

        Comparing the Input and Output paths above, we see that the computation of "c" uses the output of "b" as its input, even though c depends only on a.

        Pi Song added a comment -

        This one-line fix took me an hour to hunt down!!! The problem is in the MapReduce optimization where already-executed sub-plans are cached and reused as input to new plans.

        The unit test only covers the reported scenario. Beyond that, I couldn't find a non-intrusive way to test it.

        Pi Song added a comment -

        To explain a bit more:

        This fix just changes one line in the caching logic that seems to be obviously wrong.

        -            this.materializedResults.put(pom.sourceLogicalKey,
        +            this.materializedResults.put(plan.getRoot(),
                                                  new MapRedResult(pom.outputFileSpec,
                                                                    pom.reduceParallelism));
        

        Before, we saved the executed output under the logical key of the source data.
        The right logic is to link the executed output to the plan's own output (plan.getRoot()).
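
The effect of the cache key can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: the class `MaterializedCacheSketch`, its `dump` method, and the string keys are hypothetical stand-ins, not Pig's actual classes, and a plain load is modeled as a plan whose source and root are the same node. Keyed by source, the result of "b" overwrites the cached entry for "a", so "c" filters the wrong input; keyed by root, each alias keeps its own entry.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class MaterializedCacheSketch {
    // Rows of data/test/test2.txt from the report.
    private static final List<String> RAW =
            Arrays.asList("a1 1 5700", "b1 2 2001", "c2 2");

    // Hypothetical stand-in for materializedResults: logical key -> stored rows.
    private final Map<String, List<String>> materialized = new HashMap<>();
    private final boolean keyByRoot;

    public MaterializedCacheSketch(boolean keyByRoot) {
        this.keyByRoot = keyByRoot;
    }

    // Runs "root = filter source by keep", reusing a cached result for source
    // if one exists, then caches the output under the chosen key.
    public List<String> dump(String root, String source, Predicate<String> keep) {
        List<String> input = materialized.getOrDefault(source, RAW);
        List<String> output = input.stream().filter(keep).collect(Collectors.toList());
        // keyByRoot == false mimics the old behavior (key = source of the job),
        // keyByRoot == true mimics the patched behavior (key = root of the plan).
        materialized.put(keyByRoot ? root : source, output);
        return output;
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            MaterializedCacheSketch s = new MaterializedCacheSketch(fixed);
            s.dump("a", "a", row -> true);                    // dump a
            s.dump("b", "a", row -> row.startsWith("a1 "));   // dump b
            List<String> c = s.dump("c", "a", row -> row.startsWith("b1 "));
            System.out.println((fixed ? "keyed by root:   " : "keyed by source: ") + c);
        }
    }
}
```

In this model the source-keyed run yields an empty list for "c", matching the report, while the root-keyed run yields the b1 row.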

        Pi Song added a comment -

        This is a trivial bug. Please review it.

        Alan Gates added a comment -

        Fix checked in revision 646189. Thanks Pi.


          People

          • Assignee:
            Pi Song
            Reporter:
            Amir Youssefi