Pig / PIG-272

Failure running complex script with streaming

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels: None

      Description

      The following script fails (stack is further down):

      define CMD `perl identity.pl`;
      define CMD1 `perl identity.pl`;
      A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
      B = stream A through CMD;
      store B into 'B1';
      C = stream B through CMD1;
      D = JOIN B by name, C by name;
      store D into 'D1';

      If I remove the intermediate store, the script works fine. Also, if I replace the streaming commands with other operators such as filter and foreach, it works even with the intermediate store.

      1. PIG-272_0_20080621.patch
        11 kB
        Arun C Murthy
      2. PIG-272_test.pig
        0.9 kB
        Arun C Murthy
      3. split.pl
        0.4 kB
        Arun C Murthy

        Activity

        Olga Natkovich created issue -
        Olga Natkovich made changes -
        Field Original Value New Value
        Assignee Arun C Murthy [ acmurthy ]
        Olga Natkovich added a comment -

        Arun helped to diagnose the problem. The issue is that the following sequence

        B = stream A through CMD;
        store B into 'B1';

        triggers the optimization, and as a result the store uses BinaryStorage to write the results of the first job.

        When the second job starts to run, it realizes that it can reuse the results and tries to load them using BinaryStorage as well. That is wrong and causes exceptions, since the tuples don't have the structure expected by the second script.

        The solution is to attach the original store function to the materialized results; however, the code changes for it are quite ugly.
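The mismatch described in this comment can be illustrated with a toy model. This is only a sketch in Python, not Pig code: the function names, the sample rows, and the assumption that a pass-through loader yields opaque single-field records are all hypothetical and chosen for illustration.

```python
# Toy model only: these functions are hypothetical stand-ins, not Pig classes.

def store_passthrough(raw_chunks):
    """Write streaming output as opaque bytes, without parsing any fields."""
    return b"".join(raw_chunks)

def load_passthrough(data):
    """Read the data back as opaque records: one single-field tuple per line."""
    return [(line,) for line in data.splitlines()]

def load_delimited(data, sep=b"\t"):
    """Read the data back as structured tuples: one field per column."""
    return [tuple(line.split(sep)) for line in data.splitlines()]

def join_key(record, position=0, expected_fields=3):
    """Pick the join key, failing if the tuple lacks the declared structure."""
    if len(record) != expected_fields:
        raise ValueError(f"expected {expected_fields} fields, got {len(record)}")
    return record[position]

# Made-up rows in the (name, age, gpa) shape used by the script.
stream_output = [b"alice\t20\t3.5\n", b"bob\t22\t3.1\n"]
materialized = store_passthrough(stream_output)

structured = load_delimited(materialized)   # what the second script expects
opaque = load_passthrough(materialized)     # what a pass-through load yields

print([join_key(r) for r in structured])
try:
    join_key(opaque[0])
except ValueError as err:
    print("join fails on opaque records:", err)
```

The same bytes are fine for handing data to a streaming command untouched, but a downstream operator that addresses fields by name or position needs the structured view.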

        Arun C Murthy added a comment (edited) -

        Sigh, attaching the original store function isn't enough.

        The problem is that Pig currently re-executes the entire pipeline and doesn't use the existing results on HDFS for the JOIN in the above example. When that happens, the StreamingCommand's output-spec is still set up as 'BinaryStorage', which results in this error.

        Arun C Murthy added a comment -

        To clarify the above comment: it seems that the 'materialized results' of the first of the two resulting Map-Reduce jobs aren't being used by the second. Instead, the second job re-executes the entire pipeline, which is clearly inefficient. Thus, it looks like the existing code for tracking and reusing a previous job's results has a bug.

        Arun C Murthy added a comment -

        Attached fix. The patch ensures we deep-copy the StreamingCommand before optimizing it and reverts the optimization piecemeal (i.e., for input and output separately).

        The test cases are quite complex/convoluted and are pretty hard to convert to unit tests, which is why I've attached them here and propose we integrate them into our end-to-end tests...
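The idea behind the fix (copy the command before optimizing it, and undo the optimization for input and output independently) might look roughly like the following. This is a Python sketch and not the contents of the attached patch: the `StreamingCommand` fields, the `optimize` flags, and the revert helpers are hypothetical names, and `PigStorage` is used only as an illustrative default spec.

```python
import copy
from dataclasses import dataclass

PASSTHROUGH = "BinaryStorage"

@dataclass
class StreamingCommand:
    """Hypothetical stand-in for a streaming command definition."""
    executable: str
    input_spec: str = "PigStorage"
    output_spec: str = "PigStorage"

def optimize(command, input_safe, output_safe):
    """Return an optimized deep copy; the shared definition is never mutated."""
    optimized = copy.deepcopy(command)
    if input_safe:
        optimized.input_spec = PASSTHROUGH
    if output_safe:
        optimized.output_spec = PASSTHROUGH
    return optimized

def revert_input(optimized, original):
    """Undo the optimization on the input side only."""
    optimized.input_spec = original.input_spec

def revert_output(optimized, original):
    """Undo the optimization on the output side only."""
    optimized.output_spec = original.output_spec

cmd = StreamingCommand("perl identity.pl")

# Optimize both sides for one use of the command...
first_use = optimize(cmd, input_safe=True, output_safe=True)
# ...then revert just the output side, e.g. because a later operator
# needs structured tuples from this command.
revert_output(first_use, cmd)

print(cmd)        # original definition is untouched
print(first_use)  # input still optimized, output restored
```

Because each use of the command works on its own copy, a second execution of the pipeline does not inherit an output-spec that was rewritten for an earlier job.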

        Arun C Murthy made changes -
        Attachment split.pl [ 12384473 ]
        Attachment PIG-272_0_20080621.patch [ 12384471 ]
        Attachment PIG-272_test.pig [ 12384472 ]
        Arun C Murthy made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Olga Natkovich added a comment -

        Thanks, Arun. I will be testing your patch today.

        Olga Natkovich added a comment -

        I committed the changes. I ran all existing unit and end-to-end tests as well as the end-to-end tests provided by Arun. They all passed.

        Thanks, Arun, for fixing this issue.

        Olga Natkovich made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Olga Natkovich made changes -
        Fix Version/s 0.1.0 [ 12312848 ]
        Alan Gates made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: Arun C Murthy
          • Reporter: Olga Natkovich