Pig
  1. Pig
  2. PIG-3813

Rank column is assigned different uids everytime when schema is reset

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.12.1, 0.13.0
    • Component/s: impl
    • Labels:
      None

      Description

      When the following script is run, pig goes into an infinite loop. This was reproduced on pig trunk as of March 12, 2014 on apache hadoop 1.2. test_data.txt has been attached.

      test.pig
      tWeek = LOAD '/tmp/test_data.txt' USING PigStorage ('|') AS (WEEK:int, DESCRIPTION:chararray, END_DATE:chararray, PERIOD:int);

      gTWeek = FOREACH tWeek GENERATE WEEK AS WEEK, PERIOD AS PERIOD;

      pWeek = FILTER gTWeek BY PERIOD == 201312;

      pWeekRanked = RANK pWeek BY WEEK ASC DENSE;

      gpWeekRanked = FOREACH pWeekRanked GENERATE $0;
      store gpWeekRanked into 'gpWeekRanked';
      describe gpWeekRanked;
      ---------------------------------------------------

      The res object of class Result, gets its value from leaf.getNextTuple()
      This gets an empty tuple
      ()
      with STATUS_OK.

      SO the while(true) condition never gets an End of Processing (EOP) and so does not exit.

      1. PIG-3813-2.patch
        2 kB
        Cheolsoo Park
      2. PIG-3813-1.patch
        1 kB
        Cheolsoo Park
      3. test_data.txt
        11 kB
        Suhas Satish

        Activity

        Hide
        Cheolsoo Park added a comment -

        Committed to 0.12 branch.

        Show
        Cheolsoo Park added a comment - Committed to 0.12 branch.
        Hide
        Suhas Satish added a comment -

        +1

        Show
        Suhas Satish added a comment - +1
        Hide
        Prashant Kommireddi added a comment -

        +1 to backport

        Show
        Prashant Kommireddi added a comment - +1 to backport
        Hide
        Rohini Palaniswamy added a comment -

        We should certainly backport this to 0.12.1 as it is a major bug for Rank operator.

        Show
        Rohini Palaniswamy added a comment - We should certainly backport this to 0.12.1 as it is a major bug for Rank operator.
        Hide
        Cheolsoo Park added a comment -

        I'd like to backport this into branch-0.12. Please let me know if you object, or I'll commit it in 0.12 tomorrow. Thanks!

        Show
        Cheolsoo Park added a comment - I'd like to backport this into branch-0.12. Please let me know if you object, or I'll commit it in 0.12 tomorrow. Thanks!
        Hide
        Suhas Satish added a comment -

        Thanks for the quick turn around Cheolsoo

        Show
        Suhas Satish added a comment - Thanks for the quick turn around Cheolsoo
        Hide
        Cheolsoo Park added a comment -

        Committed to trunk.
        Thank you Daniel for reviewing the patch!
        Thank you Suhas for reporting the issue!

        Show
        Cheolsoo Park added a comment - Committed to trunk. Thank you Daniel for reviewing the patch! Thank you Suhas for reporting the issue!
        Hide
        Daniel Dai added a comment -

        +1

        Show
        Daniel Dai added a comment - +1
        Hide
        Cheolsoo Park added a comment -

        Updating the title to describe the root cause.

        Show
        Cheolsoo Park added a comment - Updating the title to describe the root cause.
        Hide
        Cheolsoo Park added a comment -

        Here is the 2nd patch that keeps track of the uid of the rank column.

        Show
        Cheolsoo Park added a comment - Here is the 2nd patch that keeps track of the uid of the rank column.
        Hide
        Cheolsoo Park added a comment -

        Daniel Dai, thank you for the comment. Let me try your suggestion.

        Show
        Cheolsoo Park added a comment - Daniel Dai , thank you for the comment. Let me try your suggestion.
        Hide
        Daniel Dai added a comment -

        Seems LORank is generating the wrong uid. One thing missing in LORank is to keep generated uid across session, so a schema reset will not erase the old uid. We can refer to LOCogroup.groupKeyUidOnlySchema to fix LORank.

        Show
        Daniel Dai added a comment - Seems LORank is generating the wrong uid. One thing missing in LORank is to keep generated uid across session, so a schema reset will not erase the old uid. We can refer to LOCogroup.groupKeyUidOnlySchema to fix LORank.
        Hide
        Cheolsoo Park added a comment -

        Attaching a patch. Here is what I found-

        1. PushUpFilter moves up the filter by (pWeek) to above the foreach (gTWeek).
        2. Since pWeek is a direct predecessor of the rank (pWeekRanked), the PushUpFilter changes the output schema of pWeekRanked from rank_pWeek (uid:18) to rank_gTWeek (uid:24).
        3. After this, ColumnPruneVistor visits the 2nd foreach (gpWeekRanked). It checks whether the column $0 exists in the input schema of gpWeekRanked, which is equivalent to the output schema of pWeekRanked as follows-
          if (!inputUids.contains(project.getFieldSchema().uid))
              innerLoadsToRemove.add(innerLoad);
          

          The problem is that rank_pWeek (uid:18) is no longer in the input schema of gpWeekRanked, but rank_gTWeek (uid:24) is. So ColumnPruneVistor removes the column $0 from gpWeekRanked.

        4. Now gpWeekRanked generates empty tuples.

        The attached patch changes the condition in ColumnPruneVistor as follows-

        LogicalSchema.LogicalFieldSchema fs = project.findReferent().getSchema().getField(project.getColNum());
        if (!inputUids.contains(fs.uid)) {
            innerLoadsToRemove.add(innerLoad);
        }
        

        Basically, it uses the schema of the referent instead of that of the field itself in LOGenerate. This way, any changes in its predecessor are reflected correctly.

        I have never worked on this area of code. Please let me know if this fix is not proper.

        Show
        Cheolsoo Park added a comment - Attaching a patch. Here is what I found- PushUpFilter moves up the filter by (pWeek) to above the foreach (gTWeek). Since pWeek is a direct predecessor of the rank (pWeekRanked), the PushUpFilter changes the output schema of pWeekRanked from rank_pWeek (uid:18) to rank_gTWeek (uid:24). After this, ColumnPruneVistor visits the 2nd foreach (gpWeekRanked). It checks whether the column $0 exists in the input schema of gpWeekRanked, which is equivalent to the output schema of pWeekRanked as follows- if (!inputUids.contains(project.getFieldSchema().uid)) innerLoadsToRemove.add(innerLoad); The problem is that rank_pWeek (uid:18) is no longer in the input schema of gpWeekRanked, but rank_gTWeek (uid:24) is. So ColumnPruneVistor removes the column $0 from gpWeekRanked. Now gpWeekRanked generates empty tuples. The attached patch changes the condition in ColumnPruneVistor as follows- LogicalSchema.LogicalFieldSchema fs = project.findReferent().getSchema().getField(project.getColNum()); if (!inputUids.contains(fs.uid)) { innerLoadsToRemove.add(innerLoad); } Basically, it uses the schema of the referent instead of that of the field itself in LOGenerate. This way, any changes in its predecessor are reflected correctly. I have never worked on this area of code. Please let me know if this fix is not proper.
        Hide
        Cheolsoo Park added a comment -

        Suhas Satish helped me to reproduce the issue at the meet-up. Thanks!

        This seems like a column pruner bug. The query runs fine if I add SET pig.optimizer.rules.disabled ColumnMapKeyPrune.

        Assigning to me.

        Show
        Cheolsoo Park added a comment - Suhas Satish helped me to reproduce the issue at the meet-up. Thanks! This seems like a column pruner bug. The query runs fine if I add SET pig.optimizer.rules.disabled ColumnMapKeyPrune . Assigning to me.

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Suhas Satish
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development