Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0, 0.10.0
    • Fix Version/s: 0.9.0, 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Hide
      A new annotation, @Nondeterministic, is introduced to allow UDF authors to mark their UDFs as such.

      A non-deterministic UDF is one that can produce different results when invoked on the same input. Examples of non-deterministic behavior might be, for example, getCurrentTime() or RANDOM.

      Certain Pig optimizations depend on UDFs being deterministic. It is therefore very important for correctness that non-deterministic UDFs be annotated as such.
      Show
      A new annotation, @Nondeterministic, is introduced to allow UDF authors to mark their UDFs as such. A non-deterministic UDF is one that can produce different results when invoked on the same input. Examples of non-deterministic behavior might be, for example, getCurrentTime() or RANDOM. Certain Pig optimizations depend on UDFs being deterministic. It is therefore very important for correctness that non-deterministic UDFs be annotated as such.

      Description

      Consider the following code:

      tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
      grouped   = GROUP tfidf_all BY doc_id;
      vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
      DUMP vectors;
      

      This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The reduce output records should be exactly the number of documents, which turn out to be 18,863 in this case. All well and good.

      The strangeness comes when you add a SAMPLE command:

      sampled = SAMPLE vectors 0.0012;
      DUMP sampled;
      

      Running this results in 1,513 reduce output records. The reduce output records be much much closer to 22 or 23 records (eg. 0.0012*18863).

      Evidently, Pig rewrites SAMPLE into filter, and then pushes that filter in front of the group. It shouldn't push that filter
      since the UDF is non-deterministic.

      Quick fix: If you add "-t PushUpFilter" to your command line when invoking pig this won't happen.

      1. PIG-2014.patch
        7 kB
        Dmitriy V. Ryaboy
      2. PIG-2014.2.patch
        11 kB
        Dmitriy V. Ryaboy
      3. PIG-2014.3.patch
        2 kB
        Daniel Dai
      4. PIG-2014.4.patch
        38 kB
        Dmitriy V. Ryaboy
      5. PIG-2014.5.patch
        32 kB
        Dmitriy V. Ryaboy

        Issue Links

          Activity

          Hide
          Dmitriy V. Ryaboy added a comment -

          Proposal for a general-case fix:

          add a @Nondeterministic annotation for UDFs, have the PushUpFilter check whether the udf is deterministic or not when considering whether pushing up is ok. Annotate the filter UDF that sample is rewritten to, accordingly.

          Show
          Dmitriy V. Ryaboy added a comment - Proposal for a general-case fix: add a @Nondeterministic annotation for UDFs, have the PushUpFilter check whether the udf is deterministic or not when considering whether pushing up is ok. Annotate the filter UDF that sample is rewritten to, accordingly.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Implemented suggested approach to fixing this. Please review.

          Show
          Dmitriy V. Ryaboy added a comment - Implemented suggested approach to fixing this. Please review.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Since this is a bug it should go into 0.9.

          Then again since it's sort of a new feature maybe not and we should just check for RANDOM specifically in 0.9?

          For my part I don't see the harm in the annotation and would prefer that it goes into 0.9.

          Show
          Dmitriy V. Ryaboy added a comment - Since this is a bug it should go into 0.9. Then again since it's sort of a new feature maybe not and we should just check for RANDOM specifically in 0.9? For my part I don't see the harm in the annotation and would prefer that it goes into 0.9.
          Hide
          Daniel Dai added a comment -

          I think this is more a bug fix, should go into 0.9.

          However, the script will trigger rule FilterAboveForeach not PushUpFilter. So the patch does not fix the problem. The fix should go to all these rules: PushUpFilter, PushDownForEachFlatten, FilterAboveForeach.

          Show
          Daniel Dai added a comment - I think this is more a bug fix, should go into 0.9. However, the script will trigger rule FilterAboveForeach not PushUpFilter. So the patch does not fix the problem. The fix should go to all these rules: PushUpFilter, PushDownForEachFlatten, FilterAboveForeach.
          Hide
          Dmitriy V. Ryaboy added a comment -

          I'll add to the other rules – but for the record, I looked at the plan and saw the sample not being pushed up after my patch .

          Show
          Dmitriy V. Ryaboy added a comment - I'll add to the other rules – but for the record, I looked at the plan and saw the sample not being pushed up after my patch .
          Hide
          Daniel Dai added a comment -

          I think it is because in TestNewPlanFilterAboveForeach, we only invoke some of the rules. If you do an explain, you will see filter still pushed up.

          Show
          Daniel Dai added a comment - I think it is because in TestNewPlanFilterAboveForeach, we only invoke some of the rules. If you do an explain, you will see filter still pushed up.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Daniel,
          So this is interesting. I took my fix out, left the test in, and the test still passed – because, as you correctly pointed out, TestNewPlanFilterAboveForeach only invokes a few of the rules. If I add PushUpFilter to MyPlanOptimizer within that test, my new test starts failing if the fix is not present, and passes if the fix is present. So the PushUpFilter is definitely at least part of what's causing the movement of Filter in this case.

          So I need to fix up PushDownForEachFlatten and FilterAboveForeach, and I need to fix my test .

          Show
          Dmitriy V. Ryaboy added a comment - Daniel, So this is interesting. I took my fix out, left the test in, and the test still passed – because, as you correctly pointed out, TestNewPlanFilterAboveForeach only invokes a few of the rules. If I add PushUpFilter to MyPlanOptimizer within that test, my new test starts failing if the fix is not present, and passes if the fix is present. So the PushUpFilter is definitely at least part of what's causing the movement of Filter in this case. So I need to fix up PushDownForEachFlatten and FilterAboveForeach, and I need to fix my test .
          Hide
          Dmitriy V. Ryaboy added a comment -

          This addresses PushUpFilter and FilterAboveForeach, and fixes the SAMPLE issue.

          I didn't tackle PushDownForeachFlatten – there's a lot going on there and I'm not sure I understand it all. We should open a separate ticket for making sure that optimization does not break on nondeterministic operations.

          Show
          Dmitriy V. Ryaboy added a comment - This addresses PushUpFilter and FilterAboveForeach, and fixes the SAMPLE issue. I didn't tackle PushDownForeachFlatten – there's a lot going on there and I'm not sure I understand it all. We should open a separate ticket for making sure that optimization does not break on nondeterministic operations.
          Hide
          Daniel Dai added a comment -

          +1 for PIG-2014.2.patch.

          I will submit another patch for PushDownForeachFlatten.

          Show
          Daniel Dai added a comment - +1 for PIG-2014 .2.patch. I will submit another patch for PushDownForeachFlatten.
          Hide
          Daniel Dai added a comment -

          Also change filterHasNonDeterministicUdf to planHasNonDeterministicUdf. I can reuse it in PushDownForeachFlatten.

          Show
          Daniel Dai added a comment - Also change filterHasNonDeterministicUdf to planHasNonDeterministicUdf. I can reuse it in PushDownForeachFlatten.
          Hide
          Daniel Dai added a comment -

          PIG-2014.3.patch address PushDownFlattenForEach

          Show
          Daniel Dai added a comment - PIG-2014 .3.patch address PushDownFlattenForEach
          Hide
          Dmitriy V. Ryaboy added a comment -

          Attaching combined final version, complete with the renamed method.

          Show
          Dmitriy V. Ryaboy added a comment - Attaching combined final version, complete with the renamed method.
          Hide
          Dmitriy V. Ryaboy added a comment -

          An unrelated change snuck into the previously attached file; this is what I am actually committing.

          Show
          Dmitriy V. Ryaboy added a comment - An unrelated change snuck into the previously attached file; this is what I am actually committing.
          Hide
          Dmitriy V. Ryaboy added a comment -

          committed to 0.9 and trunk.

          Show
          Dmitriy V. Ryaboy added a comment - committed to 0.9 and trunk.

            People

            • Assignee:
              Dmitriy V. Ryaboy
              Reporter:
              Jacob Perkins
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development