Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0, 0.10.0
    • Fix Version/s: 0.9.0, 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      A new annotation, @Nondeterministic, is introduced to allow UDF authors to mark their UDFs as such.

      A non-deterministic UDF is one that can produce different results when invoked on the same input. Examples include getCurrentTime() and RANDOM.

      Certain Pig optimizations depend on UDFs being deterministic. It is therefore very important for correctness that non-deterministic UDFs be annotated as such.

      Description

      Consider the following code:

      tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
      grouped   = GROUP tfidf_all BY doc_id;
      vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
      DUMP vectors;
      

      This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The number of reduce output records should be exactly the number of documents, which turns out to be 18,863 in this case. All well and good.

      The strangeness comes when you add a SAMPLE command:

      sampled = SAMPLE vectors 0.0012;
      DUMP sampled;
      

      Running this results in 1,513 reduce output records. The number of reduce output records should be much closer to 22 or 23 (i.e. 0.0012 * 18,863).

      Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in front of the group. It shouldn't push that filter,
      since the UDF the filter calls is non-deterministic.
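
      To see why the position of the filter matters, here is a minimal Python simulation. It is not Pig code: the function names and the small synthetic dataset are made up for illustration, and it assumes SAMPLE behaves like a filter that keeps each item independently with probability p. It compares applying that filter to the grouped output with applying it to the raw records before the group.

```python
import random
from collections import defaultdict


def make_records(num_docs, tokens_per_doc):
    """Build synthetic (doc_id, token) pairs; every doc has the same size."""
    return [(d, t) for d in range(num_docs) for t in range(tokens_per_doc)]


def sample_after_group(records, p, rng):
    """Intended plan: group by doc_id, then keep each whole group with probability p."""
    groups = defaultdict(list)
    for doc_id, token in records:
        groups[doc_id].append(token)
    return {d: toks for d, toks in groups.items() if rng.random() < p}


def sample_before_group(records, p, rng):
    """Pushed-up plan: keep each input record with probability p, then group."""
    groups = defaultdict(list)
    for doc_id, token in records:
        if rng.random() < p:
            groups[doc_id].append(token)
    return dict(groups)


def expected_groups_if_pushed(num_records, num_docs, p):
    """Expected number of non-empty groups when the filter runs before the group.

    Assumes all docs have the same number of records (a simplification).
    A doc survives if at least one of its records survives.
    """
    per_doc = num_records / num_docs
    return num_docs * (1.0 - (1.0 - p) ** per_doc)


if __name__ == "__main__":
    rng = random.Random(0)
    records = make_records(2000, 75)
    after = sample_after_group(records, 0.0012, rng)
    before = sample_before_group(records, 0.0012, rng)
    print("groups when sampling after the group: ", len(after))
    print("groups when sampling before the group:", len(before))
    print("estimate for the reported data set:   ",
          round(expected_groups_if_pushed(1_428_280, 18_863, 0.0012)))
```

      Under the equal-size assumption, the estimate for the numbers reported here comes out at roughly 1,600 groups, the same order of magnitude as the 1,513 records observed, whereas sampling after the group would give about 22 or 23. The pushed-up plan also returns truncated vectors, since only the surviving records of each document reach the group.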

      Quick fix: if you add "-t PushUpFilter" to your command line when invoking pig, this won't happen.

      1. PIG-2014.5.patch
        32 kB
        Dmitriy V. Ryaboy
      2. PIG-2014.4.patch
        38 kB
        Dmitriy V. Ryaboy
      3. PIG-2014.3.patch
        2 kB
        Daniel Dai
      4. PIG-2014.2.patch
        11 kB
        Dmitriy V. Ryaboy
      5. PIG-2014.patch
        7 kB
        Dmitriy V. Ryaboy


            People

            • Assignee:
              Dmitriy V. Ryaboy
              Reporter:
              Jacob Perkins
            • Votes:
              0
              Watchers:
              2
