PIG-2014: SAMPLE shouldn't be pushed up

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0, 0.10.0
    • Fix Version/s: 0.9.0, 0.10.0
    • Component/s: None
    • Labels: None
    • Release Note:
      A new annotation, @Nondeterministic, is introduced to allow UDF authors to mark their UDFs as non-deterministic.

      A non-deterministic UDF is one that can produce different results when invoked on the same input. Examples include getCurrentTime() and RANDOM.

      Certain Pig optimizations depend on UDFs being deterministic. It is therefore very important for correctness that non-deterministic UDFs be annotated as such.

    Description

      Consider the following code:

      tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
      grouped   = GROUP tfidf_all BY doc_id;
      vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
      DUMP vectors;
      

      This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The number of reduce output records should be exactly the number of documents, which turns out to be 18,863 in this case. All well and good.

      The strangeness comes when you add a SAMPLE command:

      sampled = SAMPLE vectors 0.0012;
      DUMP sampled;
      

      Running this results in 1,513 reduce output records. The number of reduce output records should be much closer to 22 or 23 (i.e., 0.0012 * 18,863 ≈ 22.6).

      Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in front of the GROUP. It shouldn't push that filter,
      since the UDF the filter relies on is non-deterministic.
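
      To see why the pushed-up filter inflates the output, here is a rough Python sketch (not Pig code) comparing the two plans. It uses the record and document counts reported above, but it assumes records are spread evenly across documents, which the issue does not state; all function names are made up for illustration.

```python
import random

P = 0.0012               # sampling fraction from the script above
NUM_DOCS = 18_863        # documents (groups) reported in the issue
NUM_RECORDS = 1_428_280  # input records reported in the issue

# ASSUMPTION: records are spread as evenly as possible across documents.
# The real per-document distribution is not given in the issue.
base, extra = divmod(NUM_RECORDS, NUM_DOCS)
doc_sizes = [base + 1] * extra + [base] * (NUM_DOCS - extra)


def sample_after_group(sizes, p, rng):
    """Intended plan: GROUP first, then keep each group with probability p."""
    return sum(1 for _ in sizes if rng.random() < p)


def sample_before_group(sizes, p, rng):
    """Pushed-up plan: keep each record with probability p, then GROUP.

    A document still yields a group if at least one of its records survives.
    """
    return sum(1 for n in sizes if any(rng.random() < p for _ in range(n)))


def expected_after_group(sizes, p):
    """Expected number of groups kept when sampling whole groups."""
    return p * len(sizes)


def expected_before_group(sizes, p):
    """Expected number of groups left when sampling individual records."""
    # P(at least one of n records survives) = 1 - (1 - p)^n
    return sum(1 - (1 - p) ** n for n in sizes)


if __name__ == "__main__":
    rng = random.Random(2014)  # fixed seed so runs are repeatable
    print("expected, sample after group: ", expected_after_group(doc_sizes, P))
    print("expected, sample before group:", expected_before_group(doc_sizes, P))
    print("simulated, sample after group: ", sample_after_group(doc_sizes, P, rng))
    print("simulated, sample before group:", sample_before_group(doc_sizes, P, rng))
```

      Under the even-split assumption, the pushed-up plan is expected to keep on the order of 1,600 documents, versus about 22.6 for the intended plan. That is the same order of magnitude as the 1,513 records reported above; a skewed real distribution would shift the exact figure.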

      Quick fix: if you add "-t PushUpFilter" to your command line when invoking pig, the PushUpFilter optimization is turned off and this won't happen.

      Attachments

        1. PIG-2014.5.patch
          32 kB
          Dmitriy V. Ryaboy
        2. PIG-2014.4.patch
          38 kB
          Dmitriy V. Ryaboy
        3. PIG-2014.3.patch
          2 kB
          Daniel Dai
        4. PIG-2014.2.patch
          11 kB
          Dmitriy V. Ryaboy
        5. PIG-2014.patch
          7 kB
          Dmitriy V. Ryaboy

            People

              Assignee: Dmitriy V. Ryaboy (dvryaboy)
              Reporter: Jacob Perkins (thedatachef)
              Votes: 0
              Watchers: 2
