Pig / PIG-4059 Pig on Spark / PIG-4857

Last record is missing in STREAM operator

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels: None

    Description

      This bug is similar to PIG-4842.

      Scenario:

      cat input.txt
      1
      1
      2
      

      Pig script:

      REGISTER myudfs.jar;
      A = LOAD 'input.txt' USING myudfs.DummyCollectableLoader() AS (id); 
      B = GROUP A by $0 USING 'collected';    -- (1, {(1),(1)}), (2,{(2)})
      C = STREAM B THROUGH ` awk '{
           print $0;
      }'`;
      DUMP C;
      

      Expected Result:

      (1,{(1),(1)})
      (2,{(2)})
      

      Actual Result:

      (1,{(1),(1)})
      

      The last record, (2,{(2)}), is missing.

      Root Cause:
      When the predecessor sets the endOfAllInput flag to true, it is still buffering the last record, which is the input of the STREAM operator. POStream then finds that endOfAllInput is true and treats the input as finished, when in fact the last input has not been consumed yet.
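
The root cause above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `CollectedGroup`, `stream_buggy`, and `stream_fixed` are hypothetical names invented for this illustration, not Pig classes, and the code does not reproduce the contents of PIG-4857.patch. It only shows how a consumer that stops as soon as it sees an end-of-input flag can lose a record that its predecessor still holds in a buffer.

```python
class CollectedGroup:
    """Toy predecessor (hypothetical): emits a group only when the next key
    arrives, so the last group stays buffered until it is asked for again."""

    def __init__(self, records):
        self._it = iter(records)
        self._key = None
        self._bag = []
        self.end_of_all_input = False

    def next(self):
        """Return a completed (key, bag) pair, or None if nothing is ready."""
        for rec in self._it:
            if self._bag and rec != self._key:
                out = (self._key, self._bag)
                self._key, self._bag = rec, [rec]
                return out
            self._key = rec
            self._bag.append(rec)
        if not self.end_of_all_input:
            # The flag goes up while the last group is still buffered.
            self.end_of_all_input = True
            return None
        if self._bag:
            out = (self._key, self._bag)
            self._bag = []
            return out
        return None


def stream_buggy(pred):
    """Stops as soon as the flag is set, so the buffered group is lost."""
    out = []
    while True:
        rec = pred.next()
        if rec is not None:
            out.append(rec)
        if pred.end_of_all_input:
            break
    return out


def stream_fixed(pred):
    """Drains the predecessor's buffer after the flag is set, then stops."""
    out = []
    while True:
        rec = pred.next()
        if rec is not None:
            out.append(rec)
            continue
        if pred.end_of_all_input:
            rec = pred.next()
            while rec is not None:
                out.append(rec)
                rec = pred.next()
            break
    return out


print(stream_buggy(CollectedGroup([1, 1, 2])))  # [(1, [1, 1])]
print(stream_fixed(CollectedGroup([1, 1, 2])))  # [(1, [1, 1]), (2, [2])]
```

With the input from the scenario (1, 1, 2), the first variant returns only the group for key 1, matching the actual result reported here, while the second also returns the group for key 2.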

      Attachments

        1. PIG-4857.patch
          1 kB
          Xianda Ke

            People

              Assignee: Xianda Ke (kexianda)
              Reporter: Xianda Ke (kexianda)
              Votes: 0
              Watchers: 2
