  1. Pig
  2. PIG-4059 Pig on Spark
  3. PIG-4857

Last record is missing in STREAM operator

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels:
      None

      Description

      This bug is similar to PIG-4842.

      Scenario:

      cat input.txt
      1
      1
      2
      

      Pig script:

      REGISTER myudfs.jar;
      A = LOAD 'input.txt' USING myudfs.DummyCollectableLoader() AS (id); 
      B = GROUP A by $0 USING 'collected';    -- (1, {(1),(1)}), (2,{(2)})
      C = STREAM B THROUGH ` awk '{
           print $0;
      }'`;
      DUMP C;
      

      Expected Result:

      (1,{(1),(1)})
      (2,{(2)})
      

      Actual Result:

      (1,{(1),(1)})
      

      The last record is missing...

      Root Cause:
      When the endOfAllInput flag is set to true by the predecessor, the predecessor is still buffering the last record, which is the input to the STREAM operator. POStream then finds that endOfAllInput is true, although in fact the last input has not been consumed yet.
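
The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not Pig's code and not the attached patch, and every name in it (`Plan`, `CollectedGroup`, `stream_buggy`, `stream_fixed`) is hypothetical. It assumes the flag is raised as soon as the last raw record has been read, while the predecessor still holds the final group in its buffer.

```python
class Plan:
    """Hypothetical stand-in for the shared end-of-input flag."""
    def __init__(self):
        self.end_of_all_input = False


class CollectedGroup:
    """Toy predecessor: groups sorted records and buffers the current group."""
    def __init__(self, plan, records):
        self.plan = plan
        self.records = list(records)
        self.pos = 0
        self.key = None
        self.bag = []

    def next(self):
        """Return the next finished group, or None if nothing is ready."""
        while self.pos < len(self.records):
            rec = self.records[self.pos]
            self.pos += 1
            if self.pos == len(self.records):
                # Assumption: the flag goes up once the last raw record is read.
                self.plan.end_of_all_input = True
            if self.bag and rec != self.key:
                done = (self.key, self.bag)
                self.key, self.bag = rec, [rec]  # last group stays buffered
                return done
            self.key = rec
            self.bag.append(rec)
        if self.plan.end_of_all_input and self.bag:
            done = (self.key, self.bag)  # flush the buffered group
            self.bag = []
            return done
        return None


def stream_buggy(records):
    """Checks the flag before draining the predecessor, so the last group is lost."""
    plan = Plan()
    pred = CollectedGroup(plan, records)
    out = []
    while not plan.end_of_all_input:
        group = pred.next()
        if group is None:
            break
        out.append(group)
    return out


def stream_fixed(records):
    """Drains the predecessor until it has nothing left to return."""
    plan = Plan()
    pred = CollectedGroup(plan, records)
    out = []
    while True:
        group = pred.next()
        if group is None:
            break
        out.append(group)
    return out
```

With the input from the scenario, `stream_buggy([1, 1, 2])` yields only the group for key 1, while `stream_fixed([1, 1, 2])` also yields the group for key 2. The point of the sketch is the order of the two checks; how the real fix achieves that ordering is not shown here.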

        Attachments

        1. PIG-4857.patch
          1 kB
          Xianda Ke

              People

              • Assignee:
                kexianda Xianda Ke
                Reporter:
                kexianda Xianda Ke