  1. Pig
  2. PIG-3352

Bzip2TextInputFormat can duplicate records across splits

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      If a bz2 block boundary falls in the middle of a record that is terminated by a bare carriage return (a carriage return with no line feed after it), the next record is duplicated. The reader updates its compressed stream position at the moment it sees a carriage-return character that is not followed by a line-feed character. Because of the way position within the compressed stream is reported, the reader incorrectly believes it has read only the carriage-return character into the next compression block, so it goes on to process the next record. That record is also processed by the consumer of the next split, so it appears twice in the output.
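
      The conditions described above (records terminated by a bare carriage return, in a bz2 file with more than one compression block) can be set up with a small helper. The following is a minimal sketch for building test input and checking job output for repeated records. It is not part of the original report: the record format, the function names and the record count are hypothetical, and it does not model the reader's position logic.

```python
import bz2
from collections import Counter


def make_cr_records(count):
    """Return `count` distinct records, each ended by a bare CR (no LF).

    The record layout is an arbitrary choice for illustration.
    """
    return b"".join(b"record-%07d-payload\r" % i for i in range(count))


def compress_multiblock(data):
    """Compress with compresslevel=1, the smallest bz2 block size
    (roughly 100 kB of input per block), so that an input of about
    1 MB spans several compression blocks."""
    return bz2.compress(data, compresslevel=1)


def write_input(path, count):
    """Write a bz2 file of CR-terminated records to `path` (hypothetical helper)."""
    with open(path, "wb") as out:
        out.write(compress_multiblock(make_cr_records(count)))


def find_duplicates(records):
    """Return a dict of the records seen more than once, with their counts."""
    counts = Counter(records)
    return {rec: n for rec, n in counts.items() if n > 1}
```

      For example, write_input("cr_records.bz2", 50000) produces roughly 1.1 MB of uncompressed data. Every record in that input is distinct, so any entry returned by find_duplicates on the records loaded from it indicates a record that was read by more than one split.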

              People

              • Assignee:
                Unassigned
              • Reporter:
                Jason Lowe (jlowe)