Pig / PIG-3352

Bzip2TextInputFormat can duplicate records across splits

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      If a bz2 block boundary falls in the middle of a record that is terminated by a bare carriage return (CR), the next record is duplicated. The compressed stream position is updated at the same moment that a CR is seen without a following line feed (LF). Because of the way position within the compressed stream is reported, the reader incorrectly concludes that only the CR has been read into the next compression block, so it goes on to process the next record, which is also processed by the consumer of the next split.
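
      The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Pig's actual Bzip2TextInputFormat code: plain byte offsets in uncompressed text stand in for bz2 block positions, the functions read_line, read_split and read_all are hypothetical, and the buggy flag assumes the faulty logic amounts to treating a boundary crossing on a CR-terminated record as if only the CR lookahead had crossed.

```python
def read_line(data, pos):
    """Return (line, next_pos, bare_cr) for the line starting at pos.

    bare_cr is True when the line ended with CR not followed by LF.
    """
    i = pos
    while i < len(data) and data[i] not in "\r\n":
        i += 1
    line = data[pos:i]
    if i < len(data) and data[i] == "\r":
        if data[i + 1:i + 2] == "\n":
            return line, i + 2, False   # CRLF terminator
        return line, i + 1, True        # bare CR terminator
    return line, min(i + 1, len(data)), False  # LF or end of data


def read_split(data, start, end, buggy):
    """Read the records owned by the split [start, end).

    Convention (assumed for this sketch): a non-first split discards its
    first, possibly partial, line, and every split keeps reading while
    the reported position is <= end.
    """
    pos = start
    if start != 0:
        _, pos, _ = read_line(data, pos)  # previous split reads this line
    reported = pos
    records = []
    while reported <= end and pos < len(data):
        line, next_pos, bare_cr = read_line(data, pos)
        records.append(line)
        crossed = pos <= end < next_pos   # this record spans the boundary
        if buggy and bare_cr and crossed:
            # Modeled bug (assumption): the crossing is attributed to the
            # CR lookahead alone, so the reader still looks "inside" the split.
            reported = end
        else:
            reported = next_pos
        pos = next_pos
    return records


def read_all(data, boundary, buggy):
    """Concatenate the output of two splits that meet at `boundary`."""
    return (read_split(data, 0, boundary, buggy)
            + read_split(data, boundary, len(data), buggy))


cr_data = "aaaa\rbbbb\rcccc\rdddd\r"   # boundary 7 falls inside "bbbb"
lf_data = "aaaa\nbbbb\ncccc\ndddd\n"

fixed = read_all(cr_data, 7, buggy=False)
broken = read_all(cr_data, 7, buggy=True)
lf_case = read_all(lf_data, 7, buggy=True)
print(fixed, broken, lf_case)
```

      In this model the record after the one spanning the boundary ("cccc") is emitted by both splits when the position is misreported, while LF-terminated input is unaffected because no lookahead past the terminator is needed.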

      Attachments

        1. blockEndingInRecordWithCR.txt.bz2 (267 kB, Jason Darrell Lowe)

            People

              Assignee: Unassigned
              Reporter: Jason Darrell Lowe (jlowe)
              Votes: 0
              Watchers: 3