Pig / PIG-3352

Bzip2TextInputFormat can duplicate records across splits

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      If a bz2 block boundary occurs in the middle of a record that is terminated by a carriage-return, the next record will be duplicated. The compressed stream position is updated at the same moment the reader sees a carriage-return character that is not followed by a line-feed character. Because of the way position is reported within the compressed stream, the reader incorrectly concludes that it has read only the carriage-return character into the next compression block. It therefore goes on to process the next record, which is also processed by the consumer of the next split.
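The lookahead involved here can be illustrated with a toy reader. This is only a sketch and is not the code in Bzip2TextInputFormat: the class and method names (`CrLookaheadSketch`, `CountingStream`, `LineReader`, `bytesConsumedAfterFirstRecord`) are hypothetical, and no bzip2 decoding or split logic is modeled. It shows just one point: to decide whether a carriage-return ends the record by itself or begins a CR LF pair, a reader has to pull one more byte from the stream, so the stream's own byte count runs one byte ahead of the record that was just returned.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CrLookaheadSketch {

    /** Hypothetical wrapper that counts every byte pulled from the wrapped stream. */
    static final class CountingStream extends InputStream {
        private final InputStream in;
        long consumed = 0;

        CountingStream(InputStream in) { this.in = in; }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b >= 0) consumed++;
            return b;
        }
    }

    /** Hypothetical reader for records terminated by CR, LF, or CR LF. */
    static final class LineReader {
        private final InputStream in;
        private int pushedBack = -1; // byte read while checking for LF after a CR

        LineReader(InputStream in) { this.in = in; }

        /** Returns the next record without its terminator, or null at end of stream. */
        String readRecord() throws IOException {
            StringBuilder sb = new StringBuilder();
            boolean sawAny = false;
            while (true) {
                int b;
                if (pushedBack >= 0) {
                    b = pushedBack;
                    pushedBack = -1;
                } else {
                    b = in.read();
                }
                if (b < 0) return sawAny ? sb.toString() : null;
                sawAny = true;
                if (b == '\n') return sb.toString();
                if (b == '\r') {
                    // Lookahead: one extra byte must be read to rule out CR LF.
                    int next = in.read();
                    if (next >= 0 && next != '\n') pushedBack = next;
                    return sb.toString();
                }
                sb.append((char) b);
            }
        }
    }

    /** Bytes pulled from the underlying stream after reading only the first record. */
    static long bytesConsumedAfterFirstRecord(String data) {
        try {
            CountingStream cs = new CountingStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.US_ASCII)));
            new LineReader(cs).readRecord();
            return cs.consumed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // CR-only terminator: 3 data bytes + CR + 1 lookahead byte.
        System.out.println(bytesConsumedAfterFirstRecord("abc\rdef\r")); // 5
        // LF terminator: 3 data bytes + LF, no lookahead needed.
        System.out.println(bytesConsumedAfterFirstRecord("abc\ndef\n")); // 4
    }
}
```

If that one lookahead byte happens to be the first byte of the next bz2 block, a position that is reported per block will change while the reader is still finishing the current record, which is the situation the description refers to.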

          Activity

          Jason Lowe added a comment -

          Discovered this while investigating HADOOP-9622. Attaching a modified version of blockEndingWithCR.txt.bz2, the file used in TestBZip.testBlockHeaderEndingWithCR, in which I simply moved the carriage-return character one word over. The number of records is therefore the same, but when the file is processed with splits a record is duplicated.

          This can also be seen with a simple Pig script that loads and dumps the file: run the script with and without mapred.max.split.size=136498 and compare the two outputs.

          I would expect a bzip2-compressed file full of carriage-return-delimited records to exhibit this as well, since it's likely a block boundary straddles a record ending with just a carriage-return character.


            People

            • Assignee: Unassigned
            • Reporter: Jason Lowe
            • Votes: 0
            • Watchers: 2
