  Hadoop Common / HADOOP-15206

BZip2 drops and duplicates records when input split size is small


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.3, 3.0.0
    • Fix Version/s: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 2.7.6, 3.0.3
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      BZip2 can drop and duplicate records when the input split size is small. I confirmed that this issue happens when the input split size is between 1 byte and 4 bytes.

      I am seeing the following two problem behaviors.

      1. Dropped record:

      BZip2 skips the first record in the input file when the input split size is small.

       

      Set the split size to 3 bytes and tested loading 100 records (0, 1, 2, ..., 99):

      2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(317)) - splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3 count=99

      > The input format read only 99 records instead of 100.
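
      For context, a minimal sketch of generating such a 100-record bzip2 input with Hadoop's BZip2Codec is shown below. This is an illustration only, not code from the attached patches; the file name test.bz2 simply matches the one in the log above.

      import java.io.OutputStreamWriter;
      import java.io.Writer;
      import java.nio.charset.StandardCharsets;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.compress.BZip2Codec;
      import org.apache.hadoop.util.ReflectionUtils;

      // Hypothetical helper class for the example, not part of the patch.
      public class MakeBzip2TestFile {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.getLocal(conf);

          // Write 100 records (0, 1, 2, ..., 99), one per line, bzip2-compressed.
          BZip2Codec codec = ReflectionUtils.newInstance(BZip2Codec.class, conf);
          try (Writer writer = new OutputStreamWriter(
              codec.createOutputStream(fs.create(new Path("test.bz2"))),
              StandardCharsets.UTF_8)) {
            for (int i = 0; i < 100; i++) {
              writer.write(i + "\n");
            }
          }
        }
      }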

       

      2. Duplicated record:

      Two input splits contain the same BZip2 record when the input split size is small.

       

      Set the split size to 1 byte and tested loading 100 records (0, 1, 2, ..., 99):

       

      2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1 count=99
      2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 at position 8
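
      The log lines above come from TestTextInputFormat#verifyPartitions (the old mapred API). As an illustration only, a roughly equivalent check with the newer mapreduce API is sketched below: force a tiny maximum split size, read every split, and verify that each record appears exactly once (a missing record is a drop, a count above one is a duplicate). The class name and variables here are invented for the example.

      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptID;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

      // Hypothetical helper class for the example, not part of the patch.
      public class Bzip2SplitCheck {
        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration());
          FileInputFormat.addInputPath(job, new Path("test.bz2"));
          // Force splits of at most a few bytes; sizes between 1 and 4 bytes
          // reproduce the dropped/duplicated records described above.
          FileInputFormat.setMaxInputSplitSize(job, 3);

          TextInputFormat format = new TextInputFormat();
          List<InputSplit> splits = format.getSplits(job);

          // Count how many times each record is read across all splits.
          Map<String, Integer> seen = new HashMap<>();
          for (InputSplit split : splits) {
            TaskAttemptContextImpl ctx = new TaskAttemptContextImpl(
                job.getConfiguration(), new TaskAttemptID());
            RecordReader<LongWritable, Text> reader =
                format.createRecordReader(split, ctx);
            reader.initialize(split, ctx);
            while (reader.nextKeyValue()) {
              seen.merge(reader.getCurrentValue().toString(), 1, Integer::sum);
            }
            reader.close();
          }

          // Each of the 100 records (0..99) should be seen exactly once.
          for (int i = 0; i < 100; i++) {
            Integer count = seen.get(String.valueOf(i));
            if (count == null) {
              System.out.println("dropped record: " + i);
            } else if (count > 1) {
              System.out.println("duplicated record: " + i + " x" + count);
            }
          }
        }
      }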
      

       

      I experienced this error when executing a Spark (Spark SQL) job under the following conditions:

      • The input files are small (around 1 KB each)
      • The Hadoop cluster has many slave nodes (able to launch many executor tasks)

       

      Attachments

        1. HADOOP-15206.001.patch
          2 kB
          Aki Tanaka
        2. HADOOP-15206.002.patch
          3 kB
          Aki Tanaka
        3. HADOOP-15206.003.patch
          3 kB
          Aki Tanaka
        4. HADOOP-15206.004.patch
          6 kB
          Aki Tanaka
        5. HADOOP-15206.005.patch
          6 kB
          Aki Tanaka
        6. HADOOP-15206.006.patch
          4 kB
          Aki Tanaka
        7. HADOOP-15206.007.patch
          3 kB
          Aki Tanaka
        8. HADOOP-15206.008.patch
          4 kB
          Aki Tanaka
        9. HADOOP-15206-test.patch
          1 kB
          Aki Tanaka

      People

        Assignee: Aki Tanaka (tanakahda)
        Reporter: Aki Tanaka (tanakahda)
        Votes: 0
        Watchers: 14