Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.1, 2.2.0
    • Fix Version/s: 1.3.0, 2.5.0
    • Component/s: None
    • Labels: None

      Description

      Hadoop uses CBZip2InputStream to decode bzip2 files. However, the implementation is not threadsafe. This is not really a problem for Hadoop MapReduce, because Hadoop runs each task in a separate JVM. But other libraries that use multithreading together with Hadoop's InputFormat, e.g., Spark, can hit exceptions like the following:

      java.lang.ArrayIndexOutOfBoundsException: 6
      org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:729)
      org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:795)
      org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
      org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
      org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
      org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:428)
      java.io.InputStream.read(InputStream.java:101)
      org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
      org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
      org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
      org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
      org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
      org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
      org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
      org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:35)
      scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1000)
      org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
      org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
      org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1077)
      org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1077)
      org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
      org.apache.spark.scheduler.Task.run(Task.scala:51)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      java.lang.Thread.run(Thread.java:724)
      
      1. bzip2-2.diff
        2 kB
        Xiangrui Meng
      2. bzip2.diff
        2 kB
        Xiangrui Meng

          Activity

          Xiangrui Meng added a comment -

          I checked the code in trunk. The class has a static boolean member `skipDecompression`, which indicates whether the stream is decompressing or checking for the next marker. Because the member is static, all instances in a JVM share it.
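
To illustrate why a static flag is a problem once several streams live in the same JVM, here is a minimal sketch. `SharedFlagDecoder` and its methods are hypothetical stand-ins, not the Hadoop class; only the field name `skipDecompression` is taken from the comment above.

```java
// Hypothetical stand-in for a decoder that keeps its mode flag in a static field.
// This is not the Hadoop class; it only mirrors the pattern described above.
class SharedFlagDecoder {
    // One copy per class: every instance in the JVM reads and writes the same flag.
    private static boolean skipDecompression = false;

    void setSkipDecompression(boolean skip) {
        skipDecompression = skip;
    }

    boolean isSkippingDecompression() {
        return skipDecompression;
    }
}

class SharedFlagDemo {
    public static void main(String[] args) {
        SharedFlagDecoder first = new SharedFlagDecoder();
        SharedFlagDecoder second = new SharedFlagDecoder();

        // "first" switches to marker-scanning mode...
        first.setSkipDecompression(true);

        // ...and "second" changes mode too, although nobody touched it.
        System.out.println("second skipping: " + second.isSkippingDecompression());
    }
}
```

With one task per JVM this sharing goes unnoticed. With several reader threads in one JVM, one stream can flip the mode of another in the middle of a block, which is consistent with the decoding failure in the description.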

          Xiangrui Meng added a comment -

          Changed `skipDecompression` from `private static` to `private`.
          Added a private constructor that sets `skipDecompression`.
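
The shape of this change can be sketched as follows. This is an illustration under assumptions, not the attached patch: `PerInstanceDecoder` and the `markerScanner()` factory are hypothetical names, and the real constructor takes more parameters.

```java
// Hypothetical stand-in showing the pattern: an instance flag set through a
// private constructor. Not the actual CBZip2InputStream code.
class PerInstanceDecoder {
    // Instance field: each decoder owns its own flag, so two decoders
    // used by different threads no longer affect each other through it.
    private boolean skipDecompression;

    // Normal entry point: decode the data.
    PerInstanceDecoder() {
        this(false);
    }

    // Private constructor that lets the class choose the mode at creation time.
    private PerInstanceDecoder(boolean skipDecompression) {
        this.skipDecompression = skipDecompression;
    }

    // Hypothetical factory for an instance that only looks for the next marker.
    static PerInstanceDecoder markerScanner() {
        return new PerInstanceDecoder(true);
    }

    boolean isSkippingDecompression() {
        return skipDecompression;
    }
}
```

Making the flag per-instance removes the cross-instance interference. It does not make a single instance safe to share between threads.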

          Xiangrui Meng added a comment -

          Tested on Spark master and Hadoop 1.2.1.

          Sandy Ryza added a comment -

          It seems like there's no good reason for that variable to be static. +1.

          Minor nit:

          +  private CBZip2InputStream(final InputStream in, READ_MODE readMode, boolean skipDecompression)
          +      throws IOException {
          +
               super();
          

          No need for an empty line after the method declaration. I can fix this on commit.

          Xiangrui Meng added a comment -

          Removed the line "Instances of this class are not threadsafe." from the Javadoc.

          Xiangrui Meng added a comment -

          Thanks for reviewing! I removed that empty line in `bzip2-2.diff`.

          Andrew Ash added a comment -

          Thanks Sandy and Xiangrui for fixing this issue!

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #5606 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5606/)
          HADOOP-10614. CBZip2InputStream is not threadsafe (Xiangrui Meng via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595521)

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #562 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/562/)
          HADOOP-10614. CBZip2InputStream is not threadsafe (Xiangrui Meng via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595521)

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1754 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1754/)
          HADOOP-10614. CBZip2InputStream is not threadsafe (Xiangrui Meng via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595521)

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1780 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1780/)
          HADOOP-10614. CBZip2InputStream is not threadsafe (Xiangrui Meng via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1595521)

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java
          Andrew Ash added a comment -

          I can't quite tell – are these Hudson failures a problem?

          Sandy Ryza added a comment -

          Naw, these are typical.

          Andrew Ash added a comment -

          Cool thanks

          Xiangrui Meng added a comment -

          Sandy Ryza, is this going to be included in the next bug-fix releases?


            People

            • Assignee: Xiangrui Meng
            • Reporter: Xiangrui Meng
            • Votes: 0
            • Watchers: 5
