
Key: HADOOP-4012
Type: New Feature
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Abdul Qadeer
Reporter: Abdul Qadeer
Votes: 2
Watchers: 24
Hadoop Common

Providing splitting support for bzip2 compressed files

Created: 23/Aug/08 03:39 AM   Updated: 08/Oct/09 04:04 AM
Component/s: io
Affects Version/s: 0.21.0
Fix Version/s: 0.21.0

Time Tracking:
Not Specified

File Attachments (all text files, licensed for inclusion in ASF works):
  File                          Date                 Uploader       Size
  C4012-12.patch                2009-09-02 12:06 AM  Chris Douglas  55 kB
  C4012-13.patch                2009-09-03 09:04 AM  Abdul Qadeer   52 kB
  C4012-14.patch                2009-09-08 05:13 AM  Chris Douglas  61 kB
  Hadoop-4012-version1.patch    2008-11-22 07:58 AM  Abdul Qadeer   51 kB
  Hadoop-4012-version10.patch   2009-08-04 10:23 AM  Abdul Qadeer   51 kB
  Hadoop-4012-version11.patch   2009-08-06 05:49 PM  Abdul Qadeer   51 kB
  Hadoop-4012-version2.patch    2008-11-24 02:16 AM  Abdul Qadeer   52 kB
  Hadoop-4012-version3.patch    2008-11-26 02:29 AM  Abdul Qadeer   52 kB
  Hadoop-4012-version4.patch    2008-11-28 05:45 AM  Abdul Qadeer   52 kB
  Hadoop-4012-version5.patch    2009-03-31 02:39 PM  Abdul Qadeer   51 kB
  Hadoop-4012-version6.patch    2009-04-06 11:09 AM  Abdul Qadeer   51 kB
  Hadoop-4012-version7.patch    2009-05-14 12:27 PM  Abdul Qadeer   67 kB
  Hadoop-4012-version8.patch    2009-05-27 06:20 AM  Abdul Qadeer   70 kB
  Hadoop-4012-version9.patch    2009-06-01 11:08 AM  Abdul Qadeer   70 kB

Hadoop Flags: Reviewed
Release Note: BZip2 files can now be split.
Resolution Date: 10/Sep/09 08:52 PM


Description
Hadoop assumes that compressed input cannot be split, mainly because many codecs need the whole input stream to decompress successfully. In that case Hadoop prepares only one split per compressed file, with the lower split limit at 0 and the upper limit at the end of the file. As a consequence, one compressed file goes to a single mapper. This works around the codec limitation mentioned above, but it substantially reduces the parallelism that splitting would otherwise allow.
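
The split behaviour described above can be sketched as follows. This is a minimal illustration, not Hadoop's actual input-format code: the class `SplitPlanSketch`, the method `planSplits`, and the fixed `splitSize` parameter are all hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of split planning; not Hadoop's implementation. */
final class SplitPlanSketch {

    /**
     * Returns the [start, end) byte ranges planned for one input file.
     * A non-splittable file yields exactly one range covering the whole file.
     */
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (fileLength <= 0) {
            return splits; // empty file: nothing to process
        }
        if (!splittable) {
            // Lower limit 0, upper limit end of file: the whole file goes to one mapper.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(start + splitSize, fileLength)});
        }
        return splits;
    }
}
```

With these assumptions, a 1000-byte file and a 300-byte split size give one range when the file is treated as non-splittable and four ranges when it is splittable, which is the loss of parallelism the description refers to.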

BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This creates an opportunity: instead of sending one BZip2-compressed file to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block is processed by only one mapper and, ultimately, all blocks of the file are processed. (By processing we mean the actual use of the uncompressed data coming out of the codec in a mapper.)
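
One way to satisfy this correctness criterion is to let a block belong to the split in which the block starts. The sketch below illustrates that rule only; it is an assumption made here and is not taken from the patch. The class `BlockOwnershipSketch` and its methods are hypothetical, and block positions are simplified to byte offsets even though real bzip2 blocks are bit-aligned.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a block-ownership rule; not the patch's code. */
final class BlockOwnershipSketch {

    /** Index of the split whose [start, end) range contains blockStart, or -1 if none does. */
    static int ownerOf(long blockStart, long[][] splits) {
        for (int i = 0; i < splits.length; i++) {
            if (blockStart >= splits[i][0] && blockStart < splits[i][1]) {
                return i;
            }
        }
        return -1;
    }

    /** Groups block start offsets by owning split; result index i holds the blocks of split i. */
    static List<List<Long>> assign(long[] blockStarts, long[][] splits) {
        List<List<Long>> owned = new ArrayList<>();
        for (int i = 0; i < splits.length; i++) {
            owned.add(new ArrayList<>());
        }
        for (long blockStart : blockStarts) {
            int owner = ownerOf(blockStart, splits);
            if (owner < 0) {
                throw new IllegalArgumentException(
                    "block at " + blockStart + " is outside every split");
            }
            owned.get(owner).add(blockStart);
        }
        return owned;
    }
}
```

Because the split ranges are contiguous and do not overlap, every block start falls into exactly one range, so each block is handled by one mapper and no block is skipped. A block that starts in one split and ends in the next is still read in full by the mapper that owns its start.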

We are writing the code to implement this functionality. Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 can easily use the splitting support. The details of these changes will be posted when we submit the code.
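
As a rough illustration of how a compression interface could expose this capability to any codec, the sketch below uses a marker sub-interface. All types here (`CodecSketch`, `SplittableCodecSketch`, `WholeStreamCodec`, `BlockCodec`, `CodecCapabilitySketch`) are hypothetical stand-ins written for this example; they are not Hadoop's interfaces and do not reflect the signatures in the submitted patch.

```java
/** Hypothetical stand-in for a generic compression codec. */
interface CodecSketch {
    String name();
}

/** Hypothetical marker: codecs whose compressed blocks decompress independently. */
interface SplittableCodecSketch extends CodecSketch {
}

/** Example codec that needs the whole stream, so it is not splittable. */
final class WholeStreamCodec implements CodecSketch {
    public String name() { return "whole-stream"; }
}

/** Example block-based codec that opts in to splitting. */
final class BlockCodec implements SplittableCodecSketch {
    public String name() { return "block-based"; }
}

final class CodecCapabilitySketch {
    /**
     * Decides whether a file may be split. A null codec stands for
     * uncompressed input, which is assumed here to be always splittable.
     */
    static boolean isSplittable(CodecSketch codec) {
        return codec == null || codec instanceof SplittableCodecSketch;
    }
}
```

The point of this design is that the splitting decision depends only on the capability a codec declares, so another block-based codec could opt in without the caller knowing about it by name.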



Subversion Commits
Repository Revision Date User Message
ASF #813581 Thu Sep 10 20:51:48 UTC 2009 cdouglas HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer
Files Changed
MODIFY /hadoop/common/trunk/CHANGES.txt
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/bzip2/BZip2Constants.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/BlockDecompressorStream.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/DecompressorStream.java
ADD /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/SplittableCompressionCodec.java
MODIFY /hadoop/common/trunk/src/test/core/org/apache/hadoop/io/compress/TestCodec.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/CompressionInputStream.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/BZip2Codec.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/fs/FSInputChecker.java
ADD /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/SplitCompressionInputStream.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/GzipCodec.java

Repository Revision Date User Message
ASF #813585 Thu Sep 10 20:58:34 UTC 2009 cdouglas HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer
Files Changed
MODIFY /hadoop/mapreduce/trunk/lib/hadoop-core-0.21.0-dev.jar
MODIFY /hadoop/mapreduce/trunk/lib/hadoop-core-test-0.21.0-dev.jar

Repository Revision Date User Message
ASF #813587 Thu Sep 10 20:58:49 UTC 2009 cdouglas HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer
Files Changed
MODIFY /hadoop/hdfs/trunk/lib/hadoop-core-test-0.21.0-dev.jar
MODIFY /hadoop/hdfs/trunk/lib/hadoop-core-0.21.0-dev.jar