Issue Details (XML | Word | Printable)

Key: HADOOP-4918
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Zheng Shao
Reporter: Zheng Shao
Votes: 0
Watchers: 6
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Fix bzip2 work with SequenceFile

Created: 19/Dec/08 12:43 AM   Updated: 24/Feb/09 11:37 PM
Return to search
Component/s: io
Affects Version/s: 0.19.0, 0.20.0, 0.21.0
Fix Version/s: 0.19.1

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-4918.3.0.19.patch 2009-01-15 06:07 AM Zheng Shao 15 kB
Text File Licensed for inclusion in ASF works HADOOP-4918.3.0.20.patch 2009-01-15 06:51 AM Zheng Shao 15 kB
Text File Licensed for inclusion in ASF works HADOOP-4918.3.patch 2009-01-14 11:47 PM Zheng Shao 15 kB
Issue Links:
Blocker
 
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 16/Jan/09 07:56 PM


 Description  « Hide
Somehow bzip2 does not work with SequenceFile:
String codec = "org.apache.hadoop.io.compress.BZip2Codec";
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path(output), 
        reader.getKeyClass(), reader.getValueClass(), CompressionType.BLOCK, 
        (CompressionCodec)Class.forName(codec).newInstance());

The stack trace is here:

java.lang.UnsupportedOperationException
        at org.apache.hadoop.io.compress.BZip2Codec.getCompressorType(BZip2Codec.java:80)
        at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:98)
        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:914)
        at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.<init>(SequenceFile.java:1198)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:401)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:329)
        at org.apache.hadoop.mapred.TestSequenceFileBZip.main(TestSequenceFileBZip.java:43)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Zheng Shao added a comment - 19/Dec/08 12:46 AM
A java class that reads a SequenceFile and write it out with specified codec and block size.

This is used to test the time and space efficiency of bzip2 with SequenceFile.


Zheng Shao added a comment - 20/Dec/08 02:19 AM
It seems the reason is that BZip2Codec did not split the Compressor and Stream.

One simple approach to make it work is to have a dummy Compressor and Decompressor class.
So the following 6 functions can be implemented using the dummy classes:

createOutputStream(OutputStream, Compressor)
getCompressorType()
createCompressor()
createInputStream(InputStream, Decompressor)
getDecompressorType()
createDecompressor()

by using

createOutputStream(OutputStream)
createInputStream(InputStream)

Zheng Shao added a comment - 30/Dec/08 02:09 AM
I saw this piece of code in TestCodec.java.

Unfortunately SequenceFileWriter.BlockCompressWriter is not calling close() on the deflateOut for each block. As a result, the codec is not working.

//Necessary to close the stream for BZip2 Codec to write its final output.  Flush is not enough.
    deflateOut.close();

We will probably need to modify BZip2 Codec to make this work.


Zheng Shao added a comment - 13/Jan/09 11:04 AM
Modified the BZip2Codec implementation a bit, and changed the way hadoop interact with the BZip2 implementation to make sure SequenceFile works BZip2Codec.

Zheng Shao added a comment - 14/Jan/09 08:28 AM
Added a test case.

Zheng Shao added a comment - 14/Jan/09 11:47 PM
Added some new files (forgot to svn add last time)

Zheng Shao added a comment - 14/Jan/09 11:48 PM
[exec]
[exec] BUILD SUCCESSFUL
[exec] Total time: 9 minutes 8 seconds
[exec] Starting with /home/zshao/tmp/trunkFindbugsWarnings.xml
[exec] Merging /home/zshao/tmp/patchFindbugsWarnings.xml
[exec]
[exec]
[exec] ======================================================================
[exec] ======================================================================
[exec] Running Eclipse classpath verification.
[exec] ======================================================================
[exec] ======================================================================
[exec]
[exec]
[exec]
[exec]
[exec]
[exec]
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec]
[exec]
[exec]
[exec] ======================================================================
[exec] ======================================================================
[exec] Finished build.
[exec] ======================================================================
[exec] ======================================================================
[exec]
[exec]

BUILD SUCCESSFUL
Total time: 30 minutes 55 seconds


Zheng Shao added a comment - 15/Jan/09 06:07 AM
Patch for branch 0.19.

Abdul Qadeer added a comment - 15/Jan/09 06:18 AM
Zheng,

I went through all of your BZip2 related changes to my previous BZip2Codec code and they are looking fine to me.
so a +1 as far as BZip2 related stuff is concerned.

You just uploaded a new patch for 0.19 branch. Why we need different patches for different branches?


dhruba borthakur added a comment - 15/Jan/09 06:43 AM
Are you guys saying that this patch needs to go into 0.19 as well? But this is a new feature, isn't it?

Zheng Shao added a comment - 15/Jan/09 06:51 AM
Patch for 0.20.

Zheng Shao added a comment - 15/Jan/09 06:52 AM
Log for 0.19 test:

[exec] BUILD SUCCESSFUL
[exec] Total time: 3 minutes 19 seconds
[exec] Starting with /home/zshao/tmp/trunkFindbugsWarnings.xml
[exec] Merging /home/zshao/tmp/patchFindbugsWarnings.xml
[exec]
[exec]
[exec]
[exec]
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec]
[exec]
[exec]
[exec] ======================================================================
[exec] ======================================================================
[exec] Finished build.
[exec] ======================================================================
[exec] ======================================================================
[exec]
[exec]


Zheng Shao added a comment - 15/Jan/09 06:54 AM
@Abdul, the only difference between 0.19 patch and 0.20 patch (trunk patchis the same as 0.20 patch) is a line offset difference. Basically it's essentially the same patch.

@Drhuba, the BZip2Codec does not work with SequenceFile right now in 0.19. I consider that to be a bug. What do you think?


Zheng Shao added a comment - 15/Jan/09 08:10 AM
Log for 0.20 test:

[exec]
[exec] BUILD SUCCESSFUL
[exec] Total time: 3 minutes 56 seconds
[exec] Starting with /home/zshao/tmp/trunkFindbugsWarnings.xml
[exec] Merging /home/zshao/tmp/patchFindbugsWarnings.xml
[exec]
[exec]
[exec] ======================================================================
[exec] ======================================================================
[exec] Running Eclipse classpath verification.
[exec] ======================================================================
[exec] ======================================================================
[exec]
[exec]
[exec]
[exec]
[exec]
[exec]
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec]
[exec]
[exec]
[exec] ======================================================================
[exec] ======================================================================
[exec] Finished build.
[exec] ======================================================================
[exec] ======================================================================
[exec]
[exec]

BUILD SUCCESSFUL
Total time: 14 minutes 57 seconds


Doug Cutting added a comment - 15/Jan/09 05:30 PM
> BZip2Codec does not work with SequenceFile right now in 0.19. I consider that to be a bug.

The standard criteria is whether it is a regression. Did it work in a release prior to 0.19? If so, then it's a regression and should be fixed in 0.19. If not then it's a new feature, and should be added in 0.20.

However sometimes, as an exception, we permit fixes to all-new code, e.g., a new contrib module, that are not regressions, if they have zero chance of causing a regression anywhere else. This patch touches only files that were added in 0.19, and those files were themselves an independent addition (http://svn.apache.org/viewvc?view=rev&revision=680802), so I see no possibility for this creating any regressions in 0.19 and would not oppose treating it as such an exception.


dhruba borthakur added a comment - 15/Jan/09 07:27 PM
Ok, thanks. I will commit this to 0.19, 0.20 and trunk.

dhruba borthakur added a comment - 16/Jan/09 07:56 PM
I just committed this. Thanks Zheng!