
Key: HADOOP-4012
Type: New Feature
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Abdul Qadeer
Reporter: Abdul Qadeer
Votes: 2
Watchers: 24
Hadoop Common

Providing splitting support for bzip2 compressed files

Created: 23/Aug/08 03:39 AM   Updated: 08/Oct/09 04:04 AM
Component/s: io
Affects Version/s: 0.21.0
Fix Version/s: 0.21.0

Time Tracking:
Not Specified

File Attachments (all text files, licensed for inclusion in ASF works):
Name                         Date                 Uploaded by    Size
C4012-12.patch               2009-09-02 12:06 AM  Chris Douglas  55 kB
C4012-13.patch               2009-09-03 09:04 AM  Abdul Qadeer   52 kB
C4012-14.patch               2009-09-08 05:13 AM  Chris Douglas  61 kB
Hadoop-4012-version1.patch   2008-11-22 07:58 AM  Abdul Qadeer   51 kB
Hadoop-4012-version10.patch  2009-08-04 10:23 AM  Abdul Qadeer   51 kB
Hadoop-4012-version11.patch  2009-08-06 05:49 PM  Abdul Qadeer   51 kB
Hadoop-4012-version2.patch   2008-11-24 02:16 AM  Abdul Qadeer   52 kB
Hadoop-4012-version3.patch   2008-11-26 02:29 AM  Abdul Qadeer   52 kB
Hadoop-4012-version4.patch   2008-11-28 05:45 AM  Abdul Qadeer   52 kB
Hadoop-4012-version5.patch   2009-03-31 02:39 PM  Abdul Qadeer   51 kB
Hadoop-4012-version6.patch   2009-04-06 11:09 AM  Abdul Qadeer   51 kB
Hadoop-4012-version7.patch   2009-05-14 12:27 PM  Abdul Qadeer   67 kB
Hadoop-4012-version8.patch   2009-05-27 06:20 AM  Abdul Qadeer   70 kB
Hadoop-4012-version9.patch   2009-06-01 11:08 AM  Abdul Qadeer   70 kB

Hadoop Flags: Reviewed
Release Note: BZip2 files can now be split.
Resolution Date: 10/Sep/09 08:52 PM


Description
Hadoop assumes that compressed input cannot be split, mainly because many codecs need the whole input stream to decompress successfully. In that case Hadoop prepares only one split per compressed file, with the lower split limit at 0 and the upper limit at the end of the file. As a consequence, each compressed file goes to a single mapper. This works around the codec limitation mentioned above, but it substantially reduces the parallelism that splitting would otherwise provide.
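
To make the cost of this assumption concrete, here is a minimal sketch of split planning for one file. It is not Hadoop's actual input-format code: the class SplitPlanSketch, the method planSplits, and its parameters are hypothetical names chosen for illustration, and real split planning also takes block locations and configured minimum/maximum sizes into account.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of split planning; not Hadoop's FileInputFormat. */
class SplitPlanSketch {

    /**
     * Returns {start, end} byte ranges (end exclusive) for one input file.
     * A non-splittable file always yields the single range [0, fileLength),
     * which is the behaviour described above for compressed input.
     */
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splittable) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        if (!splittable || fileLength <= splitSize) {
            // One split covering the whole file, so one mapper does all the work.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        // Splittable input: cut the file into fixed-size byte ranges.
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(start + splitSize, fileLength)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 1000L; // assumed file size in bytes
        long splitSize = 256L;   // assumed target split size in bytes
        System.out.println("non-splittable: "
                + planSplits(fileLength, splitSize, false).size() + " split(s)");
        System.out.println("splittable:     "
                + planSplits(fileLength, splitSize, true).size() + " split(s)");
    }
}
```

The number of splits bounds the number of mappers that can work on the file at once, which is why a single [0, end) split removes all parallelism within the file.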

BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This is an opportunity: instead of one BZip2-compressed file going to one mapper, chunks of the file can be processed in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block is processed by exactly one mapper and, ultimately, all blocks of the file are processed. (By processing we mean the actual use, in a mapper, of the uncompressed data coming out of the codec.)
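
One possible way to meet this correctness criterion is to let a split own every compressed block whose start offset falls inside the split's byte range. The sketch below illustrates that rule only; it is an assumption for illustration, not a description of what the attached patches do. The class BlockOwnershipSketch and the method assignBlocks are hypothetical, and the block start offsets are supplied by hand here, whereas a real reader would have to locate block boundaries in the compressed stream.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a "block start decides the owner" rule. */
class BlockOwnershipSketch {

    /**
     * For each split {start, end} (end exclusive), collects the block start
     * offsets b with start <= b < end. Because the splits are assumed to be
     * contiguous and non-overlapping, each block lands in exactly one split.
     */
    static List<List<Long>> assignBlocks(long[] blockStarts, long[][] splits) {
        List<List<Long>> owned = new ArrayList<>();
        for (long[] split : splits) {
            List<Long> mine = new ArrayList<>();
            for (long blockStart : blockStarts) {
                if (blockStart >= split[0] && blockStart < split[1]) {
                    mine.add(blockStart);
                }
            }
            owned.add(mine);
        }
        return owned;
    }

    public static void main(String[] args) {
        long[] blockStarts = {4L, 100L, 900L}; // assumed offsets of compressed blocks
        long[][] splits = {{0L, 256L}, {256L, 512L}, {512L, 768L}, {768L, 1000L}};
        List<List<Long>> owned = assignBlocks(blockStarts, splits);
        for (int i = 0; i < splits.length; i++) {
            System.out.println("split " + i + " owns blocks starting at " + owned.get(i));
        }
    }
}
```

Under this rule a mapper would read an owned block to its end even when the block extends past the split's upper limit, and would skip a block that begins before the split's lower limit, since the preceding split owns it. A split may also own no block at all.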

We are writing the code to implement this functionality. Although we have used bzip2 as the example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 can easily use the splitting support. The details of these changes will be posted when we submit the code.
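
As a rough idea of what "extending the compression interfaces" could look like, the sketch below uses a capability interface that a codec implements to declare that its output can be split. All names here (CodecCapabilitySketch, Codec, SplittableCodec, GzipLike, Bzip2Like, isSplittable) are hypothetical and chosen for illustration; the actual interface changes are in the attached patches and may differ.

```java
/** Hypothetical illustration of a codec capability check; not the patch's API. */
class CodecCapabilitySketch {

    /** Stand-in for a compression codec. */
    interface Codec {
        String extension();
    }

    /** Capability interface: implemented only by codecs whose output can be split. */
    interface SplittableCodec extends Codec {
    }

    /** A codec that needs the whole stream, so it does not declare the capability. */
    static final class GzipLike implements Codec {
        public String extension() { return ".gz"; }
    }

    /** A block-oriented codec that declares the capability. */
    static final class Bzip2Like implements SplittableCodec {
        public String extension() { return ".bz2"; }
    }

    /**
     * Decides whether a file may be split. A null codec stands for
     * uncompressed input, which is always splittable in this sketch.
     */
    static boolean isSplittable(Codec codec) {
        return codec == null || codec instanceof SplittableCodec;
    }

    public static void main(String[] args) {
        System.out.println("uncompressed: " + isSplittable(null));
        System.out.println(".gz:          " + isSplittable(new GzipLike()));
        System.out.println(".bz2:         " + isSplittable(new Bzip2Like()));
    }
}
```

With a design along these lines, the split-planning code never names bzip2 directly, so another block-oriented codec could opt in by implementing the same capability interface.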



Abdul Qadeer made changes - 23/Aug/08 03:40 AM
Field Original Value New Value
Description: line breaks inside two paragraphs were removed; the text is otherwise that of the Description section above.
Abdul Qadeer made changes - 23/Aug/08 03:46 AM
Link This issue relates to HADOOP-3646 [ HADOOP-3646 ]
Abdul Qadeer made changes - 23/Aug/08 03:49 AM
Link This issue depends on HADOOP-4010 [ HADOOP-4010 ]
Abdul Qadeer made changes - 23/Aug/08 03:50 AM
Link This issue depends on HADOOP-4010 [ HADOOP-4010 ]
Abdul Qadeer made changes - 23/Aug/08 03:51 AM
Link This issue depends upon HADOOP-4010 [ HADOOP-4010 ]
Abdul Qadeer made changes - 22/Nov/08 07:58 AM
Attachment Hadoop-4012-version1.patch [ 12394477 ]
Abdul Qadeer made changes - 22/Nov/08 07:59 AM
Fix Version/s 0.19.1 [ 12313473 ]
Status Open [ 1 ] Patch Available [ 10002 ]
Abdul Qadeer made changes - 24/Nov/08 02:09 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 24/Nov/08 02:16 AM
Attachment Hadoop-4012-version2.patch [ 12394526 ]
Abdul Qadeer made changes - 24/Nov/08 02:16 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Abdul Qadeer made changes - 25/Nov/08 08:28 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Abdul Qadeer made changes - 26/Nov/08 02:29 AM
Attachment Hadoop-4012-version3.patch [ 12394714 ]
Abdul Qadeer made changes - 26/Nov/08 02:30 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Abdul Qadeer made changes - 28/Nov/08 05:45 AM
Attachment Hadoop-4012-version4.patch [ 12394873 ]
Abdul Qadeer made changes - 28/Nov/08 05:45 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 28/Nov/08 05:46 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Chris Douglas made changes - 30/Nov/08 02:46 AM
Affects Version/s 0.19.0 [ 12313211 ]
Fix Version/s 0.19.1 [ 12313473 ]
Fix Version/s 0.20.0 [ 12313438 ]
Chris Douglas made changes - 04/Dec/08 04:22 AM
Fix Version/s 0.20.0 [ 12313438 ]
Status Patch Available [ 10002 ] Open [ 1 ]
Abdul Qadeer made changes - 31/Mar/09 02:39 PM
Attachment Hadoop-4012-version5.patch [ 12404241 ]
Suhas Gogate made changes - 31/Mar/09 07:35 PM
Link This issue relates to HADOOP-5601 [ HADOOP-5601 ]
Suhas Gogate made changes - 31/Mar/09 07:35 PM
Link This issue relates to HADOOP-5602 [ HADOOP-5602 ]
Abdul Qadeer made changes - 02/Apr/09 01:28 PM
Fix Version/s 0.19.2 [ 12313650 ]
Affects Version/s 0.19.2 [ 12313650 ]
Release Note Support to process Hadoop split BZip2 files.
Status Open [ 1 ] Patch Available [ 10002 ]
Abdul Qadeer made changes - 06/Apr/09 11:09 AM
Attachment Hadoop-4012-version6.patch [ 12404711 ]
Abdul Qadeer made changes - 06/Apr/09 11:15 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 06/Apr/09 11:16 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Abdul Qadeer made changes - 16/Apr/09 11:31 AM
Link This issue depends on HADOOP-5213 [ HADOOP-5213 ]
Abdul Qadeer made changes - 14/May/09 12:27 PM
Attachment Hadoop-4012-version7.patch [ 12408129 ]
Abdul Qadeer made changes - 15/May/09 06:22 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 15/May/09 06:23 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Johan Oskarsson made changes - 22/May/09 01:49 PM
Fix Version/s 0.19.2 [ 12313650 ]
Fix Version/s 0.21.0 [ 12313563 ]
Abdul Qadeer made changes - 27/May/09 06:19 AM
Affects Version/s 0.19.2 [ 12313650 ]
Affects Version/s 0.21.0 [ 12313563 ]
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 27/May/09 06:20 AM
Attachment Hadoop-4012-version8.patch [ 12409127 ]
Abdul Qadeer made changes - 27/May/09 06:21 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Abdul Qadeer made changes - 01/Jun/09 11:08 AM
Attachment Hadoop-4012-version9.patch [ 12409554 ]
Abdul Qadeer made changes - 02/Jun/09 08:55 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 02/Jun/09 08:55 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Chris Douglas made changes - 17/Jun/09 09:48 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Abdul Qadeer made changes - 04/Aug/09 10:19 AM
Status Open [ 1 ] In Progress [ 3 ]
Abdul Qadeer made changes - 04/Aug/09 10:23 AM
Attachment Hadoop-4012-version10.patch [ 12415479 ]
Abdul Qadeer made changes - 04/Aug/09 10:24 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Abdul Qadeer made changes - 06/Aug/09 05:49 PM
Attachment Hadoop-4012-version11.patch [ 12415766 ]
Abdul Qadeer made changes - 06/Aug/09 07:41 PM
Link This issue blocks MAPREDUCE-830 [ MAPREDUCE-830 ]
Abdul Qadeer made changes - 08/Aug/09 11:00 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 08/Aug/09 11:02 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Chris Douglas made changes - 30/Aug/09 09:47 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Chris Douglas made changes - 02/Sep/09 12:06 AM
Attachment C4012-12.patch [ 12418320 ]
Abdul Qadeer made changes - 03/Sep/09 09:04 AM
Attachment C4012-13.patch [ 12418490 ]
Abdul Qadeer made changes - 05/Sep/09 06:16 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Chris Douglas made changes - 08/Sep/09 05:13 AM
Attachment C4012-14.patch [ 12418879 ]
Chris Douglas made changes - 08/Sep/09 05:13 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Chris Douglas made changes - 08/Sep/09 05:13 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Chris Douglas made changes - 10/Sep/09 08:52 PM
Resolution Fixed [ 1 ]
Hadoop Flags [Reviewed]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Robert Chansler made changes - 08/Oct/09 04:04 AM
Release Note "Support to process Hadoop split BZip2 files." changed to "BZip2 files can now be split."