
File Attachments:
Issue Links:
Blocker
Dependants
This issue depends on:
HADOOP-5213 BZip2CompressionOutputStream NullPointerException
Reference
This issue relates to:
HADOOP-5602 existing Bzip2Codec supported in hadoop 0.19/0.20 skipps the input records when input bzip2 compressed files is made up of concatinating multiple .bz2 files.
MAPREDUCE-477 Support for reading bzip2 compressed file created using concatenation of multiple .bz2 files
dependent
This issue depends upon:
MAPREDUCE-772 Chaging LineRecordReader algo so that it does not need to skip backwards in the stream
| Hadoop Flags: | Reviewed |
| Release Note: | BZip2 files can now be split. |
| Resolution Date: | 10/Sep/09 08:52 PM |
Hadoop assumes that compressed input cannot be split, mainly because many codecs need the whole input stream to decompress successfully. In that case Hadoop prepares only one split per compressed file, with the lower split limit at 0 and the upper limit at the end of the file. As a consequence, each compressed file goes to a single mapper. This works around the codec limitation, but it substantially reduces the parallelism that splitting would otherwise allow.
BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This creates an opportunity: instead of sending one BZip2-compressed file to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block is processed by exactly one mapper and all blocks of the file are eventually processed. (By processing we mean the actual use, in a mapper, of the uncompressed data coming out of the codec.)
We are writing the code to implement this functionality. Although we have used bzip2 as the example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 can easily use the splitting support. The details of these changes will be posted when we submit the code.
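To make the contrast concrete, here is a minimal sketch in plain Java of how split ranges differ for a non-splittable and a splittable file. It is only an illustration under assumptions: `SplitPlanSketch`, `planSplits`, and the `splittable` flag are hypothetical names, not Hadoop APIs, and the real split logic also takes HDFS block sizes and other settings into account.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Hadoop code: plans byte ranges for one input file. */
class SplitPlanSketch {

    /**
     * Returns {start, end} byte ranges with an exclusive end.
     * A non-splittable file always yields the single range [0, fileLength).
     */
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splittable) {
        if (fileLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and splitSize > 0");
        }
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // Whole file goes to one mapper: lower limit 0, upper limit end of file.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        // Splittable input: cut the file into consecutive ranges of at most splitSize bytes.
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(start + splitSize, fileLength)});
        }
        return splits;
    }
}
```

Under these assumptions, a 2500-byte file with a 1000-byte split size yields one range when treated as non-splittable and three ranges (the last one shorter) when treated as splittable.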
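The correctness criterion can be illustrated with a small sketch. It assumes, as a simplification, that each compressed block is identified by the byte offset where it starts and that a block belongs to the split whose range contains that offset. Real bzip2 block boundaries are bit-aligned, and this is not the code from the patch; `BlockOwnershipSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Hadoop code: assigns each compressed block to exactly one split. */
class BlockOwnershipSketch {

    /** Index of the split [i * splitSize, (i + 1) * splitSize) that contains blockStart. */
    static int owningSplit(long blockStart, long splitSize) {
        if (blockStart < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("blockStart must be >= 0 and splitSize > 0");
        }
        return (int) (blockStart / splitSize);
    }

    /**
     * Groups block start offsets by owning split. Because ownership depends only on
     * the start offset, every block lands in exactly one group and none is dropped.
     */
    static List<List<Long>> assignBlocks(long[] blockStarts, long fileLength, long splitSize) {
        if (fileLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and splitSize > 0");
        }
        int numSplits = (int) ((fileLength + splitSize - 1) / splitSize);
        List<List<Long>> perSplit = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            perSplit.add(new ArrayList<>());
        }
        for (long start : blockStarts) {
            if (start >= fileLength) {
                throw new IllegalArgumentException("block starts past end of file: " + start);
            }
            perSplit.get(owningSplit(start, splitSize)).add(start);
        }
        return perSplit;
    }
}
```

In this scheme a block that starts in one split and ends in the next is still processed only by the mapper of the split where it starts, which is one simple way to satisfy the "exactly one mapper" rule.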
made changes - 23/Aug/08 03:40 AM
| Field | Original Value | New Value |
| Description | (description text as above, with hard line breaks after "should be processed" and after "extend Hadoop's") | (description text as above, without those line breaks) |
made changes - 23/Aug/08 03:49 AM
| Link | | This issue depends on HADOOP-4010 [ HADOOP-4010 ] |
made changes - 23/Aug/08 03:50 AM
| Link | This issue depends on HADOOP-4010 [ HADOOP-4010 ] | |
made changes - 23/Aug/08 03:51 AM
| Link | | This issue depends upon HADOOP-4010 [ HADOOP-4010 ] |
made changes - 22/Nov/08 07:58 AM
made changes - 22/Nov/08 07:59 AM
| Fix Version/s | | 0.19.1 [ 12313473 ] |
| Status | Open [ 1 ] | Patch Available [ 10002 ] |
made changes - 24/Nov/08 02:09 AM
| Status | Patch Available [ 10002 ] | In Progress [ 3 ] |
made changes - 24/Nov/08 02:16 AM
made changes - 24/Nov/08 02:16 AM
| Status | In Progress [ 3 ] | Patch Available [ 10002 ] |
made changes - 25/Nov/08 08:28 AM
| Status | Patch Available [ 10002 ] | Open [ 1 ] |
made changes - 26/Nov/08 02:29 AM
made changes - 26/Nov/08 02:30 AM
| Status | Open [ 1 ] | Patch Available [ 10002 ] |
made changes - 28/Nov/08 05:45 AM
made changes - 28/Nov/08 05:45 AM
| Status | Patch Available [ 10002 ] | In Progress [ 3 ] |
made changes - 28/Nov/08 05:46 AM
| Status | In Progress [ 3 ] | Patch Available [ 10002 ] |
made changes - 30/Nov/08 02:46 AM
| Affects Version/s | 0.19.0 [ 12313211 ] | |
| Fix Version/s | 0.19.1 [ 12313473 ] | |
| Fix Version/s | | 0.20.0 [ 12313438 ] |
made changes - 04/Dec/08 04:22 AM
| Fix Version/s | 0.20.0 [ 12313438 ] | |
| Status | Patch Available [ 10002 ] | Open [ 1 ] |
made changes - 31/Mar/09 02:39 PM
made changes - 31/Mar/09 07:35 PM
| Link | | This issue relates to HADOOP-5601 [ HADOOP-5601 ] |
made changes - 02/Apr/09 01:28 PM
| Fix Version/s | | 0.19.2 [ 12313650 ] |
| Affects Version/s | | 0.19.2 [ 12313650 ] |
| Release Note | | Support to process Hadoop split BZip2 files. |
| Status | Open [ 1 ] | Patch Available [ 10002 ] |
made changes - 06/Apr/09 11:09 AM
made changes - 06/Apr/09 11:15 AM
| Status | Patch Available [ 10002 ] | In Progress [ 3 ] |
made changes - 06/Apr/09 11:16 AM
| Status | In Progress [ 3 ] | Patch Available [ 10002 ] |
made changes - 14/May/09 12:27 PM
made changes - 15/May/09 06:22 AM
| Status | Patch Available [ 10002 ] | In Progress [ 3 ] |
made changes - 15/May/09 06:23 AM
| Status | In Progress [ 3 ] | Patch Available [ 10002 ] |
made changes - 22/May/09 01:49 PM
| Fix Version/s | 0.19.2 [ 12313650 ] | |
| Fix Version/s | | 0.21.0 [ 12313563 ] |
made changes - 27/May/09 06:19 AM
| Affects Version/s | 0.19.2 [ 12313650 ] | |
| Affects Version/s | | 0.21.0 [ 12313563 ] |
| Status | Patch Available [ 10002 ] | In Progress [ 3 ] |
made changes - 27/May/09 06:20 AM
made changes - 27/May/09 06:21 AM
| Status | In Progress [ 3 ] | Patch Available [ 10002 ] |
made changes - 01/Jun/09 11:08 AM
made changes - 02/Jun/09 08:55 AM
| Status | Patch Available [ 10002 ] | In Progress [ 3 ] |
made changes - 02/Jun/09 08:55 AM
| Status | In Progress [ 3 ] | Patch Available [ 10002 ] |
made changes - 17/Jun/09 09:48 PM
| Status | Patch Available [ 10002 ] | Open [ 1 ] |
made changes - 04/Aug/09 10:19 AM
| Status | Open [ 1 ] | In Progress [ 3 ] |
made changes - 04/Aug/09 10:23 AM
made changes - 04/Aug/09 10:24 AM
| Status | In Progress [ 3 ] | Patch Available [ 10002 ] |
made changes - 06/Aug/09 05:49 PM
made changes - 08/Aug/09 11:00 AM
| Status | Patch Available [ 10002 ] | In Progress [ 3 ] |
made changes - 08/Aug/09 11:02 AM
| Status | In Progress [ 3 ] | Patch Available [ 10002 ] |
made changes - 30/Aug/09 09:47 PM
| Status | Patch Available [ 10002 ] | Open [ 1 ] |
made changes - 02/Sep/09 12:06 AM
| Attachment | | C4012-12.patch [ 12418320 ] |
made changes - 03/Sep/09 09:04 AM
| Attachment | | C4012-13.patch [ 12418490 ] |
made changes - 05/Sep/09 06:16 AM
| Status | Open [ 1 ] | Patch Available [ 10002 ] |
made changes - 08/Sep/09 05:13 AM
| Attachment | | C4012-14.patch [ 12418879 ] |
made changes - 08/Sep/09 05:13 AM
| Status | Patch Available [ 10002 ] | Open [ 1 ] |
made changes - 08/Sep/09 05:13 AM
| Status | Open [ 1 ] | Patch Available [ 10002 ] |
made changes - 10/Sep/09 08:52 PM
| Resolution | | Fixed [ 1 ] |
| Hadoop Flags | | [Reviewed] |
| Status | Patch Available [ 10002 ] | Resolved [ 5 ] |
made changes - 08/Oct/09 04:04 AM
| Release Note | Support to process Hadoop split BZip2 files. | BZip2 files can now be split. |