Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Hadoop Flags: Reviewed
Description
While implementing a custom subclass of FileInputFormat we ran into the effect that a large gzipped input file would be processed several times: a file of nearly 1 GiB was processed in its entirety around 36 times, producing garbage results and taking up far more CPU time than needed.
It took a while to figure out, and what we found is that the default implementation of the isSplitable method in org.apache.hadoop.mapreduce.lib.input.FileInputFormat is simply "return true;".
This is a very unsafe default, and it contradicts the JavaDoc of the method, which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be." The actual implementation effectively answers: "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec."
For our situation (where the input is always gzipped) we took the easy way out and simply overrode isSplitable in our class to "return false;", as in the sketch below.
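A minimal sketch of that workaround, for illustration. The class name GzipSafeInputFormat is hypothetical, and the record reader is stubbed out since it is irrelevant to the splitting issue:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical subclass: never split, so each gzipped input file
// is handed to exactly one mapper and is read exactly once.
public class GzipSafeInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Record-reader construction omitted; not relevant to the bug.
        throw new UnsupportedOperationException("sketch only");
    }
}
{code}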
Now there are essentially three ways I can think of to fix this (in order of what I would find preferable):
- Make the method inspect the compression codec used for the file (i.e. move the codec-aware implementation from TextInputFormat up into FileInputFormat); see the sketch after this list. This would make the method do what the JavaDoc describes.
- "Force" developers to think about it by making this method abstract.
- Use a "safe" default (i.e. return false).
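A minimal sketch of the first option, assuming the SplittableCompressionCodec interface is available (introduced for splittable bzip2 support). This mirrors the codec-aware check TextInputFormat performs; moving logic of this shape into FileInputFormat itself would make the default match the JavaDoc. The class name is hypothetical:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical base class showing a codec-aware isSplitable.
public abstract class CodecAwareFileInputFormat<K, V> extends FileInputFormat<K, V> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        // No codec matched: the file is uncompressed and safe to split.
        if (codec == null) {
            return true;
        }
        // Compressed files are splittable only if the codec supports it
        // (e.g. bzip2); gzip does not, so gzipped files get a single split.
        return codec instanceof SplittableCompressionCodec;
    }
}
{code}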
Issue Links
- relates to HADOOP-6901: "Parsing large compressed files with HADOOP-1722 spawns multiple mappers per file" (Resolved)