Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
In many cases it is convenient to store text files in DFS as gzip-compressed files.
It would be good to have built-in support for processing these files in a MapReduce job.
The getSplits implementation should return a single split per input file, ignoring the numSplits parameter, since a gzip stream cannot be decompressed starting at an arbitrary offset.
One can probably subclass InputFormatBase: the getSplits method can simply call listPaths()
and then construct and return a single split per path returned, as sketched below.
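A rough sketch of that approach might look like the following (the class name GzipTextInputFormat is hypothetical, and the getSplits signature, listPaths helper, and fs.getLength call are assumptions based on the InputFormatBase API of the time):

public class GzipTextInputFormat extends InputFormatBase {
  public FileSplit[] getSplits(FileSystem fs, JobConf job, int numSplits)
    throws IOException {
    // One split per file; numSplits is ignored because a gzip stream
    // cannot be decompressed starting at an arbitrary offset.
    Path[] paths = listPaths(fs, job);
    FileSplit[] splits = new FileSplit[paths.length];
    for (int i = 0; i < paths.length; i++) {
      // The split covers the whole file, from offset 0 to its full length.
      splits[i] = new FileSplit(paths[i], 0, fs.getLength(paths[i]));
    }
    return splits;
  }
  // getRecordReader would be implemented as shown below.
}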
The code for reading would look something like (courtesy of Vijay Murthy):
public RecordReader getRecordReader(FileSystem fs, FileSplit split,
                                    JobConf job, Reporter reporter)
  throws IOException {
  // Wrap the raw DFS stream in a java.util.zip.GZIPInputStream and
  // read it line by line.
  final BufferedReader in =
    new BufferedReader(new InputStreamReader
      (new GZIPInputStream(fs.open(split.getPath()))));
  return new RecordReader() {
      long position;
      public synchronized boolean next(Writable key, Writable value)
        throws IOException {
        String line = in.readLine();
        if (line == null)
          return false;                    // end of the compressed stream
        // Assumes LongWritable keys and UTF8 values, as with plain text input.
        ((LongWritable)key).set(position);
        ((UTF8)value).set(line);
        position += line.length() + 1;     // approximate uncompressed position
        return true;
      }
      public synchronized long getPos() throws IOException
      { return position; }
      public synchronized void close() throws IOException
      { in.close(); } };
}
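A job would then select this input format in its configuration, e.g. (using the hypothetical class name from the sketch above):

  job.setInputFormat(GzipTextInputFormat.class);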
Issue Links
- duplicates: HADOOP-474 "support compressed text files as input and output" (Closed)