[HADOOP-788] Streaming should use a subclass of TextInputFormat for reading text inputs. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.11.0
Component/s: None
Labels: None

Description

Currently, streaming uses a lot of custom code for processing text inputs.

I propose:

1. Move class LineRecordReader out of TextInputFormat.
2. Make class StreamLineRecordReader extend LineRecordReader.
3. StreamLineRecordReader uses LineRecordReader.next to read the lines and splits them on tab to generate a Text/Text key/value pair.
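
The splitting in step 3 can be sketched in plain Java as below. This is only an illustration and not the code from the attached patch: the class `TabSplitSketch` and the method `splitOnTab` are hypothetical names, plain `String` stands in for Hadoop's `Text`, and two behaviours are assumptions the proposal does not spell out (the split happens at the first tab only, and a line without a tab becomes the key with an empty value).

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

public class TabSplitSketch {

    // Splits one already-read line into a key/value pair.
    // Assumption: only the first tab separates key from value, so any
    // further tabs stay inside the value.
    // Assumption: a line with no tab is treated as a key with an empty value.
    static Map.Entry<String, String> splitOnTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleImmutableEntry<>(line, "");
        }
        return new SimpleImmutableEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        // In the proposal the line would come from LineRecordReader.next;
        // here it is a hard-coded sample.
        Map.Entry<String, String> kv = splitOnTab("user42\tclicked\thome");
        System.out.println("key=" + kv.getKey());
        System.out.println("value=" + kv.getValue());
    }
}
```

Keeping the split separate from the line reading is what would let the subclass inherit everything else, including decompression, from the base reader.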

This will remove a lot of code from streaming and give it automatic support for the compression codecs that the "base" part of Hadoop enjoys. In particular, if the native zlib code is used, it will remove the 2 GB limit on compressed files.

Attachments

Hadoop-788.patch (20 kB), uploaded 16/Jan/07 11:40 by Sanjay Dahiya

Issue Links

depends upon: HADOOP-619 Unify Map-Reduce and Streaming to take the same globbed input specification (Closed)

People

Assignee: Sanjay Dahiya

Reporter: Owen O'Malley

Votes: 0

Watchers: 0

Dates

Created: 06/Dec/06 21:13

Updated: 02/May/13 02:29

Resolved: 31/Jan/07 22:26