Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Currently streaming uses a lot of custom code for processing text inputs.
I propose:
1. Move class LineRecordReader out of TextInputFormat.
2. Make class StreamLineRecordReader extend LineRecordReader.
3. StreamLineRecordReader uses LineRecordReader.next to read the lines and splits them on tab to generate a Text/Text key/value pair.
This will remove a lot of code from streaming and give it automatic support for the compression codecs that the "base" part of Hadoop already supports. In particular, if the native zlib code is used, it will remove the 2 GB limit on compressed files.
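The three steps of the proposal can be sketched roughly as follows. This is a minimal, self-contained illustration, not the Hadoop code: SketchLineReader and SketchStreamLineReader are hypothetical stand-ins for LineRecordReader and StreamLineRecordReader, and plain String values stand in for Text. It also assumes that the split happens on the first tab only and that a line with no tab becomes a key with an empty value; the description above does not specify either behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for LineRecordReader: hands out one line per call.
class SketchLineReader {
    private final BufferedReader in;

    SketchLineReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    // Returns the next line without its terminator, or null at end of input.
    String nextLine() throws IOException {
        return in.readLine();
    }
}

// Hypothetical stand-in for StreamLineRecordReader: reuses the parent's
// line reading and adds only the tab split.
class SketchStreamLineReader extends SketchLineReader {
    SketchStreamLineReader(Reader source) {
        super(source);
    }

    // Returns {key, value}, or null at end of input.
    // Assumption: split on the first tab; a line without a tab is all key.
    String[] nextKeyValue() throws IOException {
        String line = nextLine();
        if (line == null) {
            return null;
        }
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }
}

class TabSplitSketch {
    public static void main(String[] args) throws IOException {
        SketchStreamLineReader reader =
            new SketchStreamLineReader(new StringReader("k1\tv1\nno-tab line\nk2\ta\tb\n"));
        String[] kv;
        while ((kv = reader.nextKeyValue()) != null) {
            System.out.println("[" + kv[0] + "] [" + kv[1] + "]");
        }
    }
}
```

Because the subclass only consumes lines from its parent, anything the parent learns to read (for example a decompressing stream) is available to the streaming reader without further changes, which is the benefit the proposal is after.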
Attachments
Issue Links
- depends upon: HADOOP-619 Unify Map-Reduce and Streaming to take the same globbed input specification (Closed)