  1. Hadoop Common
  2. HADOOP-788

Streaming should use a subclass of TextInputFormat for reading text inputs.

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: None
    • Labels: None

    Description

      Currently, streaming uses a lot of custom code for processing text inputs.

      I propose:

      1. Move class LineRecordReader out of TextInputFormat.
      2. Make class StreamLineRecordReader extend LineRecordReader.
      3. StreamLineRecordReader uses LineRecordReader.next to read each line and splits it on the tab character to generate a Text/Text key/value pair.
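
      The three steps above can be sketched in plain Java as follows. This is a minimal illustration, not the attached patch: SketchLineReader and SketchStreamLineReader are hypothetical stand-ins for LineRecordReader and StreamLineRecordReader, they use String instead of Text, and they do not reproduce the real next signature. Splitting at the first tab, and treating a line with no tab as a key with an empty value, are assumptions of this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for LineRecordReader: yields one line per call.
class SketchLineReader {
    private final BufferedReader in;

    SketchLineReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    // Returns the next line without its terminator, or null at end of input.
    String nextLine() throws IOException {
        return in.readLine();
    }
}

// Hypothetical stand-in for StreamLineRecordReader: inherits the line
// reading and adds only the tab split.
class SketchStreamLineReader extends SketchLineReader {
    SketchStreamLineReader(Reader source) {
        super(source);
    }

    // Returns {key, value}, or null at end of input.
    String[] nextKeyValue() throws IOException {
        String line = nextLine();
        if (line == null) {
            return null;
        }
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Assumption: a line without a tab is all key, with an empty value.
            return new String[] { line, "" };
        }
        // Assumption: split at the first tab; later tabs stay in the value.
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }
}

class TabSplitDemo {
    public static void main(String[] args) throws IOException {
        SketchStreamLineReader reader = new SketchStreamLineReader(
                new StringReader("k1\tv1\nk2\tv2a\tv2b\nlonely\n"));
        String[] kv;
        while ((kv = reader.nextKeyValue()) != null) {
            System.out.println("key=" + kv[0] + " value=" + kv[1]);
        }
    }
}
```

      The subclass contains no I/O code of its own, which is the point of step 2: anything the parent learns to read, the subclass can split.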

      This will remove a lot of code from streaming and give it automatic support for the compression codecs that the "base" part of Hadoop enjoys. In particular, if the native zlib code is used, it will remove the 2 GB limit on compressed files.
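
      Why the subclass would pick up compression support for free can be illustrated with a small, self-contained sketch. It uses java.util.zip from the JDK rather than Hadoop's codec classes, chooses the decompressor from a ".gz" file extension as an assumption, and all class names are hypothetical. It does not demonstrate anything about the 2 GB limit.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical base reader: the only class that knows about compression.
class CodecAwareLineReader {
    private final BufferedReader in;

    CodecAwareLineReader(String fileName, InputStream raw) throws IOException {
        // Assumption for this sketch: choose the decompressor by file extension.
        InputStream data = fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        this.in = new BufferedReader(new InputStreamReader(data, StandardCharsets.UTF_8));
    }

    // Returns the next decompressed line, or null at end of input.
    String nextLine() throws IOException {
        return in.readLine();
    }
}

// Hypothetical streaming reader: contains no compression code at all.
class CodecAwareTabReader extends CodecAwareLineReader {
    CodecAwareTabReader(String fileName, InputStream raw) throws IOException {
        super(fileName, raw);
    }

    // Returns {key, value} split at the first tab, or null at end of input.
    String[] nextKeyValue() throws IOException {
        String line = nextLine();
        if (line == null) {
            return null;
        }
        int tab = line.indexOf('\t');
        return tab < 0
                ? new String[] { line, "" }
                : new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }
}

class CodecDemo {
    // Helper that gzips a string in memory so the demo needs no files.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("k1\tv1\nk2\tv2\n");
        CodecAwareTabReader reader = new CodecAwareTabReader(
                "part-00000.gz", new ByteArrayInputStream(compressed));
        String[] kv;
        while ((kv = reader.nextKeyValue()) != null) {
            System.out.println("key=" + kv[0] + " value=" + kv[1]);
        }
    }
}
```

      Adding another codec in this sketch would touch only the base class constructor; the tab-splitting subclass stays unchanged.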

      Attachments

        1. Hadoop-788.patch
          20 kB
          Sanjay Dahiya

            People

              Assignee: Sanjay Dahiya (sanjay.dahiya)
              Reporter: Owen O'Malley (omalley)
