MAPREDUCE-5572 (Hadoop Map/Reduce)

Provide alternative logic for getPos() implementation in custom RecordReader of mapred implementation of MultiFileWordCount

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0, 1.1.1, 1.2.0, 1.1.3, 1.2.1, 1.2.2
    • Fix Version/s: None
    • Component/s: examples
    • Labels: None

    Description

      The custom RecordReader class in MultiFileWordCount (MultiFileLineRecordReader) has been replaced in newer examples with a better implementation based on CombineFileInputFormat, which does not have this bug. The bug still exists, however, in the 1.x versions of MultiFileWordCount, which rely on the mapred API.

      The older MultiFileWordCount implementation defines getPos() as follows:

      long currentOffset = currentStream == null ? 0 : currentStream.getPos();
      ...

      The null check is meant to prevent errors when the underlying stream is null, but it is not guaranteed to work. RawLocalFileSystem, for example, currently closes the underlying file stream once it is consumed. In that case currentStream itself can be non-null, so the check passes, and the call to currentStream.getPos() then throws a NullPointerException when it tries to access the null underlying stream.
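
      The failure mode can be illustrated without Hadoop. The sketch below is an assumption-laden model, not Hadoop code: SelfClosingStream and NullCheckDemo are hypothetical names, and the wrapper only imitates a stream that drops its underlying stream once the data is consumed. It shows that a null check on the wrapper does not protect a getPos() call that dereferences the inner stream.

```java
import java.io.ByteArrayInputStream;

// Hypothetical stand-in for a file-system stream wrapper; NOT a Hadoop class.
class SelfClosingStream {
    private ByteArrayInputStream in; // underlying stream, dropped once consumed
    private final int length;        // total number of bytes in the stream

    SelfClosingStream(byte[] data) {
        this.in = new ByteArrayInputStream(data);
        this.length = data.length;
    }

    // Reads one byte; on end of stream, drops the underlying stream
    // (this models "close the underlying stream once it is consumed").
    int read() {
        int b = in.read();
        if (b < 0) {
            in = null;
        }
        return b;
    }

    // Position derived from the underlying stream, so it fails once
    // that stream has been dropped.
    long getPos() {
        return length - in.available();
    }
}

class NullCheckDemo {
    // Same shape as the null check quoted in the issue.
    static long unguardedGetPos(SelfClosingStream currentStream) {
        return currentStream == null ? 0 : currentStream.getPos();
    }

    public static void main(String[] args) {
        SelfClosingStream stream = new SelfClosingStream(new byte[] {1, 2, 3});
        while (stream.read() >= 0) {
            // consume the whole stream
        }
        try {
            System.out.println("pos = " + unguardedGetPos(stream));
        } catch (NullPointerException e) {
            System.out.println("NullPointerException despite the null check");
        }
    }
}
```

      The wrapper reference stays non-null for the whole run, which is why the ternary check never takes its fallback branch in this model.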

      The failure only shows up when the reader runs under the MapTask class, which is relevant only to the mapred.* API and which calls getPos() twice in succession, once before and once after reading a record.

      This custom record reader should be guarded, or else eliminated, since it assumes something that is not in the FileSystem contract: that getPos() will always return an integral value.
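
      One possible shape for the alternative logic is for the reader to count the bytes it consumes itself, so that getPos() never has to ask the stream. The sketch below is only an illustration under that assumption: CountingLineReader and CallerDemo are hypothetical names, it does not use the Hadoop RecordReader interface, and it handles single-byte (ASCII) input only. The caller loop imitates the pattern described above, with getPos() called before and after each record.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical reader; NOT the Hadoop RecordReader interface.
class CountingLineReader {
    private ByteArrayInputStream current; // dropped once fully consumed
    private long consumed = 0;            // bytes read so far, tracked locally

    CountingLineReader(byte[] data) {
        this.current = new ByteArrayInputStream(data);
    }

    // Returns the next line without its trailing '\n', or null at end of input.
    String nextLine() {
        if (current == null) {
            return null;
        }
        StringBuilder line = new StringBuilder();
        boolean sawAny = false;
        int b;
        while ((b = current.read()) >= 0) {
            consumed++;
            sawAny = true;
            if (b == '\n') {
                return line.toString();
            }
            line.append((char) b); // ASCII-only assumption
        }
        current = null; // stream exhausted: drop it, position is still known
        return sawAny ? line.toString() : null;
    }

    // Never touches the stream, so it is safe after the stream is gone.
    long getPos() {
        return consumed;
    }
}

class CallerDemo {
    public static void main(String[] args) {
        byte[] data = "a b\nc\n".getBytes(StandardCharsets.US_ASCII);
        CountingLineReader reader = new CountingLineReader(data);
        while (true) {
            long before = reader.getPos(); // position before the record
            String line = reader.nextLine();
            long after = reader.getPos();  // position after the record
            if (line == null) {
                break;
            }
            System.out.println(line + " [" + before + ", " + after + ")");
        }
    }
}
```

      Keeping the counter in the reader trades a small amount of bookkeeping for independence from whatever the file system does to the stream after it is consumed.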

          People

            Assignee: Unassigned
            Reporter: jay vyas (jayunit100)