Hadoop Map/Reduce / MAPREDUCE-5511

Multifilewc and the mapred.* API: Is the use of getPos() valid?

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0, 1.2.0
    • Fix Version/s: None
    • Component/s: examples
    • Labels: None

    Description

      The MultiFileWordCount class in the Hadoop examples library uses a record reader that switches between files. This behaviour can break RawLocalFileSystem in a concurrent environment because of the way buffering works: in RawLocalFileSystem, switching between streams leaves the inner stream temporarily null, and the getPos() implementation in the custom RecordReader for MultiFileWordCount calls that inner stream.

      There are basically 2 ways to handle this:

      1) Wrap the getPos() implementation in the object returned by open() in the RawLocalFileSystem so that it caches the value of getPos() every time it is called. Calls to getPos() can then return a valid long even if the underlying stream is null. OR

      2) Update the RecordReader in multifilewc so that it does not rely on the inner input stream, and instead caches the position (or returns 0) when the stream cannot return a valid value.
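
      A minimal sketch of the caching idea behind option 1, written without any Hadoop dependencies. All names here (PositionSource, SwitchingSource, CachingPositionSource, GetPosDemo) are hypothetical and are not part of the Hadoop API; the "missing inner stream" window is modelled with an IOException, whereas the real failure may surface differently (for example as a NullPointerException, which the wrapper also tolerates).

```java
import java.io.IOException;

// Hypothetical stand-in for a stream that can report its position.
// This is NOT Hadoop's API; it only mirrors the shape of getPos().
interface PositionSource {
    long getPos() throws IOException;
}

// Models the failure described above: while the reader switches files,
// the inner stream is briefly absent and getPos() cannot answer.
class SwitchingSource implements PositionSource {
    private Long innerPos = 0L; // null models the "no inner stream" window

    void advanceTo(long pos) { innerPos = pos; }

    void detachInner() { innerPos = null; }

    @Override
    public long getPos() throws IOException {
        if (innerPos == null) {
            throw new IOException("inner stream is not available");
        }
        return innerPos;
    }
}

// Option 1 (sketch): wrap the source and remember the last good position.
class CachingPositionSource implements PositionSource {
    private final PositionSource delegate;
    private long lastKnownPos = 0L; // returned when the delegate cannot answer

    CachingPositionSource(PositionSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized long getPos() {
        try {
            lastKnownPos = delegate.getPos();
        } catch (IOException | RuntimeException e) {
            // Fall back to the cached value instead of propagating the failure.
        }
        return lastKnownPos;
    }
}

class GetPosDemo {
    public static void main(String[] args) {
        SwitchingSource raw = new SwitchingSource();
        CachingPositionSource safe = new CachingPositionSource(raw);

        raw.advanceTo(128);
        System.out.println(safe.getPos()); // position read from the live stream

        raw.detachInner();
        System.out.println(safe.getPos()); // cached position, no exception
    }
}
```

      Option 2 would apply the same fallback inside the RecordReader's own getPos() rather than in the file system wrapper. In this sketch the synchronized keyword only protects the cached value; it does not make the underlying stream itself safe for concurrent use.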

      The final question here is: is the RecordReader for MultiFileWordCount doing the right thing, or is it breaking the contract of getPos()? And what should getPos() return if the underlying stream has already been consumed?

          People

            Assignee: Unassigned
            Reporter: jay vyas (jayunit100)
            Votes: 0
            Watchers: 1
