Hadoop Map/Reduce
  MAPREDUCE-1487

io.DataInputBuffer.getLength() semantic wrong/confused


    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.1, 0.20.2, 0.21.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Environment:



      I was trying to use Google Protocol Buffers as a value type on Hadoop.
      When I used it in a reducer, the parser always failed,
      while it worked fine with a plain InputStream reader or in a mapper.

      The reason is that the reducer interface in Task.java hands the parser a buffer larger than the actual encoded record. The parser does not stop until it reaches
      the end of the buffer, so it parses some junk bytes.

      The root cause is in hadoop.io.DataInputBuffer.java.

      In 0.20.1, DataInputBuffer.java line 47:

      public void reset(byte[] input, int start, int length) {
        this.buf = input;
        this.count = start + length;
        this.mark = start;
        this.pos = start;
      }

      public byte[] getData() {
        return buf;
      }

      public int getPosition() {
        return pos;
      }

      public int getLength() {
        return count;
      }

      The logic above seems to assume that getLength() returns the total *capacity* of the buffer (count, i.e. start + length), not the actual content length. Yet later code
      seems to assume that "length" is the actual content length, i.e. end - start:

      /** Resets the data that the buffer reads. */
      public void reset(byte[] input, int start, int length) {
        buffer.reset(input, start, length);
      }

      That is, if you call reset(getData(), getPosition(), getLength()) on the same buffer again and again, the "length" keeps increasing without bound (whenever the position is non-zero).
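
      The growth described above can be reproduced without a Hadoop dependency. The sketch below uses a hypothetical stand-in class, MiniBuffer, that copies only the four methods quoted in this report (the mark field is omitted); the class names and the helper lengthsAfterRepeatedReset are assumptions made for illustration, not part of Hadoop.

```java
import java.util.Arrays;

public class ResetGrowthDemo {

    // Hypothetical stand-in mirroring the methods quoted in this report.
    // It is NOT the real org.apache.hadoop.io.DataInputBuffer.
    static class MiniBuffer {
        private byte[] buf;
        private int count;
        private int pos;

        void reset(byte[] input, int start, int length) {
            this.buf = input;
            this.count = start + length; // end offset, not content length
            this.pos = start;
        }

        byte[] getData() { return buf; }
        int getPosition() { return pos; }
        int getLength() { return count; }
    }

    /** Returns getLength() after the first reset and after each re-reset. */
    static int[] lengthsAfterRepeatedReset(int start, int length, int rounds) {
        MiniBuffer b = new MiniBuffer();
        b.reset(new byte[64], start, length);
        int[] seen = new int[rounds + 1];
        seen[0] = b.getLength();
        for (int i = 1; i <= rounds; i++) {
            // Feed the buffer's own accessors back into reset().
            b.reset(b.getData(), b.getPosition(), b.getLength());
            seen[i] = b.getLength();
        }
        return seen;
    }

    public static void main(String[] args) {
        // A 6-byte record starting at offset 4.
        System.out.println(Arrays.toString(lengthsAfterRepeatedReset(4, 6, 3)));
        // prints [10, 14, 18, 22]
    }
}
```

      With start = 0 the sequence stays constant, which may be why the problem goes unnoticed when a record sits at the beginning of its backing array.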

      This semantic confusion shows up in many places, at least in IFile.java and in Task.java, where it caused the original issue.
      Around line 980 of Task.java, we see:
      valueIn.reset(nextValueBytes.getData(), nextValueBytes.getPosition(), nextValueBytes.getLength())
      If the position above is non-zero, this sets up a buffer that is too long, causing the reported issue.

      Changing Task.java, as a hack, to
      valueIn.reset(nextValueBytes.getData(), nextValueBytes.getPosition(), nextValueBytes.getLength() - nextValueBytes.getPosition());

      fixed the issue, but the semantics of DataInputBuffer should be fixed and streamlined.
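
      To make the difference between the two Task.java calls concrete, here is a minimal sketch. MiniBuffer is again a hypothetical stand-in that copies the quoted methods, and remaining() is an assumed model of a parser that consumes every byte up to the buffer end; neither is Hadoop or Protocol Buffers code.

```java
public class ValueWindowDemo {

    // Hypothetical stand-in mirroring the methods quoted in this report.
    // It is NOT the real org.apache.hadoop.io.DataInputBuffer.
    static class MiniBuffer {
        private byte[] buf;
        private int count;
        private int pos;

        void reset(byte[] input, int start, int length) {
            this.buf = input;
            this.count = start + length;
            this.pos = start;
        }

        byte[] getData() { return buf; }
        int getPosition() { return pos; }
        int getLength() { return count; }
    }

    /** Bytes a read-to-end parser would consume (assumed parser model). */
    static int remaining(MiniBuffer in) {
        return in.getLength() - in.getPosition();
    }

    /** The call as quoted from Task.java: getLength() passed as the length. */
    static int seenWithTaskCall(int start, int length) {
        MiniBuffer nextValueBytes = new MiniBuffer();
        nextValueBytes.reset(new byte[64], start, length);
        MiniBuffer valueIn = new MiniBuffer();
        valueIn.reset(nextValueBytes.getData(),
                      nextValueBytes.getPosition(),
                      nextValueBytes.getLength());
        return remaining(valueIn);
    }

    /** The workaround from this report: subtract the position first. */
    static int seenWithWorkaround(int start, int length) {
        MiniBuffer nextValueBytes = new MiniBuffer();
        nextValueBytes.reset(new byte[64], start, length);
        MiniBuffer valueIn = new MiniBuffer();
        valueIn.reset(nextValueBytes.getData(),
                      nextValueBytes.getPosition(),
                      nextValueBytes.getLength() - nextValueBytes.getPosition());
        return remaining(valueIn);
    }

    public static void main(String[] args) {
        // A 5-byte record stored at offset 8 of a shared array.
        System.out.println("task call:  " + seenWithTaskCall(8, 5));   // prints 13
        System.out.println("workaround: " + seenWithWorkaround(8, 5)); // prints 5
    }
}
```

      In this model the unpatched call exposes 13 bytes for a 5-byte record, so a parser that reads to the end would consume 8 bytes past the record.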


        Pavel Podgoretsky made changes -
        Affects Version/s 0.21.0 [ 12314045 ]
        Sarthak made changes -
        Field Original Value New Value
        Affects Version/s 0.20.2 [ 12314205 ]
        Yang Yang created issue -


          • Assignee:
            Yang Yang
          • Votes:
            5
          • Watchers:
            4


            • Created: