Hadoop Common / HADOOP-1489

Input file gets truncated for text files with \r\n

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.14.0
    • Component/s: io
    • Labels: None

    Description

      When the input file contains \r\n line endings, LineRecordReader uses mark()/reset() to read one byte ahead and check whether \r is followed by \n. This probably causes the BufferedInputStream to issue a small read request (e.g., 127 bytes). The ChecksumFileSystem.FSInputChecker.read() code

         public int read(byte b[], int off, int len) throws IOException {
           // make sure that it ends at a checksum boundary
           long curPos = getPos();
           long endPos = len+curPos/bytesPerSum*bytesPerSum;
           return readBuffer(b, off, (int)(endPos-curPos));
         }
      

      tries to truncate "len" to a checksum boundary. For DFS, bytesPerSum is 512. So for small reads, the truncated length becomes negative (i.e., endPos - curPos is < 0); in the stacks below, a 127-byte request at pos=45223932 turns into len=-381. The underlying DFS read returns 0 when the length is negative. However, readBuffer changes it to -1, assuming end-of-file has been reached. Effectively, the rest of the input file never gets read. In my case, only 8MB of a 52MB file was actually read. Two sample stacks are appended.
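
To make the arithmetic concrete, here is a minimal sketch that evaluates the same expression as the read() quoted above, outside of Hadoop. The class and method names (ChecksumBoundaryMath, lengthAsQuoted) are made up for illustration; only the expression itself and the input values (taken from the two sample stacks) come from this report.

```java
final class ChecksumBoundaryMath {
    /**
     * Evaluates the length handed to readBuffer, using the expression
     * exactly as quoted in the report (same operator precedence).
     * Hypothetical helper, not part of Hadoop.
     */
    static long lengthAsQuoted(long curPos, int len, int bytesPerSum) {
        // Division and multiplication bind tighter than addition, so this is
        // len + (curPos / bytesPerSum) * bytesPerSum.
        long endPos = len + curPos / bytesPerSum * bytesPerSum;
        return endPos - curPos;
    }

    public static void main(String[] args) {
        // Values from the first sample stack: len=127, pos=45223932.
        System.out.println(lengthAsQuoted(45223932L, 127, 512)); // prints -381
        // Values from the second sample stack: len=400, pos=4503.
        System.out.println(lengthAsQuoted(4503L, 400, 512));     // prints -7
    }
}
```

For non-negative positions the result equals len - (curPos % bytesPerSum), so it goes negative whenever the requested length is smaller than the offset into the current checksum chunk.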

      One related issue: if FSInputChecker's read() relies on assumptions (such as len >= bytesPerSum), would it be OK to add a check that throws an exception when an assumption is violated? This assumption is a bit unusual, and as the code changes (both Hadoop's and Java's implementation of BufferedInputStream), it may get violated. Silently dropping a large part of the input is really difficult for people to notice (and debug) once they start dealing with terabytes of data. Also, I suspect the performance impact of such a check would not be noticeable.
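
A rough sketch of what such a check could look like, assuming it is placed right after the length is computed. The names ReadLengthGuard and checkedLength are hypothetical and this is not the change made in the attached patches; it only illustrates failing loudly instead of letting a non-positive length be reported as end-of-file.

```java
import java.io.IOException;

final class ReadLengthGuard {
    /**
     * Computes the truncated length with the expression quoted in the report
     * and throws if a positive request was turned into a non-positive length.
     * Hypothetical helper, not part of Hadoop.
     */
    static int checkedLength(long curPos, int len, int bytesPerSum) throws IOException {
        long endPos = len + curPos / bytesPerSum * bytesPerSum;
        long truncated = endPos - curPos;
        if (len > 0 && truncated <= 0) {
            // Fail loudly rather than silently signalling end-of-file.
            throw new IOException("read length truncated to " + truncated
                + " (pos=" + curPos + ", len=" + len
                + ", bytesPerSum=" + bytesPerSum + ")");
        }
        return (int) truncated;
    }
}
```

With the values from the second sample stack, checkedLength(4503, 400, 512) throws instead of returning -7. The cost is one comparison per read() call.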

      bwolen

      Here are two sample stacks. (I had readBuffer throw when it gets 0 bytes, and had the input checker catch the exception and rethrow both. This way, I capture the values from both the caller and the callee; the callee's part starts with "Caused by".)

      -------------------------------------

      java.lang.RuntimeException: end of read()
      in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=127
      pos=45223932 res=-999999
             at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
             at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
             at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
             at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
             at java.io.FilterInputStream.read(FilterInputStream.java:66)
             at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
             at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
             at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
             at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
             at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
             at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
             at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
      
      
      Caused by: java.lang.RuntimeException: end of read()
      datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=45223932
      len=-381 bytesPerSum=512 eof=false read=0
             at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
             at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
             at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
             ... 11 more
      ---------------
      
      java.lang.RuntimeException: end of read()  in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=400 pos=4503 res=-999999
      	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
      	at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
      	at java.io.FilterInputStream.read(FilterInputStream.java:66)
      	at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
      	at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
      	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
      	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
      	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
      	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
      
      Caused by: java.lang.RuntimeException: end of read()  datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=4503 len=-7 bytesPerSum=512 eof=false read=0
      	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
      	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
      	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
      	... 11 more
      
      

      Attachments

        1. HADOOP-1489.2.patch (7 kB, Bwolen Yang)
        2. HADOOP-1489.patch (9 kB, Bwolen Yang)
        3. slashr33.txt (44 kB, Bwolen Yang)
        4. MRIdentity.java (0.6 kB, Bwolen Yang)

          People

            Assignee: Unassigned
            Reporter: Bwolen Yang (wbwolen)