Accumulo / ACCUMULO-2353

Test improvements to java.io.InputStream.skip() for possible Hadoop patch


Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Java 6 Update 45 or later, Hadoop 2.2.0

    Description

      At some point (early Java 7, I think, then backported around Java 6 Update 45), the java.io.InputStream.skip() method was changed from using a byte[512] buffer to a byte[2048] buffer. The difference can be seen in java.util.zip.DeflaterInputStream, whose skip() has not been updated:

          public long skip(long n) throws IOException {
              if (n < 0) {
                  throw new IllegalArgumentException("negative skip length");
              }
              ensureOpen();
      
              // Skip bytes by repeatedly decompressing small blocks
              if (rbuf.length < 512)
                  rbuf = new byte[512];
      
              int total = (int)Math.min(n, Integer.MAX_VALUE);
              long cnt = 0;
              while (total > 0) {
                  // Read a small block of uncompressed bytes
                  int len = read(rbuf, 0, (total <= rbuf.length ? total : rbuf.length));
      
                  if (len < 0) {
                      break;
                  }
                  cnt += len;
                  total -= len;
              }
              return cnt;
          }
      

      and java.io.InputStream.skip() in Java 6 Update 45:

          // MAX_SKIP_BUFFER_SIZE is used to determine the maximum buffer size to
          // use when skipping.
          private static final int MAX_SKIP_BUFFER_SIZE = 2048;

          public long skip(long n) throws IOException {
              long remaining = n;
              int nr;

              if (n <= 0) {
                  return 0;
              }

              int size = (int) Math.min(MAX_SKIP_BUFFER_SIZE, remaining);
              byte[] skipBuffer = new byte[size];

              while (remaining > 0) {
                  nr = read(skipBuffer, 0, (int) Math.min(size, remaining));

                  if (nr < 0) {
                      break;
                  }
                  remaining -= nr;
              }

              return n - remaining;
          }
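
      The practical effect of the larger buffer is fewer decompression round trips per skip: each loop iteration reads at most one buffer's worth of bytes, so a skip of n bytes costs at least ceil(n / bufferSize) read() calls. As a back-of-the-envelope illustration (not part of the original report):

          // Hypothetical illustration: the skip buffer size bounds how many
          // read() calls (and thus decompression round trips) a skip costs.
          public class SkipBufferMath {
              public static void main(String[] args) {
                  long n = 1L << 20; // skip 1 MiB
                  for (int buf : new int[] {512, 2048}) {
                      long reads = (n + buf - 1) / buf; // ceil(n / buf)
                      System.out.printf("buffer=%d -> at least %d read() calls%n", buf, reads);
                  }
              }
          }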
      

      In sample tests I saw about a 20% improvement in skip() when seeking towards the end of a locally cached compressed file (a rough harness for reproducing that kind of measurement is sketched after the code below). Looking at DecompressorStream in Hadoop, its skip() method is a near copy of the old InputStream implementation:

        private byte[] skipBytes = new byte[512];
        @Override
        public long skip(long n) throws IOException {
          // Sanity checks
          if (n < 0) {
            throw new IllegalArgumentException("negative skip length");
          }
          checkStream();
          
          // Read 'n' bytes
          int skipped = 0;
          while (skipped < n) {
            int len = Math.min(((int)n - skipped), skipBytes.length);
            len = read(skipBytes, 0, len);
            if (len == -1) {
              eof = true;
              break;
            }
            skipped += len;
          }
          return skipped;
        }
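
      For reference, a harness along the lines of the "sample tests" above might look like the following. The temp file, data size, skip distance, and the skipWithBuffer() helper are assumptions for illustration, not code from this ticket:

          import java.io.*;
          import java.util.zip.*;

          public class SkipBenchmark {

              // Mirror of the InputStream.skip() loop, but with a caller-chosen
              // buffer size so 512 and 2048 can be compared directly.
              static long skipWithBuffer(InputStream in, long n, int bufSize) throws IOException {
                  byte[] buf = new byte[(int) Math.min(bufSize, n)];
                  long remaining = n;
                  while (remaining > 0) {
                      int nr = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                      if (nr < 0) {
                          break;
                      }
                      remaining -= nr;
                  }
                  return n - remaining;
              }

              public static void main(String[] args) throws IOException {
                  File f = File.createTempFile("skip-bench", ".gz");
                  f.deleteOnExit();

                  // Write 64 MiB of (highly compressible) data to a local gzip file.
                  try (OutputStream out = new GZIPOutputStream(new FileOutputStream(f))) {
                      byte[] chunk = new byte[8192];
                      for (int i = 0; i < 8192; i++) {
                          out.write(chunk);
                      }
                  }

                  long toSkip = 8192L * 8192 - 4096; // land near the end of the stream
                  for (int bufSize : new int[] {512, 2048}) {
                      long start = System.nanoTime();
                      try (InputStream in = new GZIPInputStream(new FileInputStream(f))) {
                          skipWithBuffer(in, toSkip, bufSize);
                      }
                      System.out.printf("buffer=%d: %.1f ms%n", bufSize, (System.nanoTime() - start) / 1e6);
                  }
              }
          }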
      

      This task is to evaluate changes to DecompressorStream, with a possible patch to Hadoop and a possible enhancement request to Oracle to port the InputStream.skip() changes to DeflaterInputStream.skip().
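
      A minimal sketch of what such a DecompressorStream patch might look like, assuming the class context shown above (checkStream(), read(), and the eof field) and borrowing the lazily sized, 2048-capped buffer strategy from the newer InputStream.skip(). This is an illustration of the proposed direction, not a committed Hadoop change:

          private static final int MAX_SKIP_BUFFER_SIZE = 2048;
          private byte[] skipBytes = new byte[0]; // grown on demand, capped at 2048

          @Override
          public long skip(long n) throws IOException {
              // Sanity checks
              if (n < 0) {
                  throw new IllegalArgumentException("negative skip length");
              }
              checkStream();

              // Size the scratch buffer to the skip request, up to the cap, so a
              // large skip needs roughly a quarter as many read() calls as the
              // fixed 512-byte buffer.
              int size = (int) Math.min(MAX_SKIP_BUFFER_SIZE, n);
              if (skipBytes.length < size) {
                  skipBytes = new byte[size];
              }

              long skipped = 0;
              while (skipped < n) {
                  int len = (int) Math.min(n - skipped, skipBytes.length);
                  len = read(skipBytes, 0, len);
                  if (len == -1) {
                      eof = true;
                      break;
                  }
                  skipped += len;
              }
              return skipped;
          }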


            People

              Assignee: Unassigned
              Reporter: Dave Marion (dlmarion)
