Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 4.8
    • Fix Version/s: 4.8, Trunk
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I was playing with on-the-fly checksum verification and this made me stumble upon an issue with BufferedChecksumIndexInput.

      I have some code that skips over a DataInput by reading bytes into /dev/null, eg.

        private static final byte[] SKIP_BUFFER = new byte[1024];
      
        private static void skipBytes(DataInput in, long numBytes) throws IOException {
          assert numBytes >= 0;
          for (long skipped = 0; skipped < numBytes; ) {
            final int toRead = (int) Math.min(numBytes - skipped, SKIP_BUFFER.length);
            in.readBytes(SKIP_BUFFER, 0, toRead);
            skipped += toRead;
          }
        }
      

      It is fine to read into this static buffer, even from multiple threads, since the content that is read doesn't matter here. However, it breaks with BufferedChecksumIndexInput because of the way that it updates the checksum:

        @Override
        public void readBytes(byte[] b, int offset, int len)
          throws IOException {
          main.readBytes(b, offset, len);
          digest.update(b, offset, len);
        }
      

      If you are unlucky enough so that a concurrent call to skipBytes started modifying the content of b before the call to digest.update(b, offset, len) finished, then your checksum will be wrong.

      I think we should make BufferedChecksumIndexInput read into a private buffer first instead of relying on the user-provided buffer.

      1. LUCENE-5583.patch
        7 kB
        Adrien Grand
      2. LUCENE-5583.patch
        7 kB
        Adrien Grand

        Activity

        Hide
        Uwe Schindler added a comment -

        Close issue after release of 4.8.0

        Show
        Uwe Schindler added a comment - Close issue after release of 4.8.0
        Hide
        Adrien Grand added a comment -

        Committed to both branches, thanks Simon, Uwe and Mike!

        Show
        Adrien Grand added a comment - Committed to both branches, thanks Simon, Uwe and Mike!
        Hide
        ASF subversion and git services added a comment -

        Commit 1586232 from jpountz@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1586232 ]

        LUCENE-5583: Add DataInput.skipBytes, ChecksumIndexInput can now seek forward.

        Show
        ASF subversion and git services added a comment - Commit 1586232 from jpountz@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1586232 ] LUCENE-5583 : Add DataInput.skipBytes, ChecksumIndexInput can now seek forward.
        Hide
        Uwe Schindler added a comment - - edited

        Thanks! After hopefully getting some comments on LUCENE-5588, I will branch 4.8!

        Show
        Uwe Schindler added a comment - - edited Thanks! After hopefully getting some comments on LUCENE-5588 , I will branch 4.8!
        Hide
        ASF subversion and git services added a comment -

        Commit 1586231 from jpountz@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1586231 ]

        LUCENE-5583: Add DataInput.skipBytes, ChecksumIndexInput can now seek forward.

        Show
        ASF subversion and git services added a comment - Commit 1586231 from jpountz@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1586231 ] LUCENE-5583 : Add DataInput.skipBytes, ChecksumIndexInput can now seek forward.
        Hide
        Adrien Grand added a comment -

        Good point, I'll fix it!

        Show
        Adrien Grand added a comment - Good point, I'll fix it!
        Hide
        Michael McCandless added a comment -

        +1, patch looks nice.

        Hmm, should ByteArrayDataInput.skipBytes should be marked @Override (and I guess change its param to long)?

        Show
        Michael McCandless added a comment - +1, patch looks nice. Hmm, should ByteArrayDataInput.skipBytes should be marked @Override (and I guess change its param to long)?
        Hide
        Uwe Schindler added a comment -

        Looks fine!

        Show
        Uwe Schindler added a comment - Looks fine!
        Hide
        Adrien Grand added a comment -

        Here is an updated patch that stores the skip buffer as an instance member.

        Show
        Adrien Grand added a comment - Here is an updated patch that stores the skip buffer as an instance member.
        Hide
        Uwe Schindler added a comment -

        Thanks! Static thread locals are horrible for the garbage collector, so I would really don't use them.

        Show
        Uwe Schindler added a comment - Thanks! Static thread locals are horrible for the garbage collector, so I would really don't use them.
        Hide
        Adrien Grand added a comment -

        I didn't think much about the thread-local, I just thought it would be nice to minimize the number of buffer instances. I will make it work like copyBytes for consistency.

        Show
        Adrien Grand added a comment - I didn't think much about the thread-local, I just thought it would be nice to minimize the number of buffer instances. I will make it work like copyBytes for consistency.
        Hide
        Uwe Schindler added a comment -

        An alternative would be to use a non-static buffer (like in IndexOutput#copyBytes()). Because a single IndexInput or Indexoutput cannot be used by more than one thread without cloning, there is no need for a thread local. It is just a problem, if its static for all instances.

        Show
        Uwe Schindler added a comment - An alternative would be to use a non-static buffer (like in IndexOutput#copyBytes() ). Because a single IndexInput or Indexoutput cannot be used by more than one thread without cloning, there is no need for a thread local. It is just a problem, if its static for all instances.
        Hide
        Uwe Schindler added a comment -

        Do we really need the skipBuffer thread local? I would remove this and instantiate the skipbuffer with correct size, max 1024. We had some similar code in copyBytes() of DataOutput/DataInput and removed that because it is stupid with newer JVMs.

        JVMs are very good for those short-lived objects, no need to cache them. They are also quite small. In any case we should only allocate as much bytes we really need (new byte[Math.min(toSkip, 1024)].

        Show
        Uwe Schindler added a comment - Do we really need the skipBuffer thread local? I would remove this and instantiate the skipbuffer with correct size, max 1024. We had some similar code in copyBytes() of DataOutput/DataInput and removed that because it is stupid with newer JVMs. JVMs are very good for those short-lived objects, no need to cache them. They are also quite small. In any case we should only allocate as much bytes we really need ( new byte [Math.min(toSkip, 1024)] .
        Hide
        Adrien Grand added a comment -

        Here is a patch.

        Show
        Adrien Grand added a comment - Here is a patch.
        Hide
        Adrien Grand added a comment -

        Thanks for the feedback, I'll work on a patch.

        Show
        Adrien Grand added a comment - Thanks for the feedback, I'll work on a patch.
        Hide
        Uwe Schindler added a comment -

        +1 on having SkipBytes.

        One other thing: Currently ChecksumIndexInput throws UnsupportedEx if you try to seek, but in a subclass we suddenly allow it again (only forward). Maybe we should move the code up to ChecksumIndexinput and document that seeking only works forward and may be costly (because it has to read)? In that case we implement that with skipBytes(), too. This would allow to use ChecksumIndexinput also for other codec parts while merging.

        Show
        Uwe Schindler added a comment - +1 on having SkipBytes. One other thing: Currently ChecksumIndexInput throws UnsupportedEx if you try to seek, but in a subclass we suddenly allow it again (only forward). Maybe we should move the code up to ChecksumIndexinput and document that seeking only works forward and may be costly (because it has to read)? In that case we implement that with skipBytes(), too. This would allow to use ChecksumIndexinput also for other codec parts while merging.
        Hide
        Adrien Grand added a comment -

        I personally think you shouldn't pass this shared buffer to readBytes() it can break all delegates.

        We already have code that does it for stored fields (the piece of code I pasted in the issue description comes from CompressingStoredFieldsReader.skipBytes`). But indeed, this is the other option: either this way of skipping over bytes using a write-only buffer is considered wrong, or DataInput implementations should never read in the user-provided buffer.

        +1 on having skipBytes on DataInput although I would rather like the default impl to do bulk reads instead of reading bytes one by one.

        Show
        Adrien Grand added a comment - I personally think you shouldn't pass this shared buffer to readBytes() it can break all delegates. We already have code that does it for stored fields (the piece of code I pasted in the issue description comes from CompressingStoredFieldsReader.skipBytes `). But indeed, this is the other option: either this way of skipping over bytes using a write-only buffer is considered wrong, or DataInput implementations should never read in the user-provided buffer. +1 on having skipBytes on DataInput although I would rather like the default impl to do bulk reads instead of reading bytes one by one.
        Hide
        Simon Willnauer added a comment -

        I personally think you shouldn't pass this shared buffer to readBytes() it can break all delegates. I wonder if we want to add a skipBytes method to DataInput that we can impl. efficienly on the lower levels and that just calls readByte() in a loop as a default impl?

        Show
        Simon Willnauer added a comment - I personally think you shouldn't pass this shared buffer to readBytes() it can break all delegates. I wonder if we want to add a skipBytes method to DataInput that we can impl. efficienly on the lower levels and that just calls readByte() in a loop as a default impl?

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development