Lucene - Core
LUCENE-3738

Be consistent about negative vInt/vLong

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today, write/readVInt "allows" a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes).

      However, read/writeVLong fails (trips an assert).

      I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had this assert been in since the beginning, we could have avoided that.

      So, if we can't add that assert today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!).
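
      For illustration, here is a minimal sketch of the 7-bits-per-byte scheme behind writeVInt (a hypothetical helper, not Lucene's actual DataOutput code). Because the encoder shifts with >>>, a negative int drags its sign bit through all five possible bytes, and the same scheme applied to a long needs ten:

      // Hypothetical helper counting how many bytes the classic vInt
      // encoding spends on a value; illustrative only.
      static int vIntByteCount(int i) {
        int count = 1;
        // >>> shifts the sign bit in as payload, so a negative int keeps
        // high bits set until all 5 bytes are consumed.
        while ((i & ~0x7F) != 0) {
          i >>>= 7;
          count++;
        }
        return count;
      }
      // vIntByteCount(1) == 1, vIntByteCount(-1) == 5; the same loop over
      // a long yields 10 bytes for any negative value.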

      1. LUCENE-3738.patch
        0.6 kB
        Uwe Schindler
      2. LUCENE-3738.patch
        1 kB
        Uwe Schindler
      3. LUCENE-3738.patch
        1 kB
        Michael McCandless
      4. LUCENE-3738.patch
        11 kB
        Uwe Schindler
      5. LUCENE-3738.patch
        14 kB
        Uwe Schindler
      6. ByteArrayDataInput.java.patch
        4 kB
        Uwe Schindler
      7. LUCENE-3738-improvement.patch
        7 kB
        Uwe Schindler

        Issue Links

          • is related to: LUCENE-4641

          Activity

          Michael McCandless created issue -
          Uwe Schindler added a comment - edited

          That was my fault, from when the "Schindler VM" instead of the "Hotspot VM" unrolled the loop (a Schindler loop-optimizer bug): I unrolled a maximum of 9 bytes, not 10 (which is wasteful, too).

          We had no negative vLongs until now, so that was no issue.

          Uwe Schindler added a comment -

          So, if we can't add that assert today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!).

          As the code is written as a loop, we would write 10 bytes, the last one carrying only 1 bit. If we want to avoid that and optimize the long case to interpret the continuation bit in the last byte differently (as part of the data), the writer must do that as well. Ideally we would unroll both loops.

          Robert Muir added a comment -

          As the code is written as a loop, we would write 10 bytes, the last one carrying only 1 bit.

          I think this is ok: otherwise it's not a variable-length integer but something else
          (with a special case where the 9th byte's high bit means sign bit instead of continuation bit).

          Either way we should either fix it to assert value >= 0 in the writer, or make it work.
          Ideally we would do that for both vInt and vLong, but the problem is that some things, like term vectors,
          sometimes write negative vInts (since it does startOffset - lastEndOffset; if you have any synonyms
          you get tons of huge 5-byte vInts in your term vectors; see the sketch below).

          But currently it's inconsistent: negatives don't trip any assert for either vInt/vLong at write-time,
          but do at read-time for vLong only.
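
          A hedged illustration of that delta going negative (numbers invented for the example, not taken from the term vectors code):

          // Illustrative only: a synonym re-starts at the original token's
          // start offset, so the delta the term vectors writer encodes
          // (startOffset - lastEndOffset) can go negative.
          int lastEndOffset = 5;   // end offset of "quick"
          int startOffset   = 0;   // its synonym begins at the same offset
          int delta = startOffset - lastEndOffset;  // -5: a 5-byte vInt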

          Robert Muir added a comment -

          We had no negative vLongs until now, so that was no issue.

          I think we should still avoid this!

          I think the javadocs should still say: negatives are unsupported.
          Maybe we can fix Lucene 4's term vectors format to never write negatives,
          and in version 5, when 3.x indexes no longer need to be read, we can assert >= 0 at write-time?

          Uwe Schindler added a comment -

          The correct fix for longs would be to add one more unrolled loop iteration (for the 10th byte). Then it would work with negative numbers, like vInts do. But very wasteful.

          About negative vInts: we unfortunately have them in pre-4.0 formats as version numbers. I think e.g. the stored fields reader reads the first vInt from the file; if it's >= 0 it's a very ancient format and is some offset/count/foo/bar (no idea). If it's negative, it's a version number.

          We should fix the unrolled vLong reader in 3.x and trunk to have one more loop iteration so it's consistent with the writer (the assert stays; it's an assert for the continuation bit not being set on the last byte).

          We should try to never write negative numbers for post-4.0 formats, maybe add a conditional assert (like for UTF-8 strings), so preflex can write using negative vInts and not trip the assert.

          Robert Muir added a comment - edited

          We should try to never write negative numbers for post-4.0 formats, maybe add a conditional assert (like for UTF-8 strings), so preflex can write using negative vInts and not trip the assert.

          We still have this problem then with negatives in 4.0 formats (like stored fields).

          I think we should fix them all to use a codec header, and fix the term vectors writer.

          We could have a conditionalized assert for this in MockDirectoryWrapper or something like that: it could be disabled when preflex is used, for now.

          Uwe Schindler added a comment -

          Here is the patch that simply fixes the "Schindler VM unroll bug".

          Uwe Schindler made changes -
          Field Original Value New Value
          Attachment LUCENE-3738.patch [ 12512570 ]
          Uwe Schindler added a comment -

          Just to conclude: the bug with reading negative vLongs affects only DataInputs that don't override readVLong. So e.g. BufferedIndexInput is not affected. So when you read an index from MMap, ByteArrayDataInput, or InputStreamDataInput, you will hit the bug.

          Uwe Schindler added a comment -

          The BufferedIndexInput one also has this bug, only affecting reads at buffer boundaries (if the 10 bytes of a full vLong are not entirely in the buffer); in that case it throws an ArrayIndexOutOfBoundsException:

          public long readVLong() throws IOException {
            // Guard only guarantees 9 buffered bytes, but a negative vLong
            // occupies 10, so the loop below can walk past bufferLength and
            // throw ArrayIndexOutOfBoundsException.
            if (9 <= bufferLength-bufferPosition) {
              byte b = buffer[bufferPosition++];
              long i = b & 0x7F;
              for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = buffer[bufferPosition++];
                i |= (b & 0x7FL) << shift;
              }
              return i;
            } else {
              // Slow path: DataInput's byte-at-a-time loop.
              return super.readVLong();
            }
          }
          
          Uwe Schindler added a comment -

          Patch fixing both to be consistent/not-buggy.

          Uwe Schindler made changes -
          Attachment LUCENE-3738.patch [ 12512573 ]
          Michael McCandless added a comment -

          I think we should disallow (assert) writing negative vLongs and, ideally, disallow negative vInts also, by fixing all the places that rely on this (but carefully... preflexrw will need a backdoor)...

          Separately: can't we strengthen the last assert in readVInt to verify the top 4 bits are 0, not just the top bit? We have 28 bits from the first four bytes at that point, so the top 4 bits of the last byte should be unused (assert (b & 0xf0) == 0).
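
          The arithmetic behind that stronger check, as a hedged sketch (class and method names are hypothetical):

          // Bytes 1-4 of a vInt carry 4 * 7 = 28 payload bits; a 32-bit int
          // needs only 4 more, which live in the low nibble of byte 5, so
          // its upper 4 bits (continuation bit included) must be zero.
          class VIntTailCheck {
            static int finishVInt(int soFar, byte fifth) {
              assert (fifth & 0xF0) == 0 : "top 4 bits of 5th vInt byte must be 0";
              return soFar | ((fifth & 0x0F) << 28);
            }
          }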

          Uwe Schindler added a comment -

          If we disallow it, it should be a hard check (not an assert), as the data is coming from a file (and somebody could have used a hex editor). Otherwise the reader will crash later...

          Robert Muir added a comment -

          After investigating: it's difficult to prevent negative offsets, even after fixing the term vectors writer (LUCENE-3739).

          At first I tried a simple assert in BaseTokenStreamTestCase:

                  assertTrue("offsets must not go backwards", offsetAtt.startOffset() >= lastStartOffset);
                  lastStartOffset = offsetAtt.startOffset();
          

          Then these analyzers failed:

          • MockCharFilter itself had a bug, but that's easy to fix (LUCENE-3741)
          • SynonymFilter failed sometimes (LUCENE-3742) because it wrote zeros for offsets in situations like "a -> b c"
          • (edge) n-gram tokenizers failed, because ngrams(1,2) of "ABCD" are not A, AB, B, BC, C, CD, D but instead A, B, C, D, AB, BC, CD, ...
          • (edge) n-gram filters failed for similar reasons.
          • WordDelimiterFilter failed, because it doesn't break "AB" into A, AB, B but instead A, B, AB
          • TrimFilter failed when 'offsets changing' is enabled, because if you have " rob", "robert" as synonyms, it trims the first, and the second's offsets "go backwards"

          These are all bugs.

          In general I think offsets, once set, should not be changed, because filters don't have access to any CharFilter's
          offset correction (correctOffset()) anyway, so they shouldn't be mucking with offsets.

          So really: only the creator of tokens should set the offsets. And if that's a filter, it should do so in a standard way,
          only inherited from existing offsets, not via 'offset mathematics', and not A, AB, B in some places and A, B, AB in others.

          Really, I think we need to step it up if we want highlighting to be a first-class citizen in Lucene: nothing checks the offsets anywhere at all,
          not even to check/assert that they aren't negative, and there are few tests... all we have is some newish stuff in BaseTokenStreamTestCase and
          a few trivial test cases.

          On the other hand, position increment's impl, for example, actually throws an exception if you give it something like a negative number...

          Robert Muir added a comment -

          LUCENE-3876/LUCENE-3879 reveal more situations where we must write negatives with the current encodings,
          because we steal bits from things like positions (for payloads) and docids too (at least in skip data?).

          So it's sometimes possible these are encoded as negatives (see the sketch below).

          I think Uwe should commit his fixes?
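
          A hedged illustration of how a stolen flag bit can turn a value negative (values invented for the example; not the actual postings code):

          // Illustrative only: shifting left to make room for a flag bit
          // can overflow into the sign bit, and the combined value is then
          // written as a negative (5-byte) vInt.
          int position = (1 << 30) + 123;      // large but legal position
          int withFlag = (position << 1) | 1;  // sign bit set: now negative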

          Michael McCandless added a comment -

          Hmm... I think we should think about it more.

          I.e., we apparently never write a negative vLong today... and I'm not sure we should start allowing it...?

          Robert Muir added a comment -

          Well, that differs from the title of the issue (consistency with vInt).

          I don't see how we can avoid negative vInts. I think it's OK to be inconsistent with vLong,
          but it should not be something we assert only at read-time. It should be asserted on write
          so that problems are found immediately.

          Michael McCandless added a comment -

          I don't see how we can avoid negative vInts. I think it's OK to be inconsistent with vLong,
          but it should not be something we assert only at read-time. It should be asserted on write
          so that problems are found immediately.

          +1

          I think we are stuck with negative vInts, as trappy as they are (5 bytes!!).

          Let's not make it worse by allowing negative vLongs. But let's assert that at write time (and read time)...

          I think inconsistency here is the lesser evil.

          Michael McCandless added a comment -

          Patch: just adding an assert in writeVLong that i >= 0, and also strengthening the existing assert in readVInt to check that the top 4 (not just the top 1) bits are 0.

          Michael McCandless made changes -
          Attachment LUCENE-3738.patch [ 12518689 ]
          Uwe Schindler added a comment -

          I'll just repeat myself:

          If we disallow it, it should be a hard check (not an assert), as the data is coming from a file (and somebody could have used a hex editor). Otherwise the reader will crash later...

          Mike: If you fix the unrolled loops, please also add the checks to the other implementations in Buffered* and so on. My original patch fixed those. Please include that patch.

          Michael McCandless added a comment -

          If we disallow it, it should be a hard check (not an assert), as the data is coming from a file (and somebody could have used a hex editor). Otherwise the reader will crash later...

          Hmm, I don't think we should do that.

          If you go and edit your index with a hex editor... there are no guarantees on what may ensue!

          Mike: If you fix the unrolled loops, please also add the checks to the other implementations in Buffered* and so on.

          I don't think the unrolled loops or other impls of write/readVLong are wrong? The javadocs state clearly that negatives are not supported. All we're doing here is adding an assert to back up that javadoc statement.

          Uwe Schindler added a comment -

          Hmm, I don't think we should do that.

          It costs nothing, as a standard vInt reads only 1 or 2 bytes; if you really read all the way to the last byte, your vInts are so big that it might even be better not to use vInts at all. And: the not-unrolled loops always do the check.

          If you go and edit your index with a hex editor... there are no guarantees on what may ensue!

          Disk I/O can produce wrong data. We must check this where we can, and it costs nothing, which is the case here (see above).

          I was already talking with Robert: there are other asserts in the index reading code at places completely outside any loops, executed only once when the index is opened. Consistency checks of the read bytes are first priority; otherwise you can even produce endless loops in some places. Of course not when you have tight loops, but things like checking that the document count is in line with e.g. some other value from liveDocs are essential. I will open an issue for that. The older Lucene formats are much better secured, but trunk is horrible, just because some people here seem to want to avoid any check, which is also a security issue when you e.g. download indexes through network connections and a man in the middle modifies the stream.

          Robert Muir added a comment -

          Otherwise you can even produce endless loops in some places. Of course not when you have tight loops, but things like checking that the document count is in line with e.g. some other value from liveDocs are essential.

          I agree there are a ton of places (essentially all metadata) where we should be using real checks (not asserts).

          I see Mike's point, though: readVLong() is very general, so someone could be using it where performance is important.
          It just so happens it's mostly only used today for metadata-type things (except maybe terms dictionary stats and a few other places).

          Uwe Schindler added a comment -

          You misunderstood my comment:

          I see Mike's point though: readVLong() is very general, so someone could be using it where performance is important.

          The check is only omitted in the unrolled loop; the for-loop still contains the check. In that case it also handles maybe-too-long vInts correctly, which the unrolled code will never do. The unrolled code also has a bug in that it handles negative longs wrongly, but that should be prevented (in my opinion also for ints).

          The current assert for both long and int is completely harmless, as it will only be executed if the vInt/vLong has the maximum number of bytes, which is very unlikely. And as said before, the check is done in the loop-based code, too. And perf comparisons showed that the speed of the unrolled loop and the standard loop is identical, so what are you talking about?

          The good thing here is that in the unrolled loops we can harden the check for negative vInts: thanks to the unrolling we already have a separate code branch, so we can modify the check (which was always done before my unrolling) into the better check.

          I had the idea back then to unroll that loop because of Java bugs, so the bug was caused by me and I want to fix it the correct way, definitely without losing anything. Is this so hard to understand?

          Uwe Schindler made changes -
          Assignee Uwe Schindler [ thetaphi ]
          Michael McCandless added a comment -

          The check is only omitted in the unrolled loop; the for-loop still contains the check.

          I'm confused... I don't see how/where BufferedIndexInput.readVLong is
          checking for a negative result now...? Are you proposing adding an if
          into that method? That's what I don't want to do... e.g., readVLong is
          called 3 times per term we decode (Lucene40 codec); it's a very
          low-level API... other codecs may very well call it more often. I don't
          think we should add an if inside BII.readVLong.

          Or... maybe you are saying you just want the unrolled code to handle
          the negative vLong case (i.e., unroll the currently missing 10th cycle),
          and not add an if to BufferedIndexInput.readVLong? And then "for
          free" we can add a real if (not an assert) when that 10th cycle is hit
          (i.e., if we get to that 10th byte, throw an exception)? I think that
          makes sense!

          there are other asserts in the index reading code at places completely outside any loops, executed only once when the index is opened.

          +1 to make those real checks, as long as the cost is vanishingly
          small.

          which is also a security issue when you e.g. download indexes through network connections and a man in the middle modifies the stream.

          I don't think it's our job to protect against / detect that.

          Disk I/O can produce wrong data.

          True, but all bets are off if that happens: you're gonna get all sorts
          of crazy exceptions out of Lucene. We are not a filesystem.

          Uwe Schindler added a comment - edited

          The check is only omitted in the unrolled loop; the for-loop still contains the check.

          I'm confused... I don't see how/where BufferedIndexInput.readVLong is
          checking for a negative result now...?

          I mean that the actual check "(b & 0x80) != 0" is also in the original while loop. The original while loop then simply proceeds with reading bytes until the highest bit is zero. The unrolled loop behaves differently (and that's a real bug), because it will silently not read those remaining bytes, so the file pointer is on a different byte after the call. This also affects readVInt!!!

          In my opinion, we should unroll all readVInt/readVLong loops so they all behave 100% identically! And in the case of the last byte read (where the current assert is), throw an exception. If we don't unroll all readVInts, we have to somehow also make the loop exit after too many bytes are read, which would be a costly extra check in the loop; that's the reason why I want to unroll all loops to fail after 5 or 9 bytes.

          Michael McCandless added a comment -

          In my opinion, we should unroll all readVInt/readVLong loops so they all behave 100% identically!

          +1

          Uwe Schindler added a comment - edited

          Patch that fixes all DataInput and DataOutput subclasses to:

          • on writeVLong, add Mike's assert for positive longs
          • unroll all readVInts and readVLongs (a sketch of the shape follows this list)
          • the readV* methods are now implemented more straightforwardly: they do the same branching and fail at the end with an IOException (the last statement)
          • I made all read methods in BufferedIndexInput final, as they should never be overridden (only readInternal/seekInternal). This could improve performance, as HotSpot knows finalness better -> inlining.
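
          A hedged sketch of that unrolled shape (assuming a readByte() as on DataInput; this is an illustration, not the committed patch):

          public int readVInt() throws IOException {
            byte b = readByte();
            if (b >= 0) return b;               // common case: 1 byte
            int i = b & 0x7F;
            b = readByte();
            i |= (b & 0x7F) << 7;
            if (b >= 0) return i;
            b = readByte();
            i |= (b & 0x7F) << 14;
            if (b >= 0) return i;
            b = readByte();
            i |= (b & 0x7F) << 21;
            if (b >= 0) return i;
            b = readByte();
            i |= (b & 0x0F) << 28;              // only 4 payload bits remain
            if ((b & 0xF0) == 0) return i;
            // the failure branch is the last statement, as described above
            throw new IOException("Invalid vInt detected (too many bits)");
          }
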
          Uwe Schindler made changes -
          Attachment LUCENE-3738.patch [ 12518846 ]
          Michael McCandless added a comment -

          +1

          Looks awesome Uwe!

          Uwe Schindler added a comment -

          I added a test, also checking ByteArrayDataInput and the exceptions.

          I will commit this now and then backport.

          Uwe Schindler made changes -
          Attachment LUCENE-3738.patch [ 12518858 ]
          uschindler committed 1302238 (6 files)
          Uwe Schindler added a comment -

          Committed trunk revision: 1302238

          Uwe Schindler added a comment -

          Committed 3.x revision: 1302242

          Uwe Schindler made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          uschindler committed 1302243 (1 file)
          Reviews: none

          LUCENE-3738: Add missing backwards break

          uschindler committed 1302244 (3 files)
          Reviews: none

          Merged revision(s) 1302243 from lucene/dev/branches/branch_3x:
          LUCENE-3738: Add missing backwards break

          Yonik Seeley added a comment -

          Regarding unrolling... it hasn't always proved faster in the past, especially w.r.t. vInt.

          My first try was in 2005: http://www.lucidimagination.com/search/document/6d2efedb4dde07d#2a896a9a9adc3f2d
          And again in 2006: https://issues.apache.org/jira/browse/LUCENE-639

          Uwe Schindler added a comment -

          Yonik, the unrolling was added because of a recent Java 6 HotSpot bug (which unrolled the loop itself, but wrongly). What Mike saw was the strange thing that the already-unrolled code (present since 3.1) behaved differently before/after a slight code change done in this issue.

          Robert Muir added a comment -

          The original unrolling here was to dodge a JVM bug, possibly present
          in all Java versions from ... Java 6u20 until 6u29 or so?

          I don't know of another solution besides unrolling
          to work around that loop bug.

          I don't like the workaround, but it does seem warranted at this
          point to prevent index corruption, since these versions of
          Java are really recent.

          Uwe Schindler added a comment -

          The problem might come from the asserts. They are not completely non-existent: the classloader does not remove them; the JVM relies on HotSpot removing them when it sees "dead" code, via the "static final boolean $assertionsDisabled" field in every class. The method in this class got too complex because of that. We should rather remove the assert, and not use asserts in small methods at all (because they prevent inlining): http://goo.gl/KjrXe
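
          Roughly what javac desugars an assert into, as a hedged sketch (simplified; class name hypothetical). The branch stays in the bytecode even when assertions are disabled, inflating the method size that HotSpot's inlining heuristics consider:

          class AssertDesugarSketch {
            static final boolean $assertionsDisabled =
                !AssertDesugarSketch.class.desiredAssertionStatus();

            static void writeVLongCheck(long i) {
              // source code was: assert i >= 0;
              if (!$assertionsDisabled && !(i >= 0)) {
                throw new AssertionError();
              }
            }
          }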

          Uwe Schindler made changes -
          Attachment ByteArrayDataInput.java.patch [ 12520155 ]
          uschindler committed 1305909 (1 file)
          Reviews: none

LUCENE-3738: Remove the asserts in this class (in most cases not useful, only useful if limit < b.length)

          Michael McCandless added a comment -

          +1 to remove those asserts... let's see if this fixes the slowdown the nightly builds hit on 3/18: http://people.apache.org/~mikemccand/lucenebench/IntNRQ.html

          uschindler committed 1305911 (4 files)
          Reviews: none

          Merged revision(s) 1305909 from lucene/dev/trunk:
LUCENE-3738: Remove the asserts in this class (in most cases not useful, only useful if limit < b.length)

          Uwe Schindler added a comment -

          Committed the assert removal in revisions: trunk 1305909, 3.x 1305911

          If this does not help, we can revert again. But the checks are in my opinion not really useful and too risky.

          Michael McCandless added a comment -

          Removing the asserts apparently didn't change the perf...

          I can reproduce the slowdown in a separate test (before/after this commit):

                          Task    QPS base StdDev base    QPS vInt StdDev vInt      Pct diff
                        IntNRQ        7.11        0.89        6.73        0.58  -23% -   17%
                       Prefix3       16.07        0.96       15.65        0.72  -12% -    8%
                      Wildcard       20.14        0.91       19.67        0.77  -10% -    6%
                      PKLookup      154.62        5.08      151.11        2.82   -7% -    2%
                        Fuzzy1       85.24        1.53       83.87        1.18   -4% -    1%
                        Fuzzy2       44.11        1.03       43.96        0.44   -3% -    3%
                      SpanNear        3.23        0.11        3.22        0.07   -5% -    5%
                TermBGroup1M1P       42.35        0.49       42.43        1.43   -4% -    4%
                       Respell       65.11        1.91       65.27        1.27   -4% -    5%
                    AndHighMed       54.18        4.04       54.50        2.27  -10% -   13%
                   TermGroup1M       31.27        0.35       31.46        0.63   -2% -    3%
                  TermBGroup1M       45.01        0.33       45.37        1.42   -3% -    4%
                   AndHighHigh       13.35        0.71       13.46        0.50   -7% -   10%
                          Term       82.71        3.12       83.56        2.33   -5% -    7%
                     OrHighMed       10.66        0.67       10.78        0.44   -8% -   12%
                    OrHighHigh        7.08        0.42        7.19        0.26   -7% -   11%
                  SloppyPhrase        5.11        0.24        5.20        0.31   -8% -   13%
                        Phrase       11.14        0.75       11.40        0.50   -8% -   14%
          

          But then Uwe made a patch (I'll attach) reducing the bytecode of the
          unrolled methods:

                          Task    QPS base StdDev base    QPS vInt StdDev vInt      Pct diff
                      SpanNear        3.24        0.13        3.18        0.07   -7% -    4%
                        Phrase       11.34        0.68       11.13        0.38  -10% -    7%
                  SloppyPhrase        5.17        0.23        5.08        0.18   -9% -    6%
                TermBGroup1M1P       41.92        0.80       41.57        0.94   -4% -    3%
                   TermGroup1M       30.74        0.68       30.81        0.96   -5% -    5%
                          Term       80.87        3.52       81.29        2.05   -6% -    7%
                  TermBGroup1M       43.94        0.93       44.17        1.32   -4% -    5%
                    AndHighMed       53.71        2.62       54.21        1.97   -7% -    9%
                   AndHighHigh       13.20        0.42       13.41        0.41   -4% -    8%
                       Respell       65.37        2.70       66.53        3.29   -7% -   11%
                        Fuzzy1       84.29        2.11       86.44        3.36   -3% -    9%
                      PKLookup      149.81        4.20      153.87        9.46   -6% -   12%
                    OrHighHigh        7.19        0.28        7.40        0.48   -7% -   13%
                     OrHighMed       10.82        0.43       11.16        0.73   -7% -   14%
                        Fuzzy2       43.72        0.96       45.24        2.03   -3% -   10%
                      Wildcard       18.96        1.00       20.05        0.39   -1% -   13%
                       Prefix3       14.96        0.83       15.89        0.27   -1% -   14%
                        IntNRQ        5.89        0.58        6.95        0.17    4% -   34%
          

          So... I think we should commit it!

          Michael McCandless added a comment -

          Uwe's patch to reduce bytecode for the readVInt/Long methods...

          Michael McCandless made changes -
          Attachment LUCENE-3738.patch [ 12520629 ]
          Uwe Schindler added a comment -

          I can commit that, OK?

          We should also do this in 3.x. Robert, are you fine with that? Otherwise this issue is only half committed to 3.x. It's no risk.

          Robert Muir added a comment -

          I'm going with your instinct on this one. It would be bad to have a slowdown for 3.6,
          but I want the negative vLong checks, too.

          Uwe Schindler added a comment -

          After looking at the code for a while, I have a further minor improvement: the most common case (int < 128) now exits directly after reading the byte, without any & or variable-assignment operations.

          Mike: Can you look at it and maybe do a quick test? I would like to commit this this evening to both branches.

          Uwe Schindler made changes -
          Attachment LUCENE-3738-improvement.patch [ 12520780 ]
          Uwe Schindler made changes -
          Attachment LUCENE-3738.patch [ 12520629 ]
          Uwe Schindler made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Uwe Schindler made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          Michael McCandless added a comment -

          Thanks Uwe, I'll test!

          Michael McCandless added a comment -

          Alas, the results are now all over the place! And I went back to the prior patch and tried to reproduce the above results... and the results are still all over the place. I think we are chasing Java ghosts at this point...

          Uwe Schindler added a comment -

          What does your comment mean? Good or bad?

          Uwe Schindler added a comment -

          Mike, I was away from home and did not understand your comment; now it's clear: you cannot reproduce the speedup from the last patch, nor can you see a difference with the current patch.

          I would suggest that I commit this now to trunk, we test for a few nights, and then commit it to 3.x (Robert needs to backport Ivy to 3.6, so we have some time).

          I will commit this later before going to sleep, so we see results tomorrow.

          Michael McCandless added a comment -

          Sorry Uwe, that was exactly it: I don't know what to conclude from the perf runs anymore.

          But +1 for your new patch: it ought to be better since the code is simpler.

          Uwe Schindler added a comment -

          Committed trunk revision: 1307910

          I will keep this issue open for merging to 3.x the next days.

          Robert Muir added a comment -

          Since there is no performance regression, let's do trunk only.

          uschindler committed 1308117 (2 files)
          uschindler committed 1308119 (5 files)
          Uwe Schindler added a comment -

          I added a random test for vInts and vLongs in TestIndexInput. I especially wanted to test the long case, which looked broken in DataOutput (but it was correct, though only because of the way Java handles negative values when casting to long; I just made explicit what happens, see the sketch below).
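
          The casting subtlety mentioned above, as a small sketch (values illustrative):

          // A negative int is sign-extended when widened to long, so
          // without a mask the upper 32 bits get silently set.
          int bits = 0x80000000;              // Integer.MIN_VALUE: negative
          long wrong = bits;                  // 0xFFFFFFFF80000000L
          long right = bits & 0xFFFFFFFFL;    // 0x0000000080000000L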

          Robert: The tests last night showed no change, so I have no preference.

          Robert Muir added a comment -

          Uwe, is this one good to go now? Can we mark it resolved?

          Uwe Schindler added a comment -

          Commit the patch or not? I have no preference. Perf is sometimes slightly better (in microbenchmarks; on my slow-I/O system, always).

          Robert Muir added a comment -

          If there is no real measurable performance regression, can we just commit it to trunk?

          I really am afraid of last minute .store optimizations.

          Uwe Schindler added a comment -

          Part of the optimization has already been in 3.x for 4 days; this is just another round of optimization.

          Uwe Schindler added a comment - edited

          I don't care; it's committed to trunk. The new random test I created yesterday was committed to 3.x, but not the latest code change/optimization. You are the release manager, do what you prefer.

          Robert Muir added a comment -

          Let's resolve the issue.

          Uwe Schindler added a comment -

          The latest code change was not backported.

          Uwe Schindler made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Robert Muir made changes -
          Link This issue is related to LUCENE-4641 [ LUCENE-4641 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Michael McCandless
             • Votes:
               0
               Watchers:
               2

              Dates

              • Created:
                Updated:
                Resolved:
