Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 4.4
    • Fix Version/s: 4.5, 6.0
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      a spinnoff from LUCENE-5120 where the analyzing suggester modified a returned output from and FST (BytesRef) which caused sideffects in later execution.

      I added an assertion into the FST that checks if a cached root arc is modified and in-fact this happens for instance in our MemoryPostingsFormat and I bet we find more places. We need to think about how to make this less trappy since it can cause bugs that are super hard to find.

      1. LUCENE-5152.patch
        1.0 kB
        Michael McCandless
      2. LUCENE-5152.patch
        8 kB
        Simon Willnauer
      3. LUCENE-5152.patch
        11 kB
        Simon Willnauer
      4. LUCENE-5152.patch
        4 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        here is a patch with the assert and a nocommit in MemoryPostingsFormat

        Show
        Simon Willnauer added a comment - here is a patch with the assert and a nocommit in MemoryPostingsFormat
        Hide
        Robert Muir added a comment -

        So its really just a BytesRef bug right? Because root arcs cache uses copyFrom, but this does a shallow copy of the output/nextFinalOutput, and in this case they are pointing to the same bytes (which gives someone the chance to muck with them).

        Show
        Robert Muir added a comment - So its really just a BytesRef bug right? Because root arcs cache uses copyFrom, but this does a shallow copy of the output/nextFinalOutput, and in this case they are pointing to the same bytes (which gives someone the chance to muck with them).
        Hide
        Han Jiang added a comment -

        So its really just a BytesRef bug right?

        +1, so tricky

        Show
        Han Jiang added a comment - So its really just a BytesRef bug right? +1, so tricky
        Hide
        Simon Willnauer added a comment -

        So its really just a BytesRef bug right?

        well in theory that is true. Yet, if you have an arc in your hand you can basically change it by passing it to a subsequent call to readNextTargetArc or whatever that would override the values completely. BytesRef is tricky but not the root cause of this issue. I do think that if you call:

         
        public Arc<T> findTargetArc(int labelToMatch, Arc<T> follow, Arc<T> arc, BytesReader in) throws IOException
        

        it should always fill the arc that is provided so everything you do with it is up to you. Aside of this I agree BytesRef is tricky and we should fix if possible.

        Show
        Simon Willnauer added a comment - So its really just a BytesRef bug right? well in theory that is true. Yet, if you have an arc in your hand you can basically change it by passing it to a subsequent call to readNextTargetArc or whatever that would override the values completely. BytesRef is tricky but not the root cause of this issue. I do think that if you call: public Arc<T> findTargetArc(int labelToMatch, Arc<T> follow, Arc<T> arc, BytesReader in) throws IOException it should always fill the arc that is provided so everything you do with it is up to you. Aside of this I agree BytesRef is tricky and we should fix if possible.
        Hide
        Simon Willnauer added a comment -

        here is a patch that adds a #deepCopy method to Outputs that allows me to do a deep copy if the actual arc that is returned is a cached root arc. I think we should never return a pointer into the root arcs though. this is way to dangerous! I haven't run any perf tests will do once I am on my worksstation again.. if somebody beats me go ahead!

        Show
        Simon Willnauer added a comment - here is a patch that adds a #deepCopy method to Outputs that allows me to do a deep copy if the actual arc that is returned is a cached root arc. I think we should never return a pointer into the root arcs though. this is way to dangerous! I haven't run any perf tests will do once I am on my worksstation again.. if somebody beats me go ahead!
        Hide
        Jack Krupansky added a comment -

        immutale

        Is that the Latin term for "immutable"??

        (spelling in summary line)

        Show
        Jack Krupansky added a comment - immutale Is that the Latin term for "immutable"?? (spelling in summary line)
        Hide
        Robert Muir added a comment -

        I guess one question would be if its FSTs job to defend against bytesref bugs.

        This issue was driven because there was a bytesref bug for suggester payloads.
        The same kind of bug could happen, e.g. if someone uses DirectPostings and modifies the payload coming back from the postings lists.

        Should we clone payload bytes in the postings lists too? what about term dictionaries?

        At some point then BytesRef is useless as a reference class because of a few bad apples trying to use it as a ByteBuffer.
        Ideally we would remove code that abuses BytesRef as a ByteBuffer instead.

        I don't mean to pick on your issue Simon, and it doesnt mean I object to the patch (though I wonder about performance implications), I just see this as one of many in a larger issue.

        Show
        Robert Muir added a comment - I guess one question would be if its FSTs job to defend against bytesref bugs. This issue was driven because there was a bytesref bug for suggester payloads. The same kind of bug could happen, e.g. if someone uses DirectPostings and modifies the payload coming back from the postings lists. Should we clone payload bytes in the postings lists too? what about term dictionaries? At some point then BytesRef is useless as a reference class because of a few bad apples trying to use it as a ByteBuffer. Ideally we would remove code that abuses BytesRef as a ByteBuffer instead. I don't mean to pick on your issue Simon, and it doesnt mean I object to the patch (though I wonder about performance implications), I just see this as one of many in a larger issue.
        Hide
        Simon Willnauer added a comment -

        Should we clone payload bytes in the postings lists too? what about term dictionaries?

        I agree we can be less conservative here and just use the payload and copy it into a new BytesRef or whatever is needed. I will bring up a new patch.

        At some point then BytesRef is useless as a reference class because of a few bad apples trying to use it as a ByteBuffer. Ideally we would remove code that abuses BytesRef as a ByteBuffer instead.

        agreed again. We just need to make sure that we have asserts in place that check for that.

        I don't mean to pick on your issue Simon, and it doesnt mean I object to the patch (though I wonder about performance implications), I just see this as one of many in a larger issue.

        no worries. I am really concerned about this since it took me forever to figure out the problems this caused. I just wanna have an infra in place that catches those problems. I am more concerned about users that get bitten by this. I agree we should figure out the bigger problem eventually but lets make sure that we fix the bad apples first

        Show
        Simon Willnauer added a comment - Should we clone payload bytes in the postings lists too? what about term dictionaries? I agree we can be less conservative here and just use the payload and copy it into a new BytesRef or whatever is needed. I will bring up a new patch. At some point then BytesRef is useless as a reference class because of a few bad apples trying to use it as a ByteBuffer. Ideally we would remove code that abuses BytesRef as a ByteBuffer instead. agreed again. We just need to make sure that we have asserts in place that check for that. I don't mean to pick on your issue Simon, and it doesnt mean I object to the patch (though I wonder about performance implications), I just see this as one of many in a larger issue. no worries. I am really concerned about this since it took me forever to figure out the problems this caused. I just wanna have an infra in place that catches those problems. I am more concerned about users that get bitten by this. I agree we should figure out the bigger problem eventually but lets make sure that we fix the bad apples first
        Hide
        Simon Willnauer added a comment -

        this patch only adds the assert and fixes the problems in MemoryPostings. This could solve the immediate issue and adds some more asserts to make sure we realise if something modifies the arcs outputs.

        Show
        Simon Willnauer added a comment - this patch only adds the assert and fixes the problems in MemoryPostings. This could solve the immediate issue and adds some more asserts to make sure we realise if something modifies the arcs outputs.
        Hide
        David Smiley added a comment -

        I've been bitten by bugs in my code related to sharing BytesRefs. It's not clear in the APIs who "owns" a BytesRef received from a method call. In other words, can a recipient know wether it can safely cache it for use later or wether it's not safe because it could change suddenly later. It's not even just that since there are two levels of ownership, the BytesRef instance and then the underlying char array. This could be partially addressed with more documentation; I did that in my own code but that didn't stop another bug.

        I have a rough idea in which a BytesRef has a boolean flag or two to indicate how share-able it is, and then a conditional clone method that either returns the current instance or returns shallow copy, or does a deep copy depending on those flags. I dunno, just a half-baked idea.

        Show
        David Smiley added a comment - I've been bitten by bugs in my code related to sharing BytesRefs. It's not clear in the APIs who "owns" a BytesRef received from a method call. In other words, can a recipient know wether it can safely cache it for use later or wether it's not safe because it could change suddenly later. It's not even just that since there are two levels of ownership, the BytesRef instance and then the underlying char array. This could be partially addressed with more documentation; I did that in my own code but that didn't stop another bug. I have a rough idea in which a BytesRef has a boolean flag or two to indicate how share-able it is, and then a conditional clone method that either returns the current instance or returns shallow copy, or does a deep copy depending on those flags. I dunno, just a half-baked idea.
        Hide
        ASF subversion and git services added a comment -

        Commit 1509805 from Simon Willnauer in branch 'dev/trunk'
        [ https://svn.apache.org/r1509805 ]

        LUCENE-5152: Fix MemoryPostingsFormat to not modify borrowed BytesRef from FSTEnum

        Show
        ASF subversion and git services added a comment - Commit 1509805 from Simon Willnauer in branch 'dev/trunk' [ https://svn.apache.org/r1509805 ] LUCENE-5152 : Fix MemoryPostingsFormat to not modify borrowed BytesRef from FSTEnum
        Hide
        Simon Willnauer added a comment -

        I committed the latest patch to get the assert in and fix memory postings format. I will keep this issue open for further discussions

        Show
        Simon Willnauer added a comment - I committed the latest patch to get the assert in and fix memory postings format. I will keep this issue open for further discussions
        Hide
        ASF subversion and git services added a comment -

        Commit 1509812 from Simon Willnauer in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1509812 ]

        LUCENE-5152: Fix MemoryPostingsFormat to not modify borrowed BytesRef from FSTEnum

        Show
        ASF subversion and git services added a comment - Commit 1509812 from Simon Willnauer in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1509812 ] LUCENE-5152 : Fix MemoryPostingsFormat to not modify borrowed BytesRef from FSTEnum
        Hide
        Michael McCandless added a comment -

        I know we make no claims about performance when assertions are enabled...

        But it seems like this commit could substantially slow things down, since we call assertRootArcs, which is O(N) where N = number of root arcs, for every call to findTargetArc = likely called in hotspots.

        Maybe, we could move the call to e.g. FST.getBytesReader?

        Show
        Michael McCandless added a comment - I know we make no claims about performance when assertions are enabled... But it seems like this commit could substantially slow things down, since we call assertRootArcs, which is O(N) where N = number of root arcs, for every call to findTargetArc = likely called in hotspots. Maybe, we could move the call to e.g. FST.getBytesReader?
        Hide
        Simon Willnauer added a comment -

        how is getBytesReader related to the root arcs? unless this slows things down when assertions are not enabled I'd vote for not moving it. This is a very trappy thing and we should catch any violation IMO very quickly.

        Show
        Simon Willnauer added a comment - how is getBytesReader related to the root arcs? unless this slows things down when assertions are not enabled I'd vote for not moving it. This is a very trappy thing and we should catch any violation IMO very quickly.
        Hide
        Michael McCandless added a comment -

        how is getBytesReader related to the root arcs?

        Well, anything that works with the FST APIs needs to call
        getBytesReader first, e.g. MemoryPF does this every time you pull a
        TermsEnum from it.

        This is a very trappy thing and we should catch any violation IMO very quickly.

        I agree it's trappy and it's great to add this check.

        I'm simply proposing moving it to less of a hot-spot, and I don't
        think this will affect how quickly we catch violations but should
        reduce the cost of this added assertion.

        In fact, I think findTargetArc isn't great in this regard; e.g. I
        think MemoryPF only uses this API if the caller calls seekExact? So I
        think the current location of the assert is both more costly and lower
        coverage than if we moved it to FST.getBytesReader.

        Show
        Michael McCandless added a comment - how is getBytesReader related to the root arcs? Well, anything that works with the FST APIs needs to call getBytesReader first, e.g. MemoryPF does this every time you pull a TermsEnum from it. This is a very trappy thing and we should catch any violation IMO very quickly. I agree it's trappy and it's great to add this check. I'm simply proposing moving it to less of a hot-spot, and I don't think this will affect how quickly we catch violations but should reduce the cost of this added assertion. In fact, I think findTargetArc isn't great in this regard; e.g. I think MemoryPF only uses this API if the caller calls seekExact? So I think the current location of the assert is both more costly and lower coverage than if we moved it to FST.getBytesReader.
        Hide
        Simon Willnauer added a comment -

        In fact, I think findTargetArc isn't great in this regard; e.g. I
        think MemoryPF only uses this API if the caller calls seekExact? So I
        think the current location of the assert is both more costly and lower
        coverage than if we moved it to FST.getBytesReader.

        I am not sure if that is true actually since findTargetArc is the only place where we actually use this cache? I mean I don't wanna be in the way of moving it but what kind of message are we sending here. We don't guarantee that asserts are fast but they won't slow you down massively? I mean we can call this assert in the getBytesReader as well to increase coverage, can you elaborate what you are concerned about?

        Show
        Simon Willnauer added a comment - In fact, I think findTargetArc isn't great in this regard; e.g. I think MemoryPF only uses this API if the caller calls seekExact? So I think the current location of the assert is both more costly and lower coverage than if we moved it to FST.getBytesReader. I am not sure if that is true actually since findTargetArc is the only place where we actually use this cache? I mean I don't wanna be in the way of moving it but what kind of message are we sending here. We don't guarantee that asserts are fast but they won't slow you down massively? I mean we can call this assert in the getBytesReader as well to increase coverage, can you elaborate what you are concerned about?
        Hide
        Michael McCandless added a comment -

        can you elaborate what you are concerned about?

        I'm worried about the O(N^2) cost of the assert: for every arc (single
        byte of each term in a seekExact) we are iterating over all root arcs
        (up to 256 arcs) in this assert.

        findTargetArc is the only place where we actually use this cache?

        Ahh that's true, I hadn't realized that.

        Maybe, instead, we can move the assert just inside the if that
        actually uses the cached arcs? Ie, put it here:

          if (follow.target == startNode && labelToMatch < cachedRootArcs.length) {
            assert assertRootArcs();
            ...
          }
        

        This would address my concern: the cost becomes O(N) not O(N^2). And
        the coverage is the same?

        Show
        Michael McCandless added a comment - can you elaborate what you are concerned about? I'm worried about the O(N^2) cost of the assert: for every arc (single byte of each term in a seekExact) we are iterating over all root arcs (up to 256 arcs) in this assert. findTargetArc is the only place where we actually use this cache? Ahh that's true, I hadn't realized that. Maybe, instead, we can move the assert just inside the if that actually uses the cached arcs? Ie, put it here: if (follow.target == startNode && labelToMatch < cachedRootArcs.length) { assert assertRootArcs(); ... } This would address my concern: the cost becomes O(N) not O(N^2). And the coverage is the same?
        Hide
        Simon Willnauer added a comment -

        This would address my concern: the cost becomes O(N) not O(N^2). And the coverage is the same?

        The problem here is that we really need to check after we returned from the cache and that might be the case only once in a certain test. Yet, I think it's OK to do it there. I still don't get what you are concerned of we only have -ea in tests and the tests don't seem to be any slower? Can you elaborate what you are afraid of?

        Show
        Simon Willnauer added a comment - This would address my concern: the cost becomes O(N) not O(N^2). And the coverage is the same? The problem here is that we really need to check after we returned from the cache and that might be the case only once in a certain test. Yet, I think it's OK to do it there. I still don't get what you are concerned of we only have -ea in tests and the tests don't seem to be any slower? Can you elaborate what you are afraid of?
        Hide
        Michael McCandless added a comment -

        Can you elaborate what you are afraid of?

        In general I think it's bad if an assert changes too much how the code
        would run without asserts. E.g., maybe this O(N^2) assert alters how
        threads are scheduled and changes how / whether an issue appears in
        practice.

        Similarly, if a user is having trouble, I'll recommend turning on
        asserts to see if one trips, but if this causes a change in how the
        code runs then this can change whether the issue reproduces.

        I also just don't like O(N^2) code, even when it's under an assert

        I think asserts should minimize their impact to the real code when
        possible, and it certainly seems possible in this case.

        Separately, we really should run our tests w/o asserts, too, since
        this is how our users typically run (I know some tests fail if
        assertions are off ... we'd have to fix them). What if we accidentally
        commit "real" code behind an assert? Our tests wouldn't catch it ...

        Show
        Michael McCandless added a comment - Can you elaborate what you are afraid of? In general I think it's bad if an assert changes too much how the code would run without asserts. E.g., maybe this O(N^2) assert alters how threads are scheduled and changes how / whether an issue appears in practice. Similarly, if a user is having trouble, I'll recommend turning on asserts to see if one trips, but if this causes a change in how the code runs then this can change whether the issue reproduces. I also just don't like O(N^2) code, even when it's under an assert I think asserts should minimize their impact to the real code when possible, and it certainly seems possible in this case. Separately, we really should run our tests w/o asserts, too, since this is how our users typically run (I know some tests fail if assertions are off ... we'd have to fix them). What if we accidentally commit "real" code behind an assert? Our tests wouldn't catch it ...
        Hide
        Michael McCandless added a comment -

        This is what I have in mind ... just moving the assert inside the if that actually uses the cached root arcs.

        Show
        Michael McCandless added a comment - This is what I have in mind ... just moving the assert inside the if that actually uses the cached root arcs.
        Hide
        Michael McCandless added a comment -

        I plan to commit that last patch soon...

        Show
        Michael McCandless added a comment - I plan to commit that last patch soon...
        Hide
        Simon Willnauer added a comment -

        sorry for the late reply busy times... +1 to commit

        Show
        Simon Willnauer added a comment - sorry for the late reply busy times... +1 to commit
        Hide
        ASF subversion and git services added a comment -

        Commit 1514520 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1514520 ]

        LUCENE-5152: make assert less costly

        Show
        ASF subversion and git services added a comment - Commit 1514520 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1514520 ] LUCENE-5152 : make assert less costly
        Hide
        ASF subversion and git services added a comment -

        Commit 1514521 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1514521 ]

        LUCENE-5152: make assert less costly

        Show
        ASF subversion and git services added a comment - Commit 1514521 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1514521 ] LUCENE-5152 : make assert less costly
        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close

          People

          • Assignee:
            Unassigned
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development