Uploaded image for project: 'UIMA'
  1. UIMA
  2. UIMA-3399

More consistent handling of multiple add-to-index behavior for same Feature Structure

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.7.0SDK
    • Component/s: None
    • Labels:
      None

      Description

      UIMA has a somewhat unusual indexing architecture. You can define indexes (sorted, bag, set), and then add / remove a feature structure (FS) to all of the defined indexes.

      The design intention (I think) was to support the concept of a FS being indexed, or not. However, the current design allows some anomalies that behave inconsistently between code being run "locally", versus as remote services (due to how serialization handles this). Serialization encodes only the concept of a FS being either in an index or not.

      The problem arises in the edge case where the same identical FS is added to the indexes multiple times. For local (non-remote) cases, for bag and sorted indexes, the same exact FS would be added multiple times. This would have the consequences:

      • Iterating would return multiple == FSs.
      • Remove from indexes of a multiply-added FS would reduce the number by 1; the FS would still be in the index unless the last remaining one was removed..

      For the same code, running remotely, serialization would have "collapsed" the multiple additions into one, so would behave differently.

      This Jira changes the behavior of "add-to-index" so that subsequent add-to-indexes of a same identical FS would be a no-op. To cover users who might be exploiting the old behavior, the JVM property "uima.allow_duplicate_add_to_indices", read when the UIMA classes are loaded, would restore the previous behavior.

      Note that with this change, the UIMA "Set" index still has a distinct purpose , separate from the "Bag" index, because it defines Feature Structure equivalence based not on identity, but rather on specified key feature values being equal.

      This change better aligns how code running locally or remotely works.

        Activity

        Hide
        rec Richard Eckart de Castilho added a comment -

        I actually did update before testing when you asked me, but the server apparently didn't offer me the right revision at the time. I posted the rev because I know that sometimes happens

        Updating again to rev 1653620, it works! Thanks!

        Show
        rec Richard Eckart de Castilho added a comment - I actually did update before testing when you asked me, but the server apparently didn't offer me the right revision at the time. I posted the rev because I know that sometimes happens Updating again to rev 1653620, it works! Thanks!
        Hide
        schor Marshall Schor added a comment -

        hmmm, the revision level should be 1653587 is the above a typo? If not , can you update to that and try?

        Show
        schor Marshall Schor added a comment - hmmm, the revision level should be 1653587 is the above a typo? If not , can you update to that and try?
        Hide
        rec Richard Eckart de Castilho added a comment -

        Tried with rev 1653378 - didn't seem to help.

        moveTo(fs) in LeafPointerIterator:967 still updates the itPos from 66 to 531.
        After that index.isValid() still returns false.

        Since I'm unable to produce a minimal test case, maybe I can invite you to test in-situ.

        You could check out https://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/trunk in your Eclipse and debug the test de.tudarmstadt.ukp.dkpro.core.io.tcf.TcfReaderWriterTest.test1() in the Maven module de.tudarmstadt.ukp.dkpro.core.io.tcf-asl. Setting a break-point in FSIndexRepositoryImpl line 967 takes you directly to the problematic location.

        Hm... an alternative might be that I serialize the CAS in which this problem occurs and try to build a test-case based on that....

        Any preferences?

        Show
        rec Richard Eckart de Castilho added a comment - Tried with rev 1653378 - didn't seem to help. moveTo(fs) in LeafPointerIterator:967 still updates the itPos from 66 to 531. After that index.isValid() still returns false. Since I'm unable to produce a minimal test case, maybe I can invite you to test in-situ. You could check out https://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/trunk in your Eclipse and debug the test de.tudarmstadt.ukp.dkpro.core.io.tcf.TcfReaderWriterTest.test1() in the Maven module de.tudarmstadt.ukp.dkpro.core.io.tcf-asl. Setting a break-point in FSIndexRepositoryImpl line 967 takes you directly to the problematic location. Hm... an alternative might be that I serialize the CAS in which this problem occurs and try to build a test-case based on that.... Any preferences?
        Hide
        schor Marshall Schor added a comment -

        Ok, fixed under uima-4192 - please update to head and try (Fingers crossed)...

        Show
        schor Marshall Schor added a comment - Ok, fixed under uima-4192 - please update to head and try (Fingers crossed)...
        Hide
        schor Marshall Schor added a comment -

        found a bunch more issues with offset handling - working on the fixes..

        Show
        schor Marshall Schor added a comment - found a bunch more issues with offset handling - working on the fixes..
        Hide
        rec Richard Eckart de Castilho added a comment -

        I tried, but as you noted - breaks other things. Nasty little problem. Thanks for looking into it!

        Show
        rec Richard Eckart de Castilho added a comment - I tried, but as you noted - breaks other things. Nasty little problem. Thanks for looking into it!
        Hide
        schor Marshall Schor added a comment -

        well, that broke other things... investigating...

        Show
        schor Marshall Schor added a comment - well, that broke other things... investigating...
        Hide
        schor Marshall Schor added a comment -

        right you are - I was never any good with integer arithmetic... Looks like the IntBitSet get() method is wrong too - it should return its argument with no adjustment I think. Can you try:

          @Override
          public int get(int position) {
            assert(set.get(position - offset));
            return position;
          }
        // and
          @Override
          public boolean isValid(int position) {
            return (position >= 0) && set.get(position - offset);
          }
        
        Show
        schor Marshall Schor added a comment - right you are - I was never any good with integer arithmetic... Looks like the IntBitSet get() method is wrong too - it should return its argument with no adjustment I think. Can you try: @Override public int get( int position) { assert (set.get(position - offset)); return position; } // and @Override public boolean isValid( int position) { return (position >= 0) && set.get(position - offset); }
        Hide
        rec Richard Eckart de Castilho added a comment - - edited

        I tried adding and substracting, but both does not fix the problem. Substracting breaks the code at a different place and adding doesn't improve.

        The original iterator had a valid position of 66. The copy is "moveTo"-ed to 531 which is invalid. I suspect that 531 is the CAS address of the annotation and 66 is the internal offset-index position of the annotation. I think the problem is there somewhere, but have no idea what kind address should be used where.

        Show
        rec Richard Eckart de Castilho added a comment - - edited I tried adding and substracting, but both does not fix the problem. Substracting breaks the code at a different place and adding doesn't improve. The original iterator had a valid position of 66. The copy is "moveTo"-ed to 531 which is invalid. I suspect that 531 is the CAS address of the annotation and 66 is the internal offset-index position of the annotation. I think the problem is there somewhere, but have no idea what kind address should be used where.
        Hide
        rec Richard Eckart de Castilho added a comment -

        If I understand your comment correctly, then I believe your diagnosis is wrong.The respective line currently gets a position that is loo big (e.g. 531 instead of 66). Adding the offset to that would not work. If at all, then the offset would need to be subtracted.

        My bet would be that find(66) should return 66 instead of 531.

        Show
        rec Richard Eckart de Castilho added a comment - If I understand your comment correctly, then I believe your diagnosis is wrong.The respective line currently gets a position that is loo big (e.g. 531 instead of 66). Adding the offset to that would not work. If at all, then the offset would need to be subtracted. My bet would be that find(66) should return 66 instead of 531.
        Hide
        schor Marshall Schor added a comment -

        Richard, excellent debugging! Thank you very much. You are correct that the isValid() operation is wrong. In the Class IntBitSet, all accesses to the underlying bit set should be done after adjusting the incoming key with the offset. This was missed in the isValid(). method (and in IntBitSet's get() method within an assert(...)). I'll fix this. You can try yourself before I commit the fix, change line 276 of IntBitSet to
        return (position >= 0) && set.get(position + offset); .. please post if that works.

        Show
        schor Marshall Schor added a comment - Richard, excellent debugging! Thank you very much. You are correct that the isValid() operation is wrong. In the Class IntBitSet, all accesses to the underlying bit set should be done after adjusting the incoming key with the offset. This was missed in the isValid(). method (and in IntBitSet's get() method within an assert(...)). I'll fix this. You can try yourself before I commit the fix, change line 276 of IntBitSet to return (position >= 0) && set.get(position + offset); .. please post if that works.
        Hide
        rec Richard Eckart de Castilho added a comment - - edited

        Looking at this.index I didn't think it'd get us much further. So instead of posting its value, I've tried to track this down further and modified the LeafPointerIterator locally as follows allowing me to step into the index.isValid() method after the moving:

        Class: FSIndexRepositoryImpl.LeafPointerIterator
        
        965:    private LeafPointerIterator(IndexIteratorCachePair iicp, int fs) {
        966:      this(iicp);
        967:      moveTo(fs);
          +:      index.isValid();
        968:    }
        

        Stepping this through shows that index.isValid() remains true after line 996 but becomes false after line 967. So the problem appears to be triggered in moveTo(fs).

        Looking what happens in moveTo, I found that it eventually updates the index position at

        Class: FSBagIndex.IntVectorIterator.moveTo(int)
        
        145:    public void moveTo(int i) {
        146:      this.itPos = find(i);
        

        The position passed here as i = 531.
        The index uses a PositiveIntSet_impl with offset = 465 and set = {66}.
        Line 146 discovers that position 531 (= offset + set[0]) is in the index and sets the itPos to from previously 66 to 531, the value returned from find(66). Maybe find should have returned 66 instead of 531?

        When calling index.isValid() now, the flow eventually ends up in IntBitSet.isValid(p) with p = 531. However, set.get(531) returns false because set.get() expects a value relative to the offset. set.get(66) returns true.

        I wonder if IntBitSet should throw an IndexOutOfBoundsException when accessing a position > offset + set[size - 1].

        Show
        rec Richard Eckart de Castilho added a comment - - edited Looking at this.index I didn't think it'd get us much further. So instead of posting its value, I've tried to track this down further and modified the LeafPointerIterator locally as follows allowing me to step into the index.isValid() method after the moving: Class: FSIndexRepositoryImpl.LeafPointerIterator 965: private LeafPointerIterator(IndexIteratorCachePair iicp, int fs) { 966: this(iicp); 967: moveTo(fs); +: index.isValid(); 968: } Stepping this through shows that index.isValid() remains true after line 996 but becomes false after line 967 . So the problem appears to be triggered in moveTo(fs) . Looking what happens in moveTo , I found that it eventually updates the index position at Class: FSBagIndex.IntVectorIterator.moveTo(int) 145: public void moveTo(int i) { 146: this.itPos = find(i); The position passed here as i = 531 . The index uses a PositiveIntSet_impl with offset = 465 and set = {66} . Line 146 discovers that position 531 (= offset + set [0] ) is in the index and sets the itPos to from previously 66 to 531 , the value returned from find(66) . Maybe find should have returned 66 instead of 531 ? When calling index.isValid() now, the flow eventually ends up in IntBitSet.isValid(p) with p = 531 . However, set.get(531) returns false because set.get() expects a value relative to the offset. set.get(66) returns true . I wonder if IntBitSet should throw an IndexOutOfBoundsException when accessing a position > offset + set[size - 1] .
        Hide
        schor Marshall Schor added a comment -

        ok, I'll see if I can reproduce ... Can you use Eclipse step-into (PF5) at
        953: moveTo(fs);
        and then pretty print the value of this.index (in my version, line 1013)?

        Show
        schor Marshall Schor added a comment - ok, I'll see if I can reproduce ... Can you use Eclipse step-into (PF5) at 953: moveTo(fs); and then pretty print the value of this.index (in my version, line 1013)?
        Hide
        rec Richard Eckart de Castilho added a comment -

        It is interesting to note that the builds on my Jenkins fail at this point no matter whether I set ALLOW_DUP_ADD_TO_INDEXES or not (I tried setting it programmatically in the unit tests which helps on my workstation). Not sure if that may be due to snapshots with the UIMA-4187 not having hit the repositories yet. But anyway, I'd consider my local test in Eclipse the more up-to-date one.

        Show
        rec Richard Eckart de Castilho added a comment - It is interesting to note that the builds on my Jenkins fail at this point no matter whether I set ALLOW_DUP_ADD_TO_INDEXES or not (I tried setting it programmatically in the unit tests which helps on my workstation). Not sure if that may be due to snapshots with the UIMA-4187 not having hit the repositories yet. But anyway, I'd consider my local test in Eclipse the more up-to-date one.
        Hide
        rec Richard Eckart de Castilho added a comment - - edited

        I'm running in Eclipse with a local checked out version of uimaj-core. Just updated the checkout to make sure it's the latest version (rev 1653378) and tested with and without ALLOW_DUP_ADD_TO_INDEXES - yes, I'm pretty sure I'm testing against the latest code.

        Show
        rec Richard Eckart de Castilho added a comment - - edited I'm running in Eclipse with a local checked out version of uimaj-core. Just updated the checkout to make sure it's the latest version (rev 1653378) and tested with and without ALLOW_DUP_ADD_TO_INDEXES - yes, I'm pretty sure I'm testing against the latest code.
        Hide
        schor Marshall Schor added a comment -

        sorry to ask, but there was a bug fix (Jira https://issues.apache.org/jira/browse/UIMA-4187 ) which could cause this issue. Are you running with a build that is after this fix was put in?

        Show
        schor Marshall Schor added a comment - sorry to ask, but there was a bug fix (Jira https://issues.apache.org/jira/browse/UIMA-4187 ) which could cause this issue. Are you running with a build that is after this fix was put in?
        Hide
        rec Richard Eckart de Castilho added a comment -

        Further debugging in org.apache.uima.cas.impl.FSIndexRepositoryImpl.LeafPointerIterator.LeafPointerIterator(FSIndexRepositoryImpl, IndexIteratorCachePair, int):

        org.apache.uima.cas.impl.FSIndexRepositoryImpl.LeafPointerIterator.copy()
        996:    public Object copy() {
        997:      // If this.isValid(), return a copy pointing to the same element.
        998:     if (this.isValid()) {
        999:        return new LeafPointerIterator(this.iicp, this.get());
        
        org.apache.uima.cas.impl.FSIndexRepositoryImpl.LeafPointerIterator.LeafPointerIterator(FSIndexRepositoryImpl, IndexIteratorCachePair, int)
        951:    private LeafPointerIterator(IndexIteratorCachePair iicp, int fs) {
        952:      this(iicp);
        953:      moveTo(fs);
        954:    }
        

        After 952, index.isValid() is still true, but after 953, index.isValid() becomes false.

        It looks definitely like a bug: the copy of a valid iterator becomes invalid, but it should remain valid.

        Show
        rec Richard Eckart de Castilho added a comment - Further debugging in org.apache.uima.cas.impl.FSIndexRepositoryImpl.LeafPointerIterator.LeafPointerIterator(FSIndexRepositoryImpl, IndexIteratorCachePair, int): org.apache.uima.cas.impl.FSIndexRepositoryImpl.LeafPointerIterator.copy() 996: public Object copy() { 997: // If this.isValid(), return a copy pointing to the same element. 998: if (this.isValid()) { 999: return new LeafPointerIterator(this.iicp, this.get()); org.apache.uima.cas.impl.FSIndexRepositoryImpl.LeafPointerIterator.LeafPointerIterator(FSIndexRepositoryImpl, IndexIteratorCachePair, int) 951: private LeafPointerIterator(IndexIteratorCachePair iicp, int fs) { 952: this(iicp); 953: moveTo(fs); 954: } After 952, index.isValid() is still true, but after 953, index.isValid() becomes false. It looks definitely like a bug: the copy of a valid iterator becomes invalid, but it should remain valid.
        Hide
        rec Richard Eckart de Castilho added a comment -

        I'm looking into this in more detail. Unfortunately, I didn't manage yet to set up a minimal test case. However, what I found is that at some point, I have an FSIterator "i" with the following configuration (toString()).

        FSIteratorWrapper [it=LeafPointerIterator [iicp=IndexIteratorCachePair, index=FSLeafIndexImpl [type=de.tudarmstadt.ukp.dkpro.core.api.coref.type.CoreferenceChain, kind=Default Bag]
          cache 0  FSLeafIndexImpl [type=de.tudarmstadt.ukp.dkpro.core.api.coref.type.CoreferenceChain, kind=Default Bag]
        , index=org.apache.uima.cas.impl.FSBagIndex$IntVectorIterator@39dcf4b0]]
        

        This iterator is valid (i.isValid() returns true), but a copy of the iterator is no longer valid (i.copy().isValid() returns false) - it does not happen when ALLOW_DUP_ADD_TO_INDEXES to "true" but consistently happens when ALLOW_DUP_ADD_TO_INDEXES is not set.

        Apparently, this does not happen when I create a minimal CAS with minimal annotations of my CoreferenceChain type. But it happens in one situation where the CAS already contains all kinds of other annotations.

        Show
        rec Richard Eckart de Castilho added a comment - I'm looking into this in more detail. Unfortunately, I didn't manage yet to set up a minimal test case. However, what I found is that at some point, I have an FSIterator "i" with the following configuration (toString()). FSIteratorWrapper [it=LeafPointerIterator [iicp=IndexIteratorCachePair, index=FSLeafIndexImpl [type=de.tudarmstadt.ukp.dkpro.core.api.coref.type.CoreferenceChain, kind=Default Bag] cache 0 FSLeafIndexImpl [type=de.tudarmstadt.ukp.dkpro.core.api.coref.type.CoreferenceChain, kind=Default Bag] , index=org.apache.uima.cas.impl.FSBagIndex$IntVectorIterator@39dcf4b0]] This iterator is valid (i.isValid() returns true), but a copy of the iterator is no longer valid (i.copy().isValid() returns false) - it does not happen when ALLOW_DUP_ADD_TO_INDEXES to "true" but consistently happens when ALLOW_DUP_ADD_TO_INDEXES is not set. Apparently, this does not happen when I create a minimal CAS with minimal annotations of my CoreferenceChain type. But it happens in one situation where the CAS already contains all kinds of other annotations.
        Hide
        schor Marshall Schor added a comment -

        If I understand you correctly, it should be OK, without the allow_dup_add_to_indexes. The thing that would not be ok is to have the same identical ( in the == sense) feature structure added to the index multiple times. Can you post a test case that fails?

        Show
        schor Marshall Schor added a comment - If I understand you correctly, it should be OK, without the allow_dup_add_to_indexes. The thing that would not be ok is to have the same identical ( in the == sense) feature structure added to the index multiple times. Can you post a test case that fails?
        Hide
        rec Richard Eckart de Castilho added a comment -

        I'm running into an unexpected situation with the changes from this issue (I suppose).

        I have modelled a linked list with an explicit head pointer in the following way:

        Head extends AnnotationBase {
          Link first;
        }
        
        Link extends Annotation {
          Link next;
        }
        

        With the new behavior, apparently there is a problem having multiple Heads to the CAS. When I set ALLOW_DUP_ADD_TO_INDEXES to "true", all is fine, but with the new behavior, I get failing unit tests in DKPro Core.

        I thought that UIMA creates a bag index by default and adding multiple Head's to a bag index should be ok, right?

        Show
        rec Richard Eckart de Castilho added a comment - I'm running into an unexpected situation with the changes from this issue (I suppose). I have modelled a linked list with an explicit head pointer in the following way: Head extends AnnotationBase { Link first; } Link extends Annotation { Link next; } With the new behavior, apparently there is a problem having multiple Heads to the CAS. When I set ALLOW_DUP_ADD_TO_INDEXES to "true", all is fine, but with the new behavior, I get failing unit tests in DKPro Core. I thought that UIMA creates a bag index by default and adding multiple Head's to a bag index should be ok, right?
        Hide
        schor Marshall Schor added a comment -

        Changed this to an "improvement" and changed the description to what's being implemented.

        Show
        schor Marshall Schor added a comment - Changed this to an "improvement" and changed the description to what's being implemented.
        Hide
        schor Marshall Schor added a comment -

        The answer to Richard's two questions above: 1) Yes the current serialization does include non-indexed FS reachable via references from some indexed FS. and 2) the current implementation APIs only support adding to all indexes (in a view); you can't add to some index and not others within one view. Views have separate sets of what's indexed in them, and each one is separately handled in the remote case.

        Show
        schor Marshall Schor added a comment - The answer to Richard's two questions above: 1) Yes the current serialization does include non-indexed FS reachable via references from some indexed FS. and 2) the current implementation APIs only support adding to all indexes (in a view); you can't add to some index and not others within one view. Views have separate sets of what's indexed in them, and each one is separately handled in the remote case.
        Hide
        schor Marshall Schor added a comment -

        After some discussion (on the mailing list), there's a basic question of whether or not there is a use for adding a FS to the indexes, multiple times.

        UIMA does explicitly offer "Bag" and "Set" indexes, with a bit of complicated notion for Set: It has the normal semantics that a FS won't be added to a Set index multiple times, if another FS (or the identical FS) that "matches" (according to the Set index's particular keys) is already indexed.

        The Bag is there to allow the same FS to be indexed multiple times. Is this a used/useful feature? Here's one possible use case: suppose you're doing some kind of "sampling" to feed a machine-learning algorithm, and need to "oversample" some particular type - you could define a bag indexes and do multiple inserts of the same FS into that bag, for use by a down-stream annotator.

        Show
        schor Marshall Schor added a comment - After some discussion (on the mailing list), there's a basic question of whether or not there is a use for adding a FS to the indexes, multiple times. UIMA does explicitly offer "Bag" and "Set" indexes, with a bit of complicated notion for Set: It has the normal semantics that a FS won't be added to a Set index multiple times, if another FS (or the identical FS) that "matches" (according to the Set index's particular keys) is already indexed. The Bag is there to allow the same FS to be indexed multiple times. Is this a used/useful feature? Here's one possible use case: suppose you're doing some kind of "sampling" to feed a machine-learning algorithm, and need to "oversample" some particular type - you could define a bag indexes and do multiple inserts of the same FS into that bag, for use by a down-stream annotator.
        Hide
        rec Richard Eckart de Castilho added a comment -

        I seem to remember that one of the reasons to post to the mailing list instead of to Jira was, because it was easier to discuss things nevertheless

        • +1 for a delete/re-add behavior
        • +0 with a tendency to -1 for reinstating the old behavior - if there was a flag introduced, a version should be defined that the flag as well as the backwards-compatibility code is removed

        General questions:

        • what happens to FSes which are only reachable by other FSes and not indexed at all? Does the remote-case cover that?
        • what happens if an FS is added to some indexes but not to all? Does the remote-case cover that?
        Show
        rec Richard Eckart de Castilho added a comment - I seem to remember that one of the reasons to post to the mailing list instead of to Jira was, because it was easier to discuss things nevertheless +1 for a delete/re-add behavior +0 with a tendency to -1 for reinstating the old behavior - if there was a flag introduced, a version should be defined that the flag as well as the backwards-compatibility code is removed General questions: what happens to FSes which are only reachable by other FSes and not indexed at all? Does the remote-case cover that? what happens if an FS is added to some indexes but not to all? Does the remote-case cover that?

          People

          • Assignee:
            schor Marshall Schor
            Reporter:
            schor Marshall Schor
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development