Lucene - Core
LUCENE-5977

IW should safeguard against token streams returning invalid offsets for multi-valued fields

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.9, 4.9.1, 4.10, 4.10.1
    • Fix Version/s: 4.10.2, 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We have a custom token stream that emits information about the offsets of each token. My (wrong) assumption was that for multi-valued fields a token stream's offset information is magically shifted, much as is the case with positions. It's not: offsets must be monotonically increasing across all instances of a field, even when custom token streams are used. So, something like this:

              doc.add(new Field("field-foo", new CannedTokenStream(token("bar", 1, 150, 160)), ftype));
              doc.add(new Field("field-foo", new CannedTokenStream(token("bar", 1,  50,  60)), ftype));
      

      where the token function is defined as:

      token(String image, int positionIncrement, int startOffset, int endOffset)
      

      will result in either a cryptic assertion thrown from IW:

      Exception in thread "main" java.lang.AssertionError
      	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeOffsets(FreqProxTermsWriterPerField.java:99)
      

      or nothing (or a codec error) if run without assertions.

      Obviously returning non-shifted offsets from subsequent token streams makes little sense, but I wonder if it could be made more explicit (or asserted) that offsets need to be increasing between multiple values. At a minimum, some documentation should be added to OffsetAttribute. I don't know whether offsets should be shifted automatically, as is the case with positions – that would change the semantics of existing tokenizers and filters which already implement such shifting internally.
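To make the contract concrete: when a stream emits offsets relative to each field value, the application has to do the shifting itself before handing tokens to the indexer. A minimal sketch of that bookkeeping, using invented names (OffsetShifter is not a Lucene class, just an illustration of the monotonicity requirement described above):

```java
// Hypothetical helper: tracks a running base offset so that offsets from
// successive values of a multi-valued field stay monotonically increasing.
class OffsetShifter {
  private int base = 0; // cumulative length of all previous field values

  // Absolute (shifted) offset for a value-relative token offset.
  int shift(int relativeOffset) {
    return base + relativeOffset;
  }

  // Call once after emitting the last token of a value, with that value's
  // total character length, before starting the next value.
  void endValue(int valueLength) {
    base += valueLength;
  }
}
```

With this kind of bookkeeping, the second value's token at relative offset 50 would be reported at 50 plus the first value's length, instead of going backwards past the first value's offsets.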

        Activity

        Dawid Weiss added a comment - edited

        A full example which shows the problem. Run it with -ea – you'll get the assertion. Run it without assertions and it'll pass.

        import org.apache.lucene.analysis.Token;
        import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
        import org.apache.lucene.codecs.Codec;
        import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.FieldType;
        import org.apache.lucene.index.FieldInfo.IndexOptions;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.RAMDirectory;
        import org.apache.lucene.util.Version;
        import org.apache.lucene.analysis.CannedTokenStream;
        
        public class OffsetIndexingBug {
          
          public static void main(String[] args) throws Exception {
            Codec.setDefault(new SimpleTextCodec());
            Version version = Version.LUCENE_CURRENT;
            IndexWriterConfig conf = new IndexWriterConfig(version, new WhitespaceAnalyzer(version));
            conf.setUseCompoundFile(false);
            try (Directory d = new RAMDirectory()) {
              try (IndexWriter iw = new IndexWriter(d, conf)) {
                Document doc = new Document();
        
                FieldType ftype = new FieldType();
                ftype.setIndexed(true);
                ftype.setStored(false);
                ftype.setOmitNorms(true);
                
                ftype.setStoreTermVectors(true);
                ftype.setStoreTermVectorPositions(true);
                ftype.setStoreTermVectorOffsets(true);
        
                ftype.setTokenized(true);
                ftype.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
                ftype.freeze();
        
                // note "use"'s offset is negative with respect to the first field value. 
                doc.add(new Field("field-foo", new CannedTokenStream(token("use", 1, 150, 160)), ftype));
                doc.add(new Field("field-foo", new CannedTokenStream(token("use", 1,  50,  60)), ftype));
                iw.addDocument(doc);
              }
            }
          }
        
          private static Token token(String image, int positionIncrement, int soffset, int eoffset) {
            Token t = new Token();
            t.setPositionIncrement(positionIncrement);
            t.setOffset(soffset, eoffset);
            t.append(image);
            return t;
          }
        }
        
        Dawid Weiss added a comment -

        I looked a bit more closely at the logic of updating offsets in the DefaultIndexingChain, and I see that offsets are propagated across multiple field instances here:

                // trigger streams to perform end-of-stream operations
                stream.end();
        
                // TODO: maybe add some safety? then again, its already checked 
                // when we come back around to the field...
                invertState.position += invertState.posIncrAttribute.getPositionIncrement();
                invertState.offset += invertState.offsetAttribute.endOffset();
        

        So the problem with my implementation was that it should have set the offsets properly in end(). I still feel this should be verified or asserted more cleanly somehow, so I'll leave this issue open for suggestions.
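The accumulation above can be modeled in a few lines, with invented names (this is not Lucene's actual FieldInvertState, only an illustration of why end() must leave the value's final end offset in the OffsetAttribute):

```java
// Toy model of the indexing chain's per-field offset accumulation: after each
// value's stream is exhausted, the end offset reported via end() is added to a
// running base, so later values' token offsets are interpreted relative to it.
// A stream that leaves a stale offset behind in end() corrupts later values.
class InvertStateModel {
  int offset = 0; // running base, like invertState.offset in the snippet above

  // endOffsetFromEnd: what the stream's offset attribute holds after end().
  void finishValue(int endOffsetFromEnd) {
    offset += endOffsetFromEnd;
  }

  // Absolute offset of a token in the current value.
  int absolute(int tokenStartOffset) {
    return offset + tokenStartOffset;
  }
}
```

In the bug's scenario, the first value ends at 160, so the second value's token at start offset 50 should land at 210; a stream that reports the wrong end offset (or raw absolute offsets) breaks this arithmetic.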

        Robert Muir added a comment -

        The bug is actually older than 4.9: it existed in the previous indexing chain, too.

        I ran the test program against 4.8; even with AssertingCodec, it just silently "succeeds".

        But if you add a CheckIndex call:

        Segments file=segments_1 numSegments=1 version=4.8 format=
          1 of 1: name=_0 docCount=1
            codec=Asserting
            compound=false
            numFiles=12
            size (MB)=0.001
            diagnostics = {timestamp=1411990327375, os=Linux, os.version=3.13.0-24-generic, source=flush, lucene.version=4.8-SNAPSHOT, os.arch=amd64, java.version=1.7.0_55, java.vendor=Oracle Corporation}
            no deletions
            test: open reader.........OK
            test: check integrity.....OK
            test: check live docs.....OK
            test: fields..............OK [1 fields]
            test: field norms.........OK [0 fields]
            test: terms, freq, prox...ERROR: java.lang.RuntimeException: term [75 73 65]: doc 0: pos 1: startOffset -2147483597 is out of bounds
        java.lang.RuntimeException: term [75 73 65]: doc 0: pos 1: startOffset -2147483597 is out of bounds
        	at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:944)
        	at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1278)
        	at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:626)
        	at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
        	at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
        	at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)
            test: stored fields.......OK [0 total field count; avg 0 fields per doc]
            test: term vectors........ERROR [vector term=[75 73 65] field=field-foo doc=0: startOffset=51 differs from postings startOffset=-2147483597]
        java.lang.RuntimeException: vector term=[75 73 65] field=field-foo doc=0: startOffset=51 differs from postings startOffset=-2147483597
        	at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1748)
        	at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:632)
        	at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
        	at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
        	at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)
            test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]
        FAILED
            WARNING: fixIndex() would remove reference to this segment; full exception:
        java.lang.RuntimeException: Term Index test failed
        	at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:641)
        	at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
        	at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
        	at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)
        
        WARNING: 1 broken segments (containing 1 documents) detected
        
        Robert Muir added a comment -

        The bug is that some validation checks rely on lastStartOffset/lastPosition, but these are currently reset to 0 on each field instance.

        We should instead track it across all instances of the field. I converted Dawid's test to a case in TestPostingsOffsets and moved these two state variables to FieldInvertState.
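The idea of the fix can be sketched as follows, in simplified form with hypothetical names (not the actual Lucene code): the last seen start offset lives in per-field state that survives across all instances of the field within a document, so a later value whose absolute start offset goes backwards is rejected instead of silently producing a broken index.

```java
// Sketch of the validation idea: lastStartOffset persists across all
// instances of a field in a document rather than resetting per instance.
class OffsetValidator {
  private int lastStartOffset = 0; // survives across field instances

  void checkToken(int absoluteStartOffset, int absoluteEndOffset) {
    if (absoluteStartOffset < lastStartOffset) {
      throw new IllegalArgumentException(
          "startOffset must not go backwards: " + absoluteStartOffset
          + " < " + lastStartOffset);
    }
    if (absoluteEndOffset < absoluteStartOffset) {
      throw new IllegalArgumentException("endOffset < startOffset");
    }
    lastStartOffset = absoluteStartOffset;
  }
}
```

Applied to the canned tokens from the report (150–160 in the first value, then 50–60 in the second), the second token now fails fast at index time rather than surfacing later as a cryptic assertion or a corrupt segment.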

        Adrien Grand added a comment -

        +1

        ASF subversion and git services added a comment -

        Commit 1628192 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1628192 ]

        LUCENE-5977: Fix tokenstream safety checks in IndexWriter to work across multi-valued fields

        ASF subversion and git services added a comment -

        Commit 1628196 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1628196 ]

        LUCENE-5977: Fix tokenstream safety checks in IndexWriter to work across multi-valued fields

        ASF subversion and git services added a comment -

        Commit 1628200 from Robert Muir in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1628200 ]

        LUCENE-5977: Fix tokenstream safety checks in IndexWriter to work across multi-valued fields

        Robert Muir added a comment -

        Thank you for debugging this Dawid!


  People

    • Assignee: Unassigned
    • Reporter: Dawid Weiss
    • Votes: 0
    • Watchers: 3
