Lucene - Core
  1. Lucene - Core
  2. LUCENE-4479

TokenSources.getTokenStream() doesn't return correctly for termvectors with positions but no offsets

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, Trunk
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The javadocs for TokenSources.getTokenStream(Terms, boolean) state:

      "Low level api. Returns a token stream or null if no offset info available
      in index. This can be used to feed the highlighter with a pre-parsed token
      stream"

      However, if the Terms instance passed in has positions but no offsets stored, a TokenStream is incorrectly returned, rather than null.

      This has the effect of incorrectly highlighting fields with term vectors and positions, but no offsets. All highlighting markup is prepended to the beginning of the field.

      1. LUCENE-4479.patch
        13 kB
        Alan Woodward
      2. LUCENE-4479.patch
        12 kB
        Alan Woodward
      3. LUCENE-4479.patch
        12 kB
        Alan Woodward
      4. LUCENE-4479.patch
        3 kB
        Alan Woodward

        Activity

        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Alan Woodward
        http://svn.apache.org/viewvc?view=revision&revision=1400505

        LUCENE-4479: Fix highlighter on index w/ positions but no offsets

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Alan Woodward http://svn.apache.org/viewvc?view=revision&revision=1400505 LUCENE-4479 : Fix highlighter on index w/ positions but no offsets
        Hide
        Alan Woodward added a comment -

        trunk: 1400504
        branch_4x: 1400505

        Show
        Alan Woodward added a comment - trunk: 1400504 branch_4x: 1400505
        Hide
        Alan Woodward added a comment -

        Patch, fixing javadocs errors. Ant precommit runs successfully.

        Show
        Alan Woodward added a comment - Patch, fixing javadocs errors. Ant precommit runs successfully.
        Hide
        Robert Muir added a comment -

        with the current patch, 'ant precommit' fails:

        ...
        -ecj-javadoc-lint-src:
         [ecj-lint] Compiling 45 source files
         [ecj-lint] ----------
         [ecj-lint] 1. ERROR in /Users/rmuir/workspace/lucene-trunk/lucene/highlighter/src/java/org/apache/lucene/search/highlight/TokenSources.java (at line 275)
         [ecj-lint] 	* @throws IOException
         [ecj-lint] 	          ^^^^^^^^^^^
         [ecj-lint] Javadoc: Description expected after this reference
         [ecj-lint] ----------
         [ecj-lint] 1 problem (1 error)
        
        BUILD FAILED
        
        Show
        Robert Muir added a comment - with the current patch, 'ant precommit' fails: ... -ecj-javadoc-lint-src: [ecj-lint] Compiling 45 source files [ecj-lint] ---------- [ecj-lint] 1. ERROR in /Users/rmuir/workspace/lucene-trunk/lucene/highlighter/src/java/org/apache/lucene/search/highlight/TokenSources.java (at line 275) [ecj-lint] * @throws IOException [ecj-lint] ^^^^^^^^^^^ [ecj-lint] Javadoc: Description expected after this reference [ecj-lint] ---------- [ecj-lint] 1 problem (1 error) BUILD FAILED
        Hide
        Alan Woodward added a comment -

        New patch, with CHANGES entry. Could someone cast a quick eye over it, and I'll commit.

        Show
        Alan Woodward added a comment - New patch, with CHANGES entry. Could someone cast a quick eye over it, and I'll commit.
        Hide
        Robert Muir added a comment -

        you're right: thanks!

        Show
        Robert Muir added a comment - you're right: thanks!
        Hide
        Alan Woodward added a comment -

        TokenSourcesTest.testOverlapWithOffset() tests it implicitly, although it would be nice to set the FieldType explicitly to document it I suppose.

        Show
        Alan Woodward added a comment - TokenSourcesTest.testOverlapWithOffset() tests it implicitly, although it would be nice to set the FieldType explicitly to document it I suppose.
        Hide
        Robert Muir added a comment -

        Alan this patch looks good. Thanks for tackling this.

        Do you know off-hand if we have a test anywhere for the case of offsets but NO positions (which there is logic to handle, and looks correct, I'm just curious).

        Show
        Robert Muir added a comment - Alan this patch looks good. Thanks for tackling this. Do you know off-hand if we have a test anywhere for the case of offsets but NO positions (which there is logic to handle, and looks correct, I'm just curious).
        Hide
        Alan Woodward added a comment -

        Patch, changing the API slightly:

        • getTokenStream(Terms, bool) now throws IllegalArgumentException if the Terms does not have offsets.
        • renames getTokenStream(IndexReader, int, String) to getTokenStreamWithOffsets. Not sure I like the name... Also added some javadocs here. Instead of throwing IllegalArgumentExceptions, this just returns null if it can't build a tokenstream for whatever reason (no termvector, no offsets or positions).

        The only API consumer here is DefaultSolrHighlighter, which I've edited accordingly. I also added a test to Solr to ensure that fields with termvectors and positions but no offsets are correctly highlighted.

        All tests pass

        Show
        Alan Woodward added a comment - Patch, changing the API slightly: getTokenStream(Terms, bool) now throws IllegalArgumentException if the Terms does not have offsets. renames getTokenStream(IndexReader, int, String) to getTokenStreamWithOffsets. Not sure I like the name... Also added some javadocs here. Instead of throwing IllegalArgumentExceptions, this just returns null if it can't build a tokenstream for whatever reason (no termvector, no offsets or positions). The only API consumer here is DefaultSolrHighlighter, which I've edited accordingly. I also added a test to Solr to ensure that fields with termvectors and positions but no offsets are correctly highlighted. All tests pass
        Hide
        Alan Woodward added a comment -

        It looks as though these classes are only used by the DefaultSolrHighlighter, which is broken under these circumstances. I'll make those changes, and see if any tests fail...

        Show
        Alan Woodward added a comment - It looks as though these classes are only used by the DefaultSolrHighlighter, which is broken under these circumstances. I'll make those changes, and see if any tests fail...
        Hide
        Robert Muir added a comment -

        I'm not sure whats going on here If you can sort it out it would be fantastic.

        In general if the method requires that the Terms object has offsets or positions the checks should just look like:
        if (tpv.hasOffsets())
        ...
        if (tpv.hasPositions())
        ...

        But I'm really confused what the contracts of these methods should be. definitely bugs

        Show
        Robert Muir added a comment - I'm not sure whats going on here If you can sort it out it would be fantastic. In general if the method requires that the Terms object has offsets or positions the checks should just look like: if (tpv.hasOffsets()) ... if (tpv.hasPositions()) ... But I'm really confused what the contracts of these methods should be. definitely bugs
        Hide
        Alan Woodward added a comment -

        So this snippet:

        if (!tokenPositionsGuaranteedContiguous && hasPositions(tpv))

        { return new TokenStreamFromTermPositionVector(tpv); }

        should presumably also check for offsets, and then the extra hasOffsets check in TSFTPV can be removed.

        Show
        Alan Woodward added a comment - So this snippet: if (!tokenPositionsGuaranteedContiguous && hasPositions(tpv)) { return new TokenStreamFromTermPositionVector(tpv); } should presumably also check for offsets, and then the extra hasOffsets check in TSFTPV can be removed.
        Hide
        Alan Woodward added a comment -

        Ah, but TokenStreamFromTermPositionVector checks if offsets are there, and works round it if they're not. Which is presumably also a bug?

        Show
        Alan Woodward added a comment - Ah, but TokenStreamFromTermPositionVector checks if offsets are there, and works round it if they're not. Which is presumably also a bug?
        Hide
        Robert Muir added a comment -

        but the behavior of throwing IAE is still bogus, as it disagrees with the javadocs... I just feel like there is another bug involved too.

        The javadocs are also wrong, because looking at the code logic it really needs positions AND offsets.

        Show
        Robert Muir added a comment - but the behavior of throwing IAE is still bogus, as it disagrees with the javadocs... I just feel like there is another bug involved too. The javadocs are also wrong, because looking at the code logic it really needs positions AND offsets.
        Hide
        Robert Muir added a comment -

        However, if the Terms instance passed in has positions but no offsets stored, a TokenStream is incorrectly returned, rather than null.

        Actually thats not true: you will get an IllegalArgumentException because OffsetAttribute won't allow offsets to be set to negative values (and the DocsAndPositionsEnum will return -1).

        So something else is going on here!

        Show
        Robert Muir added a comment - However, if the Terms instance passed in has positions but no offsets stored, a TokenStream is incorrectly returned, rather than null. Actually thats not true: you will get an IllegalArgumentException because OffsetAttribute won't allow offsets to be set to negative values (and the DocsAndPositionsEnum will return -1). So something else is going on here!
        Hide
        Alan Woodward added a comment -

        Patch with a test illustrating the problem

        Show
        Alan Woodward added a comment - Patch with a test illustrating the problem

          People

          • Assignee:
            Alan Woodward
            Reporter:
            Alan Woodward
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development