Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-763

LuceneDictionary skips first word in enumeration

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.2
    • Component/s: core/other
    • Labels:
      None
    • Environment:

      Windows Sun JRE 1.4.2_10_b03

    • Lucene Fields:
      New

      Description

      The current code for LuceneDictionary will always skip the first word of the TermEnum. The reason is that it doesn't initially retrieve TermEnum.term - its first call is to TermEnum.next, which moves it past the first term (line 76).
      To see this problem cause a failure, add this test to TestSpellChecker:
      similar = spellChecker.suggestSimilar("eihgt",2);
      assertEquals(1, similar.length);
      assertEquals(similar[0], "eight");

      Because "eight" is the first word in the index, it will fail.

      1. LuceneDictionary.java
        3 kB
        Christian Mallwitz
      2. TestLuceneDictionary.java
        7 kB
        Christian Mallwitz

        Activity

        Hide
        steven_parkes Steven Parkes added a comment -

        I was wondering about something very similar just recently: to call TermEnum.next() or not to call TermEnum.next() to get the first term. However, in my case I use terms() rather than terms( Term ) and there's the rub.

        After looking through things, there looks to be an inconsistency between the two cases. terms( Term ) seeks such that the new TermEnum object is ready. On the other hand, terms() leaves the enum state "before" the first term: you need to call next() first and calling term() earlier will return null.

        I've only tried this against SegmentReader#terms(...).

        This difference of behaviour isn't mentioned in the documentation.

        It would seem like it would be nice to have the same behaviour between the two calls but I'm a little worried that half the existing code would break. Should we just document the existing behaviour?

        In that case, the spell checker does just need to get rid of the extra next() call.

        While investigating, I noticed there are several other issues around the spell checker now, both the functional code and test code. It plays a bit fast and loose with when index readers and writers are opened. Perhaps it used to work, depending on when things got flushed to disk, but it doesn't work for me now under the trunk.

        Show
        steven_parkes Steven Parkes added a comment - I was wondering about something very similar just recently: to call TermEnum.next() or not to call TermEnum.next() to get the first term. However, in my case I use terms() rather than terms( Term ) and there's the rub. After looking through things, there looks to be an inconsistency between the two cases. terms( Term ) seeks such that the new TermEnum object is ready. On the other hand, terms() leaves the enum state "before" the first term: you need to call next() first and calling term() earlier will return null. I've only tried this against SegmentReader#terms(...). This difference of behaviour isn't mentioned in the documentation. It would seem like it would be nice to have the same behaviour between the two calls but I'm a little worried that half the existing code would break. Should we just document the existing behaviour? In that case, the spell checker does just need to get rid of the extra next() call. While investigating, I noticed there are several other issues around the spell checker now, both the functional code and test code. It plays a bit fast and loose with when index readers and writers are opened. Perhaps it used to work, depending on when things got flushed to disk, but it doesn't work for me now under the trunk.
        Hide
        dtertman Dan Ertman added a comment -

        Ah, that makes sense. So the one basically behaves like ResultSet - the marker is before the first entry when initialized. Unfortunately SpellChecker uses the other.

        Show
        dtertman Dan Ertman added a comment - Ah, that makes sense. So the one basically behaves like ResultSet - the marker is before the first entry when initialized. Unfortunately SpellChecker uses the other.
        Hide
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment -

        This is a fixed LuceneDictionary.

        Show
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment - This is a fixed LuceneDictionary.
        Hide
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment -

        This a unit test case for LuceneDictionary making sure it doesn't skip any of the words in the original index.

        Show
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment - This a unit test case for LuceneDictionary making sure it doesn't skip any of the words in the original index.
        Hide
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment -

        I have added a fixed LuceneDictionary.java and a unit test case for it which should go to
        contrib/spellchecker/src/java/org/apache/lucene/search/spell/LuceneDictionary.java
        and
        contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLuceneDictionary.java
        respectively.

        This is on top of the current lucene-trunk.

        Cheers
        Christian

        Show
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment - I have added a fixed LuceneDictionary.java and a unit test case for it which should go to contrib/spellchecker/src/java/org/apache/lucene/search/spell/LuceneDictionary.java and contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLuceneDictionary.java respectively. This is on top of the current lucene-trunk. Cheers Christian
        Hide
        lucenebugs@danielnaber.de Daniel Naber added a comment -

        Thanks for your patch. I think there's a problem with the iterator which might not occur often, but it should be fixed nonetheless: calling next() only has an effect if hasNext() has been called before. You can see that by commenting out "assertTrue("Second element doesn't exist.", it.hasNext());" in the test case: the test will then fail, although, to my understanding, hasNext() should have no side effects. Could you change you patch accordingly?

        Show
        lucenebugs@danielnaber.de Daniel Naber added a comment - Thanks for your patch. I think there's a problem with the iterator which might not occur often, but it should be fixed nonetheless: calling next() only has an effect if hasNext() has been called before. You can see that by commenting out "assertTrue("Second element doesn't exist.", it.hasNext());" in the test case: the test will then fail, although, to my understanding, hasNext() should have no side effects. Could you change you patch accordingly?
        Hide
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment -

        New extended unit test case for class LuceneDictionary

        Show
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment - New extended unit test case for class LuceneDictionary
        Hide
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment -

        Fixed class LuceneDictionary

        Show
        christian.mallwitz@xbridge.com Christian Mallwitz added a comment - Fixed class LuceneDictionary
        Hide
        lucenebugs@danielnaber.de Daniel Naber added a comment -

        Thanks, patch applied.

        Show
        lucenebugs@danielnaber.de Daniel Naber added a comment - Thanks, patch applied.
        Hide
        steven_parkes Steven Parkes added a comment -

        Can we also update the javadocs to reflect the different semantics between terms() and terms(term)? Here's some possible verbage. (Also tweaks the "after the given term" which I think isn't correct?)

         
        Index: src/java/org/apache/lucene/index/IndexReader.java
        ===================================================================
        --- src/java/org/apache/lucene/index/IndexReader.java   (revision 543284)
        +++ src/java/org/apache/lucene/index/IndexReader.java   (working copy)
        @@ -539,16 +539,21 @@
             setNorm(doc, field, Similarity.encodeNorm(value));
           }
         
        -  /** Returns an enumeration of all the terms in the index.
        -   * The enumeration is ordered by Term.compareTo().  Each term
        -   * is greater than all that precede it in the enumeration.
        +  /** Returns an enumeration of all the terms in the index.  The
        +   * enumeration is ordered by Term.compareTo().  Each term is greater
        +   * than all that precede it in the enumeration.  Note that after
        +   * calling {@link #terms()}, {@link TermEnum#next()} must be called
        +   * on the resulting enumeration before calling other methods such as
        +   * {@link TermEnum#term()}.
            * @throws IOException if there is a low-level IO error
            */
           public abstract TermEnum terms() throws IOException;
         
        -  /** Returns an enumeration of all terms after a given term.
        -   * The enumeration is ordered by Term.compareTo().  Each term
        -   * is greater than all that precede it in the enumeration.
        +  /** Returns an enumeration of all terms starting at a given term. If
        +   * the given term does not exist, the enumeration is positioned a the
        +   * first term greater than the supplied therm.  The enumeration is
        +   * ordered by Term.compareTo().  Each term is greater than all that
        +   * precede it in the enumeration.
            * @throws IOException if there is a low-level IO error
            */
           public abstract TermEnum terms(Term t) throws IOException;
        
        Show
        steven_parkes Steven Parkes added a comment - Can we also update the javadocs to reflect the different semantics between terms() and terms(term)? Here's some possible verbage. (Also tweaks the "after the given term" which I think isn't correct?) Index: src/java/org/apache/lucene/index/IndexReader.java =================================================================== --- src/java/org/apache/lucene/index/IndexReader.java (revision 543284) +++ src/java/org/apache/lucene/index/IndexReader.java (working copy) @@ -539,16 +539,21 @@ setNorm(doc, field, Similarity.encodeNorm(value)); } - /** Returns an enumeration of all the terms in the index. - * The enumeration is ordered by Term.compareTo(). Each term - * is greater than all that precede it in the enumeration. + /** Returns an enumeration of all the terms in the index. The + * enumeration is ordered by Term.compareTo(). Each term is greater + * than all that precede it in the enumeration. Note that after + * calling {@link #terms()}, {@link TermEnum#next()} must be called + * on the resulting enumeration before calling other methods such as + * {@link TermEnum#term()}. * @throws IOException if there is a low-level IO error */ public abstract TermEnum terms() throws IOException; - /** Returns an enumeration of all terms after a given term. - * The enumeration is ordered by Term.compareTo(). Each term - * is greater than all that precede it in the enumeration. + /** Returns an enumeration of all terms starting at a given term. If + * the given term does not exist, the enumeration is positioned a the + * first term greater than the supplied therm. The enumeration is + * ordered by Term.compareTo(). Each term is greater than all that + * precede it in the enumeration. * @throws IOException if there is a low-level IO error */ public abstract TermEnum terms(Term t) throws IOException;
        Hide
        lucenebugs@danielnaber.de Daniel Naber added a comment -

        Thanks, Steven. Your javadoc changes have also been committed now.

        Show
        lucenebugs@danielnaber.de Daniel Naber added a comment - Thanks, Steven. Your javadoc changes have also been committed now.

          People

          • Assignee:
            Unassigned
            Reporter:
            dtertman Dan Ertman
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development