Lucene - Core
LUCENE-4880

Difference in offset handling between IndexReader created by MemoryIndex and one created by RAMDirectory

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.2
    • Fix Version/s: 4.3, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Environment:

      Windows 7 (probably irrelevant)

    • Lucene Fields:
      New

      Description

      MemoryIndex skips tokens of length 0 when building the index. As a result, it does not increment the token offset for such tokens (nor does it store the position offsets, if that option is set). A regular index (via, say, RAMDirectory) does not appear to do this.

      When using the ICUFoldingFilter, it is possible to have a term of zero length (the \u0640 character separated by spaces). If that occurs in a document, the offsets returned at search time differ between the MemoryIndex and a regular index.
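      The mismatch described above can be illustrated with a toy model. This is only a sketch and does not use the Lucene API: the class ZeroLengthTermSketch and its positions method are hypothetical names, every token is assumed to have a position increment of 1, and the "token offset" in the description is assumed to mean the running token position. The skipEmpty flag stands in for the MemoryIndex behaviour reported here; without it the loop behaves like the regular index.

```java
import java.util.ArrayList;
import java.util.List;

class ZeroLengthTermSketch {

    // Toy indexer (not Lucene code): assigns a position to each term,
    // assuming every token has a position increment of 1. With skipEmpty
    // set, zero-length terms are dropped before the counter advances.
    static List<String> positions(String[] terms, boolean skipEmpty) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (String term : terms) {
            if (skipEmpty && term.isEmpty()) {
                continue; // the skipped token never advances the position
            }
            pos += 1;
            out.add(term + "@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical analyzer output for "a U+0640 b" once folding has
        // reduced the middle token to the empty string.
        String[] terms = {"a", "", "b"};
        System.out.println("skipping empty terms: " + positions(terms, true));
        System.out.println("keeping empty terms:  " + positions(terms, false));
    }
}
```

      In this model the term after the empty token ends up at a different position depending on whether the empty token was skipped, which is the kind of search-time difference the report describes.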

      1. LUCENE-4880.patch
        4 kB
        Robert Muir
      2. MemoryIndexVsRamDirZeroLengthTermTest.java
        7 kB
        Tim Allison

        Activity

        Robert Muir added a comment -

        Thanks for raising this Timothy.

        I think it's a bug in MemoryIndex: it shouldn't skip terms that are of zero length.

        Uwe Schindler added a comment - edited

        Yes, this is a bug in MemoryIndex. In earlier Lucene versions I think we skipped empty terms in the standard IndexWriter, but that's no longer the case. So MemoryIndex must be consistent.

        Robert Muir added a comment -

        I also think it's stupid that you get 0640 as a token by itself in any case. I don't agree with the Unicode property of "letter" for this character, as that doesn't make sense to me; in my opinion it should be "format". I sure hope there is some good reason for this, but to me it's crazy.
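        For reference, the "letter" property mentioned above can be checked from plain Java. This is a small sketch using only java.lang.Character; the class name TatweelCategory and its helper method are hypothetical, and the JDK's own Unicode tables are assumed to match whatever the tokenizer uses. U+0640 (ARABIC TATWEEL) has general category Lm (modifier letter), not Cf (format).

```java
class TatweelCategory {

    // True if the JDK's Unicode data puts the code point in general
    // category Lm (modifier letter).
    static boolean isModifierLetter(int codePoint) {
        return Character.getType(codePoint) == Character.MODIFIER_LETTER;
    }

    public static void main(String[] args) {
        int tatweel = 0x0640; // ARABIC TATWEEL
        System.out.println("isLetter:         " + Character.isLetter(tatweel));
        System.out.println("isModifierLetter: " + isModifierLetter(tatweel));
        System.out.println("isFormat:         "
                + (Character.getType(tatweel) == Character.FORMAT));
    }
}
```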

        Robert Muir added a comment -

        Attached is a fix with tests.

        Robert Muir added a comment -

        Thanks Timothy!

        Tim Allison added a comment -

        Thank you!

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee:
            Unassigned
            Reporter:
            Tim Allison
          • Votes:
            0
            Watchers:
            3
