Lucene - Core
LUCENE-3854

Non-tokenized fields become tokenized when a document is deleted and added back

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      https://github.com/bimargulies/lucene-4-update-case is a JUnit test case that seems to show a problem with the current trunk. It creates a document with a Field typed as StringField.TYPE_STORED and a value with a "-" in it. A TermQuery can find the value, initially, since the field is not tokenized.

      Then, the test reads the Document back through a reader. In the copy of the Document that is read out, the Field now has the tokenized bit turned on.

      Next, the test deletes and re-adds the Document. The 'tokenized' bit is respected, so the field now gets tokenized, and the result is that the query on the term containing the '-' no longer matches.

      So I think the defect here is in the code that reconstructs the Document when it is read from the index, which turns on the tokenized bit.

      I have an ICLA on file, so you can take this code from GitHub, but if you prefer I can also attach it here.
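The steps in the linked test case can be sketched as follows. This is a minimal reconstruction against the Lucene 4.x API, not the reporter's actual code; the field name "id" and value "a-b" are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class UpdateCaseSketch {
    public static void main(String[] args) throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40,
                new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(dir, cfg);

        // Index a document whose field is stored but NOT tokenized.
        Document doc = new Document();
        doc.add(new Field("id", "a-b", StringField.TYPE_STORED));
        writer.addDocument(doc);
        writer.commit();

        // The whole value was indexed as a single term, so an exact
        // TermQuery matches.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("id", "a-b")), 1);
        System.out.println("before round-trip: " + hits.totalHits); // 1

        // Read the document back; per the report, the retrieved copy's
        // FieldType now claims tokenized=true.
        Document stored = searcher.doc(hits.scoreDocs[0].doc);
        System.out.println("tokenized after read: "
                + stored.getField("id").fieldType().tokenized());

        // Delete and re-add the retrieved copy. Because the tokenized
        // bit is now set, "a-b" gets analyzed into separate tokens, and
        // the report says the original TermQuery no longer matches.
        writer.deleteDocuments(new Term("id", "a-b"));
        writer.addDocument(stored);
        writer.commit();

        DirectoryReader reader2 = DirectoryReader.open(dir);
        TopDocs hits2 = new IndexSearcher(reader2)
                .search(new TermQuery(new Term("id", "a-b")), 1);
        System.out.println("after round-trip: " + hits2.totalHits);

        reader.close();
        reader2.close();
        writer.close();
        dir.close();
    }
}
```

On an affected build, the final hit count drops to 0; the fix the report argues for would keep the stored copy's tokenized bit off so the round-trip is lossless.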

        Activity

        Benson Margulies created issue -

          People

          • Assignee: Unassigned
          • Reporter: Benson Margulies
          • Votes: 0
          • Watchers: 3
