Lucene - Core
  1. Lucene - Core
  2. LUCENE-1590

Stored-only fields automatically enable norms and tf when added to document

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1, 2.9
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      During updating my internal components to the new TrieAPI, I have seen the following:

      I index a lot of numeric fields with trie encoding omitting norms and term frequency. This works great. Luke shows that both is omitted.

      As I sometimes also want to have the components of the field stored and want to use the same field name for it. So I add additionally the field again to the document, but stored only (as the Field c'tor using a TokenStream cannot additionally store the field). As it is stored only, I thought, that I can left out explicit setting of omitNorms and omitTermFreqAndPositions. After adding the stored-only-without-omits field, Luke shows all fields with norms enabled. I am not sure, if the norms/tf were really added to the index, but Luke shows a value for the norms and FieldInfo has it enabled.

      In my opinion, this is not intuitive, o.a.l.document.Field should switch both omit* options on when storing fields only (and also disable other indexing-only options). Alternatively the internal FieldInfo.update(boolean isIndexed, boolean storeTermVector, boolean storePositionWithTermVector, boolean storeOffsetWithTermVector, boolean omitNorms, boolean storePayloads, boolean omitTermFreqAndPositions) should only change the omit* and other options, if the isIndexed parameter (not this.isIndexed) is also true, elsewhere leave it as it is.

      In principle, when adding a stored-only field, any indexing-specific options should not be changed in FieldInfo. If the field was indexed with norms before, norms should stay enabled (but this would be the default as it is).

      1. LUCENE-1590.patch
        13 kB
        Uwe Schindler
      2. LUCENE-1590.patch
        14 kB
        Uwe Schindler
      3. LUCENE-1590.patch
        6 kB
        Uwe Schindler
      4. LUCENE-1590.patch
        4 kB
        Uwe Schindler

        Activity

        Hide
        Michael McCandless added a comment -

        Uwe are you working out a patch for this?

        Show
        Michael McCandless added a comment - Uwe are you working out a patch for this?
        Hide
        Uwe Schindler added a comment -

        Here is it, not fully tested, but seems to work at least for norms and all Lucene Tests pass.
        The changes in Field could be left out, the important thing are FieldInfo cahnges.
        When a FieldInfo is generated without indexing switched on, all the indexing-only flags are set to defaults.

        The update method, that merges the existing field infos with new ones, only updates the indexing flags, if the added field is indexed. For all merging that results in an OR operation (all excl. omitNorms) the c'tors default is false, for all flags that merge with AND (omitNorms), the c'tors default is true.

        The problem is: Luke does not show the omitTf thing, as this flag seems not be loaded by IndexReader, so I cannot find out if the omitTf was really done in index (Luke does not show this flag even for fields that were added once with omitTf).

        Finally, I wanted to add a test, that exactly does what I have done, and tests if it works.

        Show
        Uwe Schindler added a comment - Here is it, not fully tested, but seems to work at least for norms and all Lucene Tests pass. The changes in Field could be left out, the important thing are FieldInfo cahnges. When a FieldInfo is generated without indexing switched on, all the indexing-only flags are set to defaults. The update method, that merges the existing field infos with new ones, only updates the indexing flags, if the added field is indexed. For all merging that results in an OR operation (all excl. omitNorms) the c'tors default is false, for all flags that merge with AND (omitNorms), the c'tors default is true. The problem is: Luke does not show the omitTf thing, as this flag seems not be loaded by IndexReader, so I cannot find out if the omitTf was really done in index (Luke does not show this flag even for fields that were added once with omitTf). Finally, I wanted to add a test, that exactly does what I have done, and tests if it works.
        Hide
        Uwe Schindler added a comment -

        The problem is: Luke does not show the omitTf thing, as this flag seems not be loaded by IndexReader, so I cannot find out if the omitTf was really done in index (Luke does not show this flag even for fields that were added once with omitTf).

        Ahhh, I found the problem: FieldsReader does not set this flag in the generated Field instances with the flag from FieldInfo. Patch follows shortly that fixes this, too. My Luke now displays this flag, too.

        Show
        Uwe Schindler added a comment - The problem is: Luke does not show the omitTf thing, as this flag seems not be loaded by IndexReader, so I cannot find out if the omitTf was really done in index (Luke does not show this flag even for fields that were added once with omitTf). Ahhh, I found the problem: FieldsReader does not set this flag in the generated Field instances with the flag from FieldInfo. Patch follows shortly that fixes this, too. My Luke now displays this flag, too.
        Hide
        Uwe Schindler added a comment -

        Here the patch that also fixes the missing omitTf settings in FieldsReader.

        I will create the TestCase tomorrow, it's to late for that, I want to go to bed now.

        Just a note: The behavior in Luke is with this patch a little bit different: Stored-Only fields now show in Luke always "omitNorms" (O), because this is the default for non-indexed fields. As soon as a field is added that is indexed and has not explicitely set omitNorms, norms are enabled again (as noted in the FieldInfo.update() method). So everything behaves correctly, only the display is now correct in Luke.

        Show
        Uwe Schindler added a comment - Here the patch that also fixes the missing omitTf settings in FieldsReader. I will create the TestCase tomorrow, it's to late for that, I want to go to bed now. Just a note: The behavior in Luke is with this patch a little bit different: Stored-Only fields now show in Luke always "omitNorms" (O), because this is the default for non-indexed fields. As soon as a field is added that is indexed and has not explicitely set omitNorms, norms are enabled again (as noted in the FieldInfo.update() method). So everything behaves correctly, only the display is now correct in Luke.
        Hide
        Uwe Schindler added a comment -

        Here is the final patch. I added two tests (one for the bug itsself) and one for the additional bug in FieldsReader, that does not propagate omitNorms and omitTf to the Fieldable on loading. I also found a bug in FieldInfos, that forgets to add omitTf if the FieldInfos are loaded from a Document instance (I realized this during changing the FieldsReader test).

        All core tests pass.

        It would be really good, if somebody with good knowledge of the internals could review the patch, as there are a lot of things, I do not understand completely.

        Show
        Uwe Schindler added a comment - Here is the final patch. I added two tests (one for the bug itsself) and one for the additional bug in FieldsReader, that does not propagate omitNorms and omitTf to the Fieldable on loading. I also found a bug in FieldInfos, that forgets to add omitTf if the FieldInfos are loaded from a Document instance (I realized this during changing the FieldsReader test). All core tests pass. It would be really good, if somebody with good knowledge of the internals could review the patch, as there are a lot of things, I do not understand completely.
        Hide
        Michael McCandless added a comment -

        Patch looks good! All tests pass. That was trickier than I expected;
        thanks Uwe. I plan to commit in a day or two.

        It's a good catch, all the places in FieldsReader where we fail to
        carryover OTFAP from FieldInfo --> Field instance on the document.
        It's yet another example of how having the loaded Document "seem like"
        the indexed document causes problems.

        In the ideal future (I think?), the fields on a "loaded" Document
        would make no effort to convey these index-time options like
        omitNorms, OTFAP, etc., because those settings are "semi-global"
        (absorbed into the FieldInfos for the current segment). And something
        like boost, which the API lets you access on a loaded doc, is always
        wrong since we cannot recreate that (it's not stored, directly, in the
        index).

        At indexing time, all these if's all over the place to conditionalize
        the defaults depending on whether the field is indexed, are also
        spooky. It's as if we should have a separate class (IndexedField)
        that privately carries these values. Then a StoredField wouldn't even
        have them. But that approach breaks down because we'd also want an
        IndexedAndStoredField.

        Or... perhaps we move all the indexing-specific settings out of
        Field.java and into Field.Index. After all, these details really
        describe tweaks on how Lucene will do its indexing, so they don't
        really belong in the main Field.java class.

        Show
        Michael McCandless added a comment - Patch looks good! All tests pass. That was trickier than I expected; thanks Uwe. I plan to commit in a day or two. It's a good catch, all the places in FieldsReader where we fail to carryover OTFAP from FieldInfo --> Field instance on the document. It's yet another example of how having the loaded Document "seem like" the indexed document causes problems. In the ideal future (I think?), the fields on a "loaded" Document would make no effort to convey these index-time options like omitNorms, OTFAP, etc., because those settings are "semi-global" (absorbed into the FieldInfos for the current segment). And something like boost, which the API lets you access on a loaded doc, is always wrong since we cannot recreate that (it's not stored, directly, in the index). At indexing time, all these if's all over the place to conditionalize the defaults depending on whether the field is indexed, are also spooky. It's as if we should have a separate class (IndexedField) that privately carries these values. Then a StoredField wouldn't even have them. But that approach breaks down because we'd also want an IndexedAndStoredField. Or... perhaps we move all the indexing-specific settings out of Field.java and into Field.Index. After all, these details really describe tweaks on how Lucene will do its indexing, so they don't really belong in the main Field.java class.
        Hide
        Uwe Schindler added a comment -

        Patch looks good! All tests pass. That was trickier than I expected;
        thanks Uwe. I plan to commit in a day or two.

        The only tricky part was the FieldsReader. The original bug was fixed in a few lines (FieldInfo ctor and update()).

        It's a good catch, all the places in FieldsReader where we fail to
        carryover OTFAP from FieldInfo --> Field instance on the document.
        It's yet another example of how having the loaded Document "seem like"
        the indexed document causes problems.

        I am still not happy with the new FieldReader because it cannot replicate all indexing infos (but now does almost everything). I know, it does not affect functionality (as only the stored contents can be retrieved). In principle the Field instances should have no indexing options. Luke would the display nothing anymore, but for this case it would really be better to make the Field infos "public", so somebody could enumerate all fields and test then, which options were used during indexing. Mixing this with retrieval of stored fields is not good.

        One case is now not implemented correctly in FieldsReader: A binary stored field have a special if-clause in FieldsReader. The binary field is loaded as stored only, currently only omitTf and omitNorms are set (I added this). But e.g. INDEX is always false and so on. In principle for completeness, all options from FieldInfo should be replicated here.
        FieldsReader would be better to have a central method like copyFieldOptions(FieldInfo, Fieldable), that copies all options from FieldInfo to the Fieldable (without looking at the stored contents). The other if-cases should only initialize the stored parts and type. I think, I give it a try.
        The whole info is now more important: If somebody in the past had stored the string contents compressed, he must now use a binary field and compress himself. In this case, Luke would not display any indexing options anymore. This is not bad, but inconsistent.

        So the better case is to make the Field properties public not on the document level, but on the IndexReader level.

        Show
        Uwe Schindler added a comment - Patch looks good! All tests pass. That was trickier than I expected; thanks Uwe. I plan to commit in a day or two. The only tricky part was the FieldsReader. The original bug was fixed in a few lines (FieldInfo ctor and update()). It's a good catch, all the places in FieldsReader where we fail to carryover OTFAP from FieldInfo --> Field instance on the document. It's yet another example of how having the loaded Document "seem like" the indexed document causes problems. I am still not happy with the new FieldReader because it cannot replicate all indexing infos (but now does almost everything). I know, it does not affect functionality (as only the stored contents can be retrieved). In principle the Field instances should have no indexing options. Luke would the display nothing anymore, but for this case it would really be better to make the Field infos "public", so somebody could enumerate all fields and test then, which options were used during indexing. Mixing this with retrieval of stored fields is not good. One case is now not implemented correctly in FieldsReader: A binary stored field have a special if-clause in FieldsReader. The binary field is loaded as stored only, currently only omitTf and omitNorms are set (I added this). But e.g. INDEX is always false and so on. In principle for completeness, all options from FieldInfo should be replicated here. FieldsReader would be better to have a central method like copyFieldOptions(FieldInfo, Fieldable), that copies all options from FieldInfo to the Fieldable (without looking at the stored contents). The other if-cases should only initialize the stored parts and type. I think, I give it a try. The whole info is now more important: If somebody in the past had stored the string contents compressed, he must now use a binary field and compress himself. In this case, Luke would not display any indexing options anymore. This is not bad, but inconsistent. So the better case is to make the Field properties public not on the document level, but on the IndexReader level.
        Hide
        Michael McCandless added a comment -

        In principle the Field instances should have no indexing options.

        You mean retrieved fields right? I agree.

        but for this case it would really be better to make the Field infos "public", so somebody could enumerate all fields and test then, which options were used during indexing. Mixing this with retrieval of stored fields is not good.

        I agree, we should make it possible to access the "schema"
        (FieldInfos) from the index. This would presumably replace the
        getFieldNames(FieldOption) IndexReader exposes today.

        Since FieldInfos is per-segment, one challenge is how Multi*Reader
        should work. Should it simply merge on-the-fly? (ie present a single
        FieldInfo that merged the fields by the same name across all segmens)

        FieldsReader would be better to have a central method like copyFieldOptions(FieldInfo, Fieldable), that copies all options from FieldInfo to the Fieldable (without looking at the stored contents).

        This sounds like a good stop-gap measure, but I'd rather put our
        energy towards exposing the schema, decoupling "retrieved" Fields from
        indexed fields, etc.

        Show
        Michael McCandless added a comment - In principle the Field instances should have no indexing options. You mean retrieved fields right? I agree. but for this case it would really be better to make the Field infos "public", so somebody could enumerate all fields and test then, which options were used during indexing. Mixing this with retrieval of stored fields is not good. I agree, we should make it possible to access the "schema" (FieldInfos) from the index. This would presumably replace the getFieldNames(FieldOption) IndexReader exposes today. Since FieldInfos is per-segment, one challenge is how Multi*Reader should work. Should it simply merge on-the-fly? (ie present a single FieldInfo that merged the fields by the same name across all segmens) FieldsReader would be better to have a central method like copyFieldOptions(FieldInfo, Fieldable), that copies all options from FieldInfo to the Fieldable (without looking at the stored contents). This sounds like a good stop-gap measure, but I'd rather put our energy towards exposing the schema, decoupling "retrieved" Fields from indexed fields, etc.
        Hide
        Uwe Schindler added a comment -

        Since FieldInfos is per-segment, one challenge is how Multi*Reader should work. Should it simply merge on-the-fly? (ie present a single FieldInfo that merged the fields by the same name across all segmens)

        Maybe merge with the already existing FieldInfos/FieldInfo methods.

        A new case for this would be good, after thinking a little bit about it, I may open one. But in general it should be combined with the Document/Fields redesign.

        This sounds like a good stop-gap measure, but I'd rather put our energy towards exposing the schema, decoupling "retrieved" Fields from indexed fields, etc.

        Yes, and it will not work. I think, we leave the patch as it is, maybe remove the omitTf and omitNorms update for binary fields. Binary fields are "special".

        Show
        Uwe Schindler added a comment - Since FieldInfos is per-segment, one challenge is how Multi*Reader should work. Should it simply merge on-the-fly? (ie present a single FieldInfo that merged the fields by the same name across all segmens) Maybe merge with the already existing FieldInfos/FieldInfo methods. A new case for this would be good, after thinking a little bit about it, I may open one. But in general it should be combined with the Document/Fields redesign. This sounds like a good stop-gap measure, but I'd rather put our energy towards exposing the schema, decoupling "retrieved" Fields from indexed fields, etc. Yes, and it will not work. I think, we leave the patch as it is, maybe remove the omitTf and omitNorms update for binary fields. Binary fields are "special".
        Hide
        Michael McCandless added a comment -

        Maybe merge with the already existing FieldInfos/FieldInfo methods.

        And we should think about flexible indexing, ie, make FieldInfo extensible. I think there are two separate questions, here:

        • What API doe we expose for the "schema" (FieldInfo/s)?
        • How to handle the fact that each segment has its own "schema" (hide it, by virtually merging the way SegmentMerger would, or, expose it)?

        A new case for this would be good, after thinking a little bit about it, I may open one. But in general it should be combined with the Document/Fields redesign.

        I agree, a new issue.

        Yes, and it will not work. I think, we leave the patch as it is, maybe remove the omitTf and omitNorms update for binary fields. Binary fields are "special".

        Do you want to make a new patch (removing omit* update for binary fields)?

        Show
        Michael McCandless added a comment - Maybe merge with the already existing FieldInfos/FieldInfo methods. And we should think about flexible indexing, ie, make FieldInfo extensible. I think there are two separate questions, here: What API doe we expose for the "schema" (FieldInfo/s)? How to handle the fact that each segment has its own "schema" (hide it, by virtually merging the way SegmentMerger would, or, expose it)? A new case for this would be good, after thinking a little bit about it, I may open one. But in general it should be combined with the Document/Fields redesign. I agree, a new issue. Yes, and it will not work. I think, we leave the patch as it is, maybe remove the omitTf and omitNorms update for binary fields. Binary fields are "special". Do you want to make a new patch (removing omit* update for binary fields)?
        Hide
        Uwe Schindler added a comment -

        Do you want to make a new patch (removing omit* update for binary fields)?

        Here it is, FieldsReader is now equal to the pre-previous patch, tests still pass. Binary field handling is now unchanged.

        Show
        Uwe Schindler added a comment - Do you want to make a new patch (removing omit* update for binary fields)? Here it is, FieldsReader is now equal to the pre-previous patch, tests still pass. Binary field handling is now unchanged.
        Hide
        Uwe Schindler added a comment -

        I forgot to add a change-note in changes.txt. As the behaviour is now a little bit different (stored-only fields now have omitNorms switched "on" per default), there should be a note in changes.txt. Can you add it when committing. I am not sure, what to add there.

        Show
        Uwe Schindler added a comment - I forgot to add a change-note in changes.txt. As the behaviour is now a little bit different (stored-only fields now have omitNorms switched "on" per default), there should be a note in changes.txt. Can you add it when committing. I am not sure, what to add there.
        Hide
        Michael McCandless added a comment -

        Thanks Uwe. I'll add a CHANGES entry.

        Show
        Michael McCandless added a comment - Thanks Uwe. I'll add a CHANGES entry.
        Hide
        Michael McCandless added a comment -

        Thanks Uwe!

        Show
        Michael McCandless added a comment - Thanks Uwe!

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development