SOLR-2890

omitTermFreqAndPositions and omitNorms don't work properly when used on fieldTypes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4
    • Fix Version/s: 4.1, 5.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Setting omitTermFreqAndPositions="true" doesn't work when I put it on a fieldType definition for my text field. It did work when I put it on the field definition. I think this option and probably all options should be settable at the fieldType level. I did some investigation and found that the value of this option was being reset on line 54 of TextField.

      FYI I am trying to put this on a field type for use by the SpellCheck component which has no use for term frequencies and positions from the source field.

          Activity

          Andy Lester added a comment -

          I believe this is a Bug, not an Improvement, and that it is not Minor.

          The docs at http://wiki.apache.org/solr/SchemaXml explicitly state that "Common options that field types can have are..." and list omitTermFreqAndPositions.

          In my case, I created a custom type for ISBNs specified like so:

          <fieldType name="isbn" class="solr.TextField" stored="true" sortMissingLast="true" omitNorms="true" omitTermFreqAndPositions="true">
            <analyzer>
              <!-- Remove anything not a digit or X -->
              <charFilter class="solr.PatternReplaceCharFilterFactory"
                          pattern="[^0-9Xx]"
                          replacement=""
                          replace="all"
              />
              <tokenizer class="solr.KeywordTokenizerFactory" />
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
          </fieldType>

          with the field definition like so:

          <field name="isbn" type="isbn" multiValued="true" />

          It was surprising, then, to find that my core's index directory had 600MB of *.prx files, when there should not be any position information anywhere in the core.

          When I then updated the field definition to:

          <field name="isbn" type="isbn" omitTermFreqAndPositions="true" multiValued="true" />

          and reindexed the core, the *.prx files were no longer created.

          Based on David Smiley's reading of the code in TextField.java, the culprit seems to be:

          if (schema.getVersion()> 1.1f) properties &= ~OMIT_TF_POSITIONS;

          which is at least reassuring that omitNorms and omitPositions seem to be unchanged.

          The fix to this could be as simple as changing the wiki to state that omitTermFreqAndPositions must be specified at the field level.
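
          To make the suspected mechanism concrete, here is a minimal sketch of how the quoted line could discard a type-level setting. It is a simplified, hypothetical model and not the actual Solr source: the class name, method names, and bit value are invented for illustration, and it assumes the fieldType attributes are applied before the quoted line runs, which is what the symptoms suggest.

```java
// Simplified, hypothetical model of the flag handling; NOT the actual Solr source.
final class OmitFlagSketch {
    // Illustrative bit value; the real constant in Solr may differ.
    static final int OMIT_TF_POSITIONS = 0x2;

    /** Suspected buggy ordering: explicit attribute first, text-field default last. */
    static int buggyInit(float schemaVersion, boolean omitTfpOnType) {
        // General default for newer schema versions: omit term freqs and positions.
        int properties = (schemaVersion > 1.1f) ? OMIT_TF_POSITIONS : 0;
        // Explicit omitTermFreqAndPositions="true" on the fieldType.
        if (omitTfpOnType) properties |= OMIT_TF_POSITIONS;
        // The line quoted above: clears the flag, including the explicit setting.
        if (schemaVersion > 1.1f) properties &= ~OMIT_TF_POSITIONS;
        return properties;
    }

    /** One possible alternative ordering: settle the defaults, then apply the explicit attribute. */
    static int defaultsFirstInit(float schemaVersion, boolean omitTfpOnType) {
        int properties = (schemaVersion > 1.1f) ? OMIT_TF_POSITIONS : 0;
        // Text fields keep positions by default.
        if (schemaVersion > 1.1f) properties &= ~OMIT_TF_POSITIONS;
        // The explicit attribute is applied last, so it survives.
        if (omitTfpOnType) properties |= OMIT_TF_POSITIONS;
        return properties;
    }

    static boolean omitsPositions(int properties) {
        return (properties & OMIT_TF_POSITIONS) != 0;
    }

    public static void main(String[] args) {
        // prints "buggy ordering omits positions: false"
        System.out.println("buggy ordering omits positions: "
                + omitsPositions(buggyInit(1.4f, true)));
        // prints "defaults-first omits positions: true"
        System.out.println("defaults-first omits positions: "
                + omitsPositions(defaultsFirstInit(1.4f, true)));
    }
}
```

          In this model the only difference between the two methods is the order of the last two steps, which would explain why the attribute works on a field but is silently lost on a fieldType.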

          David Smiley added a comment -

          --then change the wiki

          But this is still a problem, sure. It's "minor" because the work-around is trivial.

          Minor aside: You should use a regex at the TokenFilter level, not the CharFilter. It's a bit slower at the CharFilter and there might be problems with highlighting if you use the regex to change the field length. CharFilters are designed to work in advance of the Tokenizer when you need to modify what the Tokenizer sees. There will never be such a problem with KeywordTokenizer.

          Yonik Seeley added a comment -

          IIRC, I think the intent was to exclude positions by default for all the field types that didn't need them (except for text field which by default would).
          If the option is set on a fieldType, it should become the default for any fields using that fieldType.
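
          The inheritance described here can be sketched as a three-level lookup: an explicit setting on the field, else an explicit setting on the fieldType, else the built-in default. The class and method below are hypothetical and only illustrate the intended precedence, not how Solr implements it; null stands for "attribute not specified".

```java
// Hypothetical illustration of the intended precedence; not Solr's implementation.
final class OmitPrecedenceSketch {
    /**
     * Resolves the effective value of an option such as omitTermFreqAndPositions.
     * A null argument means the attribute was not specified at that level.
     */
    static boolean effective(Boolean onField, Boolean onFieldType, boolean builtInDefault) {
        if (onField != null) return onField;         // explicit <field .../> attribute wins
        if (onFieldType != null) return onFieldType; // else the <fieldType .../> attribute
        return builtInDefault;                       // else the default for the type and schema version
    }

    public static void main(String[] args) {
        // Text field (default: keep positions) whose fieldType asks to omit them.
        System.out.println(effective(null, Boolean.TRUE, false));          // prints "true"
        // The same fieldType, but the field explicitly turns the option back off.
        System.out.println(effective(Boolean.FALSE, Boolean.TRUE, false)); // prints "false"
    }
}
```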

          Hoss Man added a comment -

          This seems like a really bad bug for two reasons:

          1) even if there is a trivial work around, it's the kind of thing that most users aren't going to be savvy enough to even realize isn't working properly (ie: it has no obvious "ERROR") ... you really have to go out of your way to discover that the extra data is in your index even though you asked for it not to be.

          2) it appears to have been broken for years and yet none of the tests anyone has written in that time have managed to tickle it to make anyone notice.

          So I spent a bit of time trying to write an exhaustive test of the way all the different version-specific default props work, to prove that the defaults did what they should, and that overriding them did what it should – which led me to discover there is a similar problem with omitNorms on fieldTypes.

          I'm updating the summary to note this for future searchers, and i'll attach my patch with test and fixes for review

          Hoss Man added a comment -

          patch with fix & tests

          David Smiley added a comment -

          +1 Great work Hoss! I'm sure developing that test was non-trivial.

          Commit Tag Bot added a comment -

          [trunk commit] Chris M. Hostetter
          http://svn.apache.org/viewvc?view=revision&revision=1415817

          SOLR-2890: Fixed a bug that prevented omitNorms and omitTermFreqAndPositions options from being respected in some <fieldType/> declarations

          Commit Tag Bot added a comment -

          [branch_4x commit] Chris M. Hostetter
          http://svn.apache.org/viewvc?view=revision&revision=1415837

          SOLR-2890: Fixed a bug that prevented omitNorms and omitTermFreqAndPositions options from being respected in some <fieldType/> declarations (merge r1415817)

            People

            • Assignee: Hoss Man
            • Reporter: David Smiley
            • Votes: 0
            • Watchers: 6
