SOLR-1105: Use a different stored field for highlighting

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: highlighter
    • Labels: None

    Description

      DefaultSolrHighlighter highlights using the stored content of the field being highlighted. This is a disadvantage for multilingual indexing: the index grows quickly because several fields have to be stored with the same content. This patch lets DefaultSolrHighlighter use a "contentField" attribute to look up the content in another stored field.

      Excerpt from old schema:

      <field name="title" type="text" stored="true" indexed="true" />
      <field name="title_ru" type="text_ru" stored="true" indexed="true" />
      <field name="title_en" type="text_en" stored="true" indexed="true" />
      <field name="title_de" type="text_de" stored="true" indexed="true" />
      

      The same schema after patching; the highlighter now reads the content stored in the "title" field:

      <field name="title" type="text" stored="true" indexed="true" />
      <field name="title_ru" type="text_ru" stored="false" indexed="true" contentField="title"/>
      <field name="title_en" type="text_en" stored="false" indexed="true" contentField="title"/>
      <field name="title_de" type="text_de" stored="false" indexed="true" contentField="title"/>
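
      The lookup described above can be sketched in a few lines of Python. This is only a toy model of the idea, not code from the patch: the SCHEMA dictionary and the resolve_content_field helper are hypothetical names, and "contentField" is the attribute proposed in this issue rather than a stock Solr schema attribute.

```python
# Toy model of the patched schema excerpt. "contentField" is the attribute
# proposed in this issue, not a standard Solr schema attribute.
SCHEMA = {
    "title":    {"stored": True},
    "title_ru": {"stored": False, "contentField": "title"},
    "title_en": {"stored": False, "contentField": "title"},
    "title_de": {"stored": False, "contentField": "title"},
}


def resolve_content_field(field_name, schema=SCHEMA):
    """Return the name of the stored field whose text should be highlighted."""
    props = schema[field_name]
    if props.get("stored"):
        # A stored field supplies its own content.
        return field_name
    source = props.get("contentField")
    if source is None or not schema.get(source, {}).get("stored"):
        raise ValueError(f"no stored content available for field {field_name!r}")
    return source
```

      With this model, title_ru, title_en and title_de all resolve to "title", so the text only needs to be stored once.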
      

      Attachments

        1. SOLR-1105_shared_content_field_1.3.0.patch
          4 kB
          Dmitry Lihachev
        2. SOLR-1105-1_4_1.patch
          4 kB
          Evgeniy Serykh
        3. SOLR-1105.patch
          29 kB
          Julien Martin

          Activity

            shalin Shalin Shekhar Mangar added a comment -

            Instead of baking this into the schema, should this be turned on/off through a request parameter?

            hossman Chris M. Hostetter added a comment -

            Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

            http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

            Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

            A unique token for finding these 240 issues in the future: hossversioncleanup20100527

            sev Evgeniy Serykh added a comment - - edited

            Fixed for Solr 1.4.1.

            Usage in solrconfig.xml:

            <str name="f.content_ru.hl.contentField">content</str>
            <str name="f.content_en.hl.contentField">content</str>
            
            <str name="f.title_ru.hl.contentField">title</str>
            <str name="f.title_en.hl.contentField">title</str>
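
            The f.<field>.<param> names above follow Solr's usual per-field override convention. A minimal sketch of how such a lookup could resolve, assuming a plain dictionary of request parameters (get_field_param and content_field_for are hypothetical helpers, not Solr APIs, and hl.contentField exists only in the patches attached to this issue):

```python
# Parameters as they would arrive from the solrconfig.xml defaults above.
PARAMS = {
    "f.content_ru.hl.contentField": "content",
    "f.content_en.hl.contentField": "content",
    "f.title_ru.hl.contentField": "title",
    "f.title_en.hl.contentField": "title",
}


def get_field_param(params, field, name, default=None):
    """Look up f.<field>.<name> first, then fall back to the global <name>."""
    per_field_key = f"f.{field}.{name}"
    if per_field_key in params:
        return params[per_field_key]
    return params.get(name, default)


def content_field_for(field, params=PARAMS):
    """Return the field to read content from; defaults to the field itself."""
    return get_field_param(params, field, "hl.contentField", default=field)
```

            Fields without an override, such as a hypothetical "author" field, fall back to their own stored content.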
            
            rcmuir Robert Muir added a comment -

            Bulk move 3.2 -> 3.3

            rcmuir Robert Muir added a comment -

            3.4 -> 3.5


            hossman Chris M. Hostetter added a comment -

            Bulk move of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

            Email notification suppressed to prevent mass-spam.
            Pseudo-unique token identifying these issues: hoss20120321nofix36

            sarowe Steven Rowe added a comment -

            Bulk move 4.4 issues to 4.5 and 5.0

            uschindler Uwe Schindler added a comment -

            Move issue to Solr 4.9.

            dsmiley David Smiley added a comment -

            This would be very useful indeed.

            mcaruanagalizia Matthew Caruana Galizia added a comment - - edited

            It also doesn't seem too difficult to implement on the UnifiedHighlighter, at least for someone who's familiar with the code.

            Julm Julien Martin added a comment - - edited

            Here is a patch proposal (SOLR-1105.patch).

            dsmiley David Smiley added a comment -

            Thanks for contributing a patch Julien! I didn't thoroughly review it but one thing caught my attention – you added new parameters to the existing highlight methods on UnifiedHighlighter. I think this atypical use-case doesn't warrant that. Instead, notice much of the UH's configurability is from override-able methods on the UH.

            As an aside, I'm starting to wonder if there should be a "HighlightCommand" (or HighlightOptions) class that holds all the options (via subclassing) so that the UH needn't be subclassed to do 99% of use-cases.... I dunno. That's out of scope here of course. Assuming it's a separate source file, it would also help keep the sprawling UH source file in check. CC Timothy055

            Another issue I see is that (a) with this feature we want the ability to highlight multiple fields yet potentially use the same stored field, and (b) in that case, we only want to load it once. It's not clear whether this patch takes that into consideration. Again, I didn't thoroughly look over the patch yet.
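
            Points (a) and (b) can be illustrated with a small sketch that groups the fields to highlight by the stored field supplying their text, then loads each stored field once. This is an illustration only, not how UnifiedHighlighter or the patch is implemented; plan_field_loads, load_texts_once and the doc_loader callback are hypothetical names.

```python
from collections import defaultdict


def plan_field_loads(highlight_fields, content_field_of):
    """Group fields to highlight by the stored field that supplies their text."""
    groups = defaultdict(list)
    for field in highlight_fields:
        # Fields without a mapping read their own stored content.
        groups[content_field_of.get(field, field)].append(field)
    return dict(groups)


def load_texts_once(doc_loader, highlight_fields, content_field_of):
    """Return {highlight field: text}, calling doc_loader once per stored field."""
    texts = {}
    plan = plan_field_loads(highlight_fields, content_field_of)
    for stored_field, targets in plan.items():
        text = doc_loader(stored_field)  # single load, shared by all targets
        for target in targets:
            texts[target] = text
    return texts
```

            Highlighting title_ru, title_en and title_de against a shared "title" source then triggers one load of "title" rather than three.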

            Julm Julien Martin added a comment -

            Thank you for looking at it, David! We really need the feature over here.

            As for unique field loading, my understanding is that the stored fields visitor pattern applied to the index searcher object ensures that no field is loaded twice per document.

            But this was a good point anyway because I had other issues with multiple fields highlighting which I solved in a new version of the patch you can find attached here.

            Sincerely,
            Julien

            dsmiley David Smiley added a comment -

            I propose separating this issue into a Lucene portion and a Solr portion. I have some thoughts on the Lucene side but I'll save them for later when you post that.

            I like the "hl.contentField" param name. You declared it in HighlightParams in a spot that I think should be down in the "misc" category (pretty minor).

            Why did you add a boolean flag for this to FieldProperties with the related modification to SchemaField accordingly?

            dsmiley David Smiley added a comment -

            Why did you add a boolean flag for this to FieldProperties with the related modification to SchemaField accordingly?

            Answering my own question... I could imagine that it's useful metadata for a non-stored field to declare that some other field is the source of its indexed/analyzed text. But the schema already internally tracks copyField source/destination data. Maybe what we could do is have highlighting automatically work on a non-stored field when we see that the field to be highlighted is a copyField target? Then, in practice, most users wouldn't even need to specify hl.contentField (though as an explicit option, it's still nice).
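
            A rough sketch of that copyField idea, assuming the schema's copyField rules are available as (source, destination) pairs. The COPY_FIELDS rules, the STORED set and infer_content_field are hypothetical and only mirror the title example from the description; nothing here reflects Solr's internal schema classes.

```python
# Hypothetical copyField rules as (source, destination) pairs.
COPY_FIELDS = [
    ("title", "title_ru"),
    ("title", "title_en"),
    ("title", "title_de"),
]
STORED = {"title"}  # names of stored fields in this toy schema


def infer_content_field(field, explicit=None, copy_fields=COPY_FIELDS, stored=STORED):
    """Pick the stored field to highlight from, or None if it cannot be decided."""
    if explicit:
        # An explicit hl.contentField-style option always wins.
        return explicit
    if field in stored:
        return field
    sources = [src for src, dest in copy_fields if dest == field and src in stored]
    # Only infer when exactly one stored source feeds this field.
    return sources[0] if len(sources) == 1 else None
```

            Returning None for ambiguous cases (no stored source, or several) leaves room for the explicit option to stay useful.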

            Julm Julien Martin added a comment -

            Thanks for your comments David.

            Automatic highlighting on copyField targets would be nice indeed.

            I just created the Lucene portion issue at https://issues.apache.org/jira/browse/LUCENE-7768


            cpoerschke Christine Poerschke added a comment -

            dsmiley wrote on SOLR-16111:

            Can this be used to solve https://issues.apache.org/jira/browse/SOLR-1105 ?

            Interesting question. Maybe, partly.

            So the hl.queryFieldPattern under SOLR-16111 can be used in a

            <field name="text_indexed_not_stored" type="text" indexed="true" stored="false"/>
            <field name="text_stored_not_indexed" type="text" stored="true" indexed="false"/>
            

            scenario, e.g. if all documents are to be indexed but highlighting, and thus storage, is required only for a subset of documents.

            For a request

            q=text_indexed_not_stored:foo OR another_indexed_text_field:bar
            
            hl.queryFieldPattern=text_indexed_not_stored
            
            hl.fl=text_stored_not_indexed
            

            the foo term (but not the bar term) is to be extracted from the query and any foo within the text_stored_not_indexed field is to be highlighted.

            In this foo/bar scenario the type is text for both fields, i.e. the same, whereas in the multilingual scenario the types differ. Okay, maybe an example would help think it through more:

            <field name="title"    type="text"    stored="true"  indexed="true"/>
            <field name="title_ru" type="text_ru" stored="false" indexed="true"/>
            <field name="title_en" type="text_en" stored="false" indexed="true"/>
            <field name="title_de" type="text_de" stored="false" indexed="true"/>
            

            and

            "document" : {
              "title"    : "hello hallo privyet",
              "title_en" : "hello hallo privyet",
              "title_de" : "hello hallo privyet",
              "title_ru" : "hello hallo privyet"
            }
            

            and

            q=title_en:hello OR title_de:hallo OR title_ru:privyet OR some_other_indexed_field:foobar
            
            hl.queryFieldPattern=title_*
            
            hl.fl=title
            

            as the hypothetical schema, document and query. The terms should be extracted correctly, but when highlighting on the generic title field, would it then depend on the exact analysis chain and search terms whether all of the terms are highlighted correctly?
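
            For anyone who wants to try the multilingual request from the comment above, the parameters can be assembled into a URL-encoded query string as sketched below. The build_highlight_query helper is hypothetical, and adding hl=true is an assumption about what is needed to switch highlighting on; the other parameter names and values are taken from the comment.

```python
from urllib.parse import urlencode


def build_highlight_query(q, query_field_pattern, highlight_fields):
    """Return a URL-encoded query string for the highlighting request."""
    params = [
        ("q", q),
        ("hl", "true"),  # assumption: highlighting must be enabled explicitly
        ("hl.queryFieldPattern", query_field_pattern),
        ("hl.fl", highlight_fields),
    ]
    return urlencode(params)


query_string = build_highlight_query(
    "title_en:hello OR title_de:hallo OR title_ru:privyet "
    "OR some_other_indexed_field:foobar",
    "title_*",
    "title",
)
```

            The resulting string can be appended to a collection's select handler URL; whether every term is then highlighted in the generic title field still depends on the analysis chains, as the comment points out.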

            People

              Assignee: dsmiley David Smiley
              Reporter: dmitry.lihachev Dmitry Lihachev
              Votes: 12
              Watchers: 16
