Solr
  1. Solr
  2. SOLR-37

Add additional configuration options for Highlighting

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: search
    • Labels:
      None

      Description

      As discussed in the mailing list, I've been looking at adding additional configuration options for highlighting.
      I've made quite a few changes to the properties for highlighting:

      Properties that can be set on request, or in solrconfig.xml at the top level:
      highlight (true/false)
      highlightFields
      Properties that can be set in solrconfig.xml at the top level or per-field
      formatter (simple/gradient)
      formatterPre (preTag for simple formatter)
      formatterPost (postTag for simple formatter)
      formatterMinFgCl (min foreground colour for gradient formatter)
      formatterMaxFgCl (max foreground colour for gradient formatter)
      formatterMinBgCl (min background colour for gradient formatter)
      formatterMaxBgCl (max background colour for gradient formatter)
      fragsize (if <=0 use NullFragmenter, otherwise use GapFragmenter with this value)

      I've added variables for these values to CommonParams, plus there's a fields Map<String,CommonParams> that is parsed from nested NamedLists (i.e. a lst named "fields", with a nested lst for each field).

      Here's a sample of how you can mix and match properties in solrconfig.xml:

      <requestHandler name="hl" class="solr.StandardRequestHandler" >
      <str name="formatter">simple</str>
      <str name="formatterPre"><i></str>
      <str name="formatterPost"></i></str>
      <str name="highlightFields">title,authors,journal</str>
      <int name="fragsize">0</int>
      <lst name="fields">
      <lst name="abstract">
      <str name="formatter">gradient</str>
      <str name="formatterMinBgCl">#FFFF99</str>
      <str name="formatterMaxBgCl">#FF9900</str>
      <int name="fragsize">30</int>
      <int name="maxSnippets">2</int>
      </lst>
      <lst name="authors">
      <str name="formatterPre"><strong></str>
      <str name="formatterPost"></strong></str>
      </lst>
      </lst>
      </requestHandler>

      I've created HighlightingUtils to handle most of the parameter parsing, but the hightlighting is still done in SolrPluginUtils and the doStandardHighlighting() method still has the same signature, but the other highlighting methods have had to be changed (because highlighters are now created per highlighted field).

      I'm not particularly happy with the code to pull parameters from CommonParams, first checking the field then falling back, e.g.:
      String pre = (params.fields.containsKey(fieldName) && params.fields.get(fieldName).formatterPre != null) ?
      params.fields.get(fieldName).formatterPre :
      params.formatterPre != null ? params.formatterPre : "<em>";

      I've removed support for a custom formatter - just choosing between simple/gradient. Probably that's a bad decision, but I wanted an easy way to choose between the standard formatters without having to invent a generic way of supplying arguments for the constructor. Perhaps there should be formatterType=simple/gradient and formatterClass=... which overrides formatterType if set at a lower level - with the formatterClass having to have a zero-args constructor? Note: gradient is actually SpanGradientFormatter.

      I'm not sure I properly understand how Fragmenters work, so supplying fragsize to GapFragmenter where >0 (instead of what was a default of 50) may not make sense.

      1. patch
        24 kB
        Andrew May
      2. patch
        23 kB
        Andrew May
      3. patch.diff
        35 kB
        Andrew May
      4. patch.diff
        34 kB
        Andrew May
      5. patch.diff
        35 kB
        Andrew May

        Activity

        Hide
        Andrew May added a comment -

        Changes to CommonParams, SolrPluginUtils, plus new HighlightingUtils

        Show
        Andrew May added a comment - Changes to CommonParams, SolrPluginUtils, plus new HighlightingUtils
        Hide
        Andrew May added a comment -

        New patch to configure Highlighting using new SolrParams.

        Parameters:
        hl - turn highlighting on/off
        hl.fl - list of fields to be highlighted, either as a single parameter (e.g. hl.fl=title,keywords) or multiple parameters (hl.fl=title&hl.fl=keywords)
        hl.snippets - maximum number of highlight snippets (default = 1)
        hl.fragsize - fragment size for highlighting, 0 -> NullFragmenter (default = 50)
        hl.formatter - value of either simple or gradient (default = simple)
        hl.simple.pre - simple formatter pre tag (default = <em>)
        hl.simple.post - simple formatter post tag (default = </em>)
        hl.gradient.minFg - gradient formatter min foreground colour
        hl.gradient.maxFg - gradient formatter max foreground colour
        hl.gradient.minBg - gradient formatter min background colour
        hl.gradient.maxBg - gradient formatter max background colour

        All values appart from hl & hl.fl can be specified per field. e.g. f.title.hl.fragsize=0.

        All the highlighting code is now in HighlighingUtils rather than SolrPluginUtils. Seems like I've ended up with one big doHighlighting method that does all the work - not sure that's a good thing, but things ended up this way when I started creating highlighters per field.

        Show
        Andrew May added a comment - New patch to configure Highlighting using new SolrParams. Parameters: hl - turn highlighting on/off hl.fl - list of fields to be highlighted, either as a single parameter (e.g. hl.fl=title,keywords) or multiple parameters (hl.fl=title&hl.fl=keywords) hl.snippets - maximum number of highlight snippets (default = 1) hl.fragsize - fragment size for highlighting, 0 -> NullFragmenter (default = 50) hl.formatter - value of either simple or gradient (default = simple) hl.simple.pre - simple formatter pre tag (default = <em>) hl.simple.post - simple formatter post tag (default = </em>) hl.gradient.minFg - gradient formatter min foreground colour hl.gradient.maxFg - gradient formatter max foreground colour hl.gradient.minBg - gradient formatter min background colour hl.gradient.maxBg - gradient formatter max background colour All values appart from hl & hl.fl can be specified per field. e.g. f.title.hl.fragsize=0. All the highlighting code is now in HighlighingUtils rather than SolrPluginUtils. Seems like I've ended up with one big doHighlighting method that does all the work - not sure that's a good thing, but things ended up this way when I started creating highlighters per field.
        Hide
        Yonik Seeley added a comment -

        ping test (4:24 EDT, 8/24/2006)

        Show
        Yonik Seeley added a comment - ping test (4:24 EDT, 8/24/2006)
        Hide
        Mike Klaas added a comment -

        Thanks for the patch!

        A few comments:

        1) The patch uses absolute paths, which makes it difficult to apply. Please generate patches using 'svn diff > mypatch.diff' at the root level of a solr trunk checkout

        2) I don't believe that it is necessary to add the constructors to GrapFragmenter--the existing constructors from SimpleFragmenter are equivalent.

        3) The default FRAGSIZE should be 100 to conform to Lucene's Highlighter default (it would be nice if this was exposed so we could use it)

        4) Might it be worth providing sensible defaults for the gradient values so users can try hl.formatter=gradient without futher configuration?

        5) There are a few constuctions of this form:
        + String param = getParams(request).getFieldParam(fieldName, SNIPPETS);
        + //the default of zero should never be used, as SNIPPETS is defined in the defaults
        + return (param != null) ? Integer.parseInt(param) : 0;

        where getParams returns a SolrParams instance which already has defaults for this parameter. Surely providing a default is unnecessary, and shouldn't null also be impossible due to the DefaultSolrParams construction? Inlining these (in the calling method) to something like

        SolrParams p = getParams(request)
        int maxSnippets = Integer.parseInt(p.getFieldParam(fieldName));

        would be cleaner and save some object construction costs.

        6) The last time this patch was discussed it was mentioned that there are tradeoffs in using field-specific idf values for highlighting scoring. One downside is that they must be read. Another is terms are only highlit in fields they are searched in, which may be desirable behaviour, but limits the usefulness of the hl.fl parameter. I'm not sure what the best approach is.

        7) The attached patch contains no tests, and further, though I have not applied the patch due to 1), I'm skeptical that the testsuite passes since the parameter names weren't changed in src/test/o/a/s/HighlighterTest.java. The latter is easy enough to fix, but new test should be included before this patch is committed.

        Thanks again for the patch!

        Show
        Mike Klaas added a comment - Thanks for the patch! A few comments: 1) The patch uses absolute paths, which makes it difficult to apply. Please generate patches using 'svn diff > mypatch.diff' at the root level of a solr trunk checkout 2) I don't believe that it is necessary to add the constructors to GrapFragmenter--the existing constructors from SimpleFragmenter are equivalent. 3) The default FRAGSIZE should be 100 to conform to Lucene's Highlighter default (it would be nice if this was exposed so we could use it) 4) Might it be worth providing sensible defaults for the gradient values so users can try hl.formatter=gradient without futher configuration? 5) There are a few constuctions of this form: + String param = getParams(request).getFieldParam(fieldName, SNIPPETS); + //the default of zero should never be used, as SNIPPETS is defined in the defaults + return (param != null) ? Integer.parseInt(param) : 0; where getParams returns a SolrParams instance which already has defaults for this parameter. Surely providing a default is unnecessary, and shouldn't null also be impossible due to the DefaultSolrParams construction? Inlining these (in the calling method) to something like SolrParams p = getParams(request) int maxSnippets = Integer.parseInt(p.getFieldParam(fieldName)); would be cleaner and save some object construction costs. 6) The last time this patch was discussed it was mentioned that there are tradeoffs in using field-specific idf values for highlighting scoring. One downside is that they must be read. Another is terms are only highlit in fields they are searched in, which may be desirable behaviour, but limits the usefulness of the hl.fl parameter. I'm not sure what the best approach is. 7) The attached patch contains no tests, and further, though I have not applied the patch due to 1), I'm skeptical that the testsuite passes since the parameter names weren't changed in src/test/o/a/s/HighlighterTest.java. The latter is easy enough to fix, but new test should be included before this patch is committed. Thanks again for the patch!
        Hide
        Andrew May added a comment -

        Mike,
        1) Was using SVN support in Eclipse and it doesn't let you control this - I've installed the command line version now.
        2) If GapFragmenter just has a default constructor it's not possible to pass the fragsize (constructors not inherited)?
        3) My mistake - I think this is a legacy of me originally confusing fragsize and GapFragmenter's increment threshold.
        4) Perhaps - I wasn't sure what sensible defaults were, and I can't seem to get the gradient fragmenter to do anything useful - but I can't see an obvious bug in my code to construct it.
        5) That was me being overly defensive, and I've changed it now. I considered adding methods like getFieldInt(), getFieldBool to SolrParams, so there were field versions of all the non-field methods - but decided against it.
        6) I was wondering whether a hl.exact flag might be useful - if false (the default?) the QueryScorer wouldn't be created with the field/maxTermWeight. I'm not sure there's any point using the gradient formatter unless you supply the field name/maxTermWeight, so that might ignore this setting.
        7) Sorry - will fix and add additional tests.

        One thing I don't like about HighlightingUtils as it stands is the lack of state. When highlighting multiple fields, getParams() is called many times, each time constructing a DefaultSolrParams (although I don't think there's a big overhead to doing this). If we're not specifying the field when creating the QueryScorer then this could be reused for multiple fields. We could possibly re-use the highlighter instance as well.

        Show
        Andrew May added a comment - Mike, 1) Was using SVN support in Eclipse and it doesn't let you control this - I've installed the command line version now. 2) If GapFragmenter just has a default constructor it's not possible to pass the fragsize (constructors not inherited)? 3) My mistake - I think this is a legacy of me originally confusing fragsize and GapFragmenter's increment threshold. 4) Perhaps - I wasn't sure what sensible defaults were, and I can't seem to get the gradient fragmenter to do anything useful - but I can't see an obvious bug in my code to construct it. 5) That was me being overly defensive, and I've changed it now. I considered adding methods like getFieldInt(), getFieldBool to SolrParams, so there were field versions of all the non-field methods - but decided against it. 6) I was wondering whether a hl.exact flag might be useful - if false (the default?) the QueryScorer wouldn't be created with the field/maxTermWeight. I'm not sure there's any point using the gradient formatter unless you supply the field name/maxTermWeight, so that might ignore this setting. 7) Sorry - will fix and add additional tests. One thing I don't like about HighlightingUtils as it stands is the lack of state. When highlighting multiple fields, getParams() is called many times, each time constructing a DefaultSolrParams (although I don't think there's a big overhead to doing this). If we're not specifying the field when creating the QueryScorer then this could be reused for multiple fields. We could possibly re-use the highlighter instance as well.
        Hide
        Mike Klaas added a comment -

        Hi Andrew,

        1) 3) 5) 7) thanks!
        2) You're right--I'm head is sometimes too steeped in pythonland.
        4) If no-one is using the gradient formatter, perhaps we shouldn't include it by default? There may be more intuitive ways of designing this feature if we have the benefit of a real world use case.
        6) "exact" could mean many things in the context of highlighting. Perhaps something like

        hl.scoring=simple (default)
        hl.scoring=fieldidf (per-field idf weights)

        Show
        Mike Klaas added a comment - Hi Andrew, 1) 3) 5) 7) thanks! 2) You're right--I'm head is sometimes too steeped in pythonland. 4) If no-one is using the gradient formatter, perhaps we shouldn't include it by default? There may be more intuitive ways of designing this feature if we have the benefit of a real world use case. 6) "exact" could mean many things in the context of highlighting. Perhaps something like hl.scoring=simple (default) hl.scoring=fieldidf (per-field idf weights)
        Hide
        Andrew May added a comment -

        New patch incorporating feedback (hopefully the diff is more usable this time).

        • default fragsize now 100
        • removed redundant defaults when getting fragsize and snippets
        • fixed tests and added new tests
        • added "Enable Highlighting" and "Fields to Highlight" to the advanced form in the admin pages

        The other change, which is more complex is to add a new "hl.exact" parameter (which defaults to false) which affects how the QueryScorer is created. The logic is now this:

        if using gradient formatter
        new QueryScorer(query, indexReader, fieldName)
        else if hl.exact=true
        new QueryScorer(query, fieldName)
        else
        new QueryScorer(query)

        My understanding is that the GradientFormatter requires the scorer to be created with IndexReader and field name to work properly, so using a gradient formatter for any field overrides the hl.exact flag.
        I've assumed that it's more efficient to create a QueryScorer that doesn't use an IndexReader in the case of hl.exact=true. If not then that could be rolled in with the gradient formatter case.
        Then the default behaviour is to create a QueryScorer without the field name and have less exact highlighting.

        Does that sound like reasonable behaviour?

        Ah - looks like I'm overlapping with a comment from Mike. I'm suggesting 'exact' because of how adding the fieldname to the QueryScorer affects searches across multiple fields - basically for what I'm using it for I don't want a value I searched for in the journal field appearing in the highlight for the title field (which was searched with something different) - so I would want hl.exact=true. But you're right - this is probably an overly broad term, and it is all about the scorer.

        As for removing the gradient highlighter - I still don't really know how it's supposed to work, and I can't get it to do anything useful when searching my data, but perhaps that's my configuration error (rather than a coding error in this patch). I'll probably end up using the simple formatter with custom pre and post values.

        Show
        Andrew May added a comment - New patch incorporating feedback (hopefully the diff is more usable this time). default fragsize now 100 removed redundant defaults when getting fragsize and snippets fixed tests and added new tests added "Enable Highlighting" and "Fields to Highlight" to the advanced form in the admin pages The other change, which is more complex is to add a new "hl.exact" parameter (which defaults to false) which affects how the QueryScorer is created. The logic is now this: if using gradient formatter new QueryScorer(query, indexReader, fieldName) else if hl.exact=true new QueryScorer(query, fieldName) else new QueryScorer(query) My understanding is that the GradientFormatter requires the scorer to be created with IndexReader and field name to work properly, so using a gradient formatter for any field overrides the hl.exact flag. I've assumed that it's more efficient to create a QueryScorer that doesn't use an IndexReader in the case of hl.exact=true. If not then that could be rolled in with the gradient formatter case. Then the default behaviour is to create a QueryScorer without the field name and have less exact highlighting. Does that sound like reasonable behaviour? Ah - looks like I'm overlapping with a comment from Mike. I'm suggesting 'exact' because of how adding the fieldname to the QueryScorer affects searches across multiple fields - basically for what I'm using it for I don't want a value I searched for in the journal field appearing in the highlight for the title field (which was searched with something different) - so I would want hl.exact=true. But you're right - this is probably an overly broad term, and it is all about the scorer. As for removing the gradient highlighter - I still don't really know how it's supposed to work, and I can't get it to do anything useful when searching my data, but perhaps that's my configuration error (rather than a coding error in this patch). I'll probably end up using the simple formatter with custom pre and post values.
        Hide
        Yonik Seeley added a comment -

        JIRA ping test (11:54 EDT, 8/26/2006)

        Show
        Yonik Seeley added a comment - JIRA ping test (11:54 EDT, 8/26/2006)
        Hide
        Mike Klaas added a comment -

        Does anyone else have comments on the two issues I raised, namely:

        1) should we include support for gradient formatting? (My opinion is "no", since no-one uses it and Andrew can't get it to work with his data so we have no functionality tests)

        2) the "exact" parameter. I'm uncomfortable about the name and the fact that setting a formatter (which feels to the user like a cosmetic setting) can influence core highlighting functionality.

        Once these items are resolved and re-reviewed I am in favour of committing the patch.

        Show
        Mike Klaas added a comment - Does anyone else have comments on the two issues I raised, namely: 1) should we include support for gradient formatting? (My opinion is "no", since no-one uses it and Andrew can't get it to work with his data so we have no functionality tests) 2) the "exact" parameter. I'm uncomfortable about the name and the fact that setting a formatter (which feels to the user like a cosmetic setting) can influence core highlighting functionality. Once these items are resolved and re-reviewed I am in favour of committing the patch.
        Hide
        Andrew May added a comment -

        I've spent a bit of time trying to understand Gradient formatting and how QueryScorer is used. As I didn't see any very good documentation for this (I may have missed it) - I thought I'd share.

        It appears that GradientFormatter colours according to the term's weight within the index - so terms that appear less frequently in the index will be coloured closer to the max foreground/background colour. So, the colour is not related to the specific document or fragment being evaluated and that term will be highlighted the same for the entire results set. If two terms appear with a similar frequency in the index they will have similar colours - and this seems to happen a lot (perhaps because scaling is done between 0 and maxWeight rather than minWeight and maxWeight).

        There's also a fairly serious bug in the colouring that makes a lot of combinations give meaningless results (e.g. minBg=#FF0000, maxBg=#00FF00 will give results coloured #FFFF00) - see GradientFormatter.getColorVal().

        In other words, I now agree with Mike that we should not support Gradient formatting. Perhaps we still want to retain the hl.formatter= parameter in case we have any other values than "simple" in the future - and keep hl.simple.pre and hl.simple.post as they are.

        As for the QueryScorer, I think it makes sense to support all three ways it can be construted:
        1) hl.scoring=simple (the default) - construct with Query only. May have some matches from other terms, but allows you to highlight different fields to the ones searched.
        2) hl.scoring=field - constructed with Query and fieldName. Only highlights terms matched in this field by the query.
        3) hl.scoring=fieldidx - constructed with Query, fieldName and IndexReader. I think the selection of the best fragment(s) will be improved because the terms will be weighted according to their frequency in the index - but this has to be more costly as it calls IndexReader.docFreq for each term.

        Does that sound reasonable?

        Show
        Andrew May added a comment - I've spent a bit of time trying to understand Gradient formatting and how QueryScorer is used. As I didn't see any very good documentation for this (I may have missed it) - I thought I'd share. It appears that GradientFormatter colours according to the term's weight within the index - so terms that appear less frequently in the index will be coloured closer to the max foreground/background colour. So, the colour is not related to the specific document or fragment being evaluated and that term will be highlighted the same for the entire results set. If two terms appear with a similar frequency in the index they will have similar colours - and this seems to happen a lot (perhaps because scaling is done between 0 and maxWeight rather than minWeight and maxWeight). There's also a fairly serious bug in the colouring that makes a lot of combinations give meaningless results (e.g. minBg=#FF0000, maxBg=#00FF00 will give results coloured #FFFF00) - see GradientFormatter.getColorVal(). In other words, I now agree with Mike that we should not support Gradient formatting. Perhaps we still want to retain the hl.formatter= parameter in case we have any other values than "simple" in the future - and keep hl.simple.pre and hl.simple.post as they are. As for the QueryScorer, I think it makes sense to support all three ways it can be construted: 1) hl.scoring=simple (the default) - construct with Query only. May have some matches from other terms, but allows you to highlight different fields to the ones searched. 2) hl.scoring=field - constructed with Query and fieldName. Only highlights terms matched in this field by the query. 3) hl.scoring=fieldidx - constructed with Query, fieldName and IndexReader. I think the selection of the best fragment(s) will be improved because the terms will be weighted according to their frequency in the index - but this has to be more costly as it calls IndexReader.docFreq for each term. Does that sound reasonable?
        Hide
        Andrew May added a comment -

        New patch that removes support for gradient formatter, and uses hl.scoring parameter instead of hl.exact to control how the QueryScorer is created.

        Show
        Andrew May added a comment - New patch that removes support for gradient formatter, and uses hl.scoring parameter instead of hl.exact to control how the QueryScorer is created.
        Hide
        Andrew May added a comment -

        I guess I need to number my patch files as they don't seem to get listed in the order they were added and I can't remove old ones. The new patch appears to be number 3.

        Show
        Andrew May added a comment - I guess I need to number my patch files as they don't seem to get listed in the order they were added and I can't remove old ones. The new patch appears to be number 3.
        Hide
        Andrew May added a comment -

        New patch.diff - removed hl.scoring=simple/field/fieldidx parameter and added hl.requireFieldMatch=true/false.

        Also added two getFieldBool methods to SolrParams - so that field parameters can use the same parseBool method (which also evaluates yes/on to true).

        Show
        Andrew May added a comment - New patch.diff - removed hl.scoring=simple/field/fieldidx parameter and added hl.requireFieldMatch=true/false. Also added two getFieldBool methods to SolrParams - so that field parameters can use the same parseBool method (which also evaluates yes/on to true).
        Hide
        Mike Klaas added a comment -

        Commited r439428

        Show
        Mike Klaas added a comment - Commited r439428
        Hide
        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.1

        The Fix Version for all 38 issues found was set to 1.1, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman3

        Show
        Hoss Man added a comment - This bug was modified as part of a bulk update using the criteria... Marked ("Resolved" or "Closed") and "Fixed" Had no "Fix Version" versions Was listed in the CHANGES.txt for 1.1 The Fix Version for all 38 issues found was set to 1.1, email notification was suppressed to prevent excessive email. For a list of all the issues modified, search jira comments for this (hopefully) unique string: 20080415hossman3

          People

          • Assignee:
            Mike Klaas
            Reporter:
            Andrew May
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development