
[SOLR-234] TrimFilter should update the start and end offsets

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: None
    • Labels: None

      Description

      As implemented, the TrimFilter only trims the token text. It does not update the startOffset and endOffset.

      see:
      http://www.nabble.com/TrimFilter----t.startOffset%28%29%2C-t.endOffset%28%29-tf3728875.html
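
      A quick standalone illustration of the problem (plain Java; the field
      value and offsets below are invented for the example, this is not the
      filter's actual code):

        public class TrimOffsetIllustration {
          public static void main(String[] args) {
            // Suppose a tokenizer produced the token "  foo" covering
            // offsets [0, 5) of the original field value.
            String field = "  foo bar";
            String token = field.substring(0, 5); // "  foo"
            String trimmed = token.trim();        // "foo"

            // The current TrimFilter leaves the offsets at [0, 5), so they
            // no longer line up with the trimmed token text:
            System.out.println(field.substring(0, 5).equals(trimmed)); // false
            // Shifting startOffset past the two trimmed characters
            // restores the correspondence:
            System.out.println(field.substring(2, 5).equals(trimmed)); // true
          }
        }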

      Attachments

      1. SOLR-234-TrimFilterOffsets.patch
         18 kB
         Ryan McKinley
      2. SOLR-234-TrimFilterOffsets.patch
         16 kB
         Ryan McKinley

        Activity

        Ryan McKinley added a comment -

        This patch moves the start and end offsets by how many spaces were eaten on either side.

        This patch also extracts many of the generally useful token testing bits from TestSynonymFilter (and a few others) into a BaseTokenTestCase, and uses this base class rather than duplicating the helper functions.

        Ryan McKinley added a comment -

        Chris points out:

        "in Lucene Token offset information is suppose to reflect exactly where in
        the orriginal stream of date the source of the token was found ... if hte
        token is modified in some way (ie: stemmed, trimmed, etc..) the offsets
        are suppose to remain the same becuase regardless of the token text
        munging, the orriginal location hsa not actually changed."

        I'll move the test refactoring to another issue, as that is still useful.

        Yonik Seeley added a comment -

        Updating the offsets does seem like the right thing to do.

        I imagine using toCharArray() will be slower than using charAt() given that it will allocate a new array, and the number of charAt() calls will be low in the average case because there will only be a small amount of whitespace.

        Isn't it annoying that Java never seems to let you do things as efficiently as the class lib itself...
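
        A minimal standalone sketch of the charAt() approach (the class and
        method names here are invented for illustration, not from the patch):

          public class TrimScanSketch {
            /** Returns {start, end}: the bounds of the trimmed region in txt. */
            static int[] trimmedRegion(String txt) {
              int start = 0;
              int end = txt.length();
              // Walk in from each end counting whitespace instead of copying
              // the text into a new char[] with toCharArray().
              while (start < end && Character.isWhitespace(txt.charAt(start))) start++;
              while (end > start && Character.isWhitespace(txt.charAt(end - 1))) end--;
              return new int[] { start, end };
            }

            public static void main(String[] args) {
              int[] r = trimmedRegion(" foo  ");
              System.out.println(r[0] + "," + r[1]); // prints "1,4"
            }
          }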

        Another issue here is that the position increment isn't maintained.
        And yet another future issue is that any payloads aren't maintained (that's in a newer version of Lucene).
        I'll bring up the latter issue on the lucene list since I think it's a bit of a design flaw.

        Ryan McKinley added a comment -

        > Updating the offsets does seem like the right thing to do.

        My real use case is adding the trim filter to the pattern tokenizer. The 'correct' answer in my case is to update the offsets.

        The case I can imagine leading to something like SOLR-42 is if a token is replaced with something that has leading or trailing spaces.

        Perhaps we could add a parameter to the factory:

        <filter class="solr.TrimFilterFactory" updateOffsets="true" />

        To limit SOLR-42 style errors, the default could be false.
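
        A hypothetical sketch of the argument parsing (plain Java; the class
        name and init() signature mirror the Solr factory pattern, but this
        is not the actual Solr source):

          import java.util.Map;

          public class TrimFilterFactorySketch {
            // Default to false, per the suggestion above, to limit
            // SOLR-42-style highlighting surprises.
            private boolean updateOffsets = false;

            public void init(Map<String, String> args) {
              String v = args.get("updateOffsets");
              if (v != null) {
                updateOffsets = Boolean.valueOf(v).booleanValue();
              }
            }
          }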

        > Isn't it annoying that Java never seems to let you do things as efficiently as the class lib itself...

        Especially for strings!

        Ryan McKinley added a comment -

        Updated to take an "updateOffsets" argument.

        This uses charAt() rather than toCharArray().

        What should happen with the position increment?

        if( start > 0 || end < txt.length() ) {
          int incr = t.getPositionIncrement();
          t = new Token( t.termText().substring( start, end ),
                         t.startOffset()+start, t.endOffset()-endOff, t.type() );
          t.setPositionIncrement( incr ); //+ start ); TODO? what should happen with the offset
        }
        Yonik Seeley added a comment -

        > What should happen with the position increment?
        It should remain the same as the original.

        > The case i can imagine leading to something like SOLR-42 is if a token is replaced with something
        > that has leading or trailing spaces.

        Really whacky, but possible I guess. I don't know of any token filters that would do that, unless someone explicitly used a synonym with spaces at the end. It doesn't make any sense.
        I'd think that updating the offsets is almost always the right thing to do (and should be the default?), given that spaces will almost always come from the field value itself.

        -Yonik

        Hoss Man added a comment -

        > I'd think that updating the offsets is almost always the right thing to do (and should be the default?), given that spaces will
        > almost always come from the field value itself.

        I don't follow your reasoning ... the offsets are supposed to denote where in the original text the Token came from ... a Filter can't make any assumptions about the source of the tokens except the token itself, so I don't see why a Filter would by default assume it can muck with the offsets.

        In Ryan's use case he may want his highlighter-esque code to be able to know how many characters were trimmed off of each end – and I buy that it makes sense for TrimFilter to have an option to relay that info by modifying the offset – but joe random user should be able to expect that by default the offsets of the Tokens his tokenizer produces won't be modified ... I would personally think it's a bug to get the behavior Ryan describes out of a highlighter if I knew that my tokenizer was only splitting on punctuation.

        Ryan McKinley added a comment -

        > ... a Filter can't make any assumptions about the source of the tokens except the token itself ...

        I get the basic pattern now: Tokenizers determine the start/end offsets and Filters just transform the text along the way.

        > In Ryan's use case he may want his highlighter-esque code to be able to know ...

        I am fine with either:

        1. Leave the TrimFilter as is and do the highlighter-esque code on the highlighting side.

        2. Add an optional updateOffsets="true" param, with the default set to "false"

        Yonik Seeley added a comment -

        Offsets point back to the original field value for a particular token... and to me, it's a semantic contract (point to what makes sense in the source). It's not limited to the offsets generated by the Tokenizer... Analyzers don't have to use Tokenizers and TokenFilters at all.

        As an example, WordDelimiterFilter modifies offsets when it splits words, and that makes sense to me.

        Another way to think about it is that there is more than one way to solve a problem (construct an analyzer).
        What matters is the tokens that come out the end... not if I did
        a) a tokenizer that split on something followed by a filter that trimmed
        vs
        b) a tokenizer that managed to split on something including discarding the whitespace

        For this specific case, I think it comes down to the likely use cases for the filter, and an argument could be made either way. I'm fine with either as this is a very minor issue.

        Hoss Man added a comment -

        Yeah ... I'm not saying there aren't good use cases both ways – it definitely makes sense to have an option – I'm just saying that as a general rule TokenFilters shouldn't be munging offsets ... I don't see a big difference between TrimFilter and StemmingFilter (where the stem of "foo " and "foo " is "foo"). So the option should default to off.

        Ryan McKinley added a comment -

        Unless there are objections, I'd like to add "updateOffsets" as an option to the TrimFilter.

        By default it will not modify the offsets.

        Depending on how the Tokenizer+Analyzer stream is configured, updating the offsets may or may not make sense, so the option seems reasonable.

        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.2

        The Fix Version for all 39 issues found was set to 1.2, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman2


          People

          • Assignee: Ryan McKinley
          • Reporter: Ryan McKinley