Solr
  1. Solr
  2. SOLR-1710

convert worddelimiterfilter to new tokenstream API

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      This one was a doozy, attached is a patch to convert it to the new tokenstream API.

      Some of the logic was split into WordDelimiterIterator (exposes a BreakIterator-like api for iterating subwords)
      the filter is much more efficient now, no cloning.

      before applying the patch, copy the existing WordDelimiterFilter to OriginalWordDelimiterFilter
      the patch includes a testcase (TestWordDelimiterBWComp) which generates random strings from various subword combinations.
      For each random string, it compares output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters.

      NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these combinations. The bugs discovered in SOLR-1706 are fixed here.

      1. SOLR-1710.patch
        48 kB
        Robert Muir
      2. SOLR-1710.patch
        48 kB
        Robert Muir
      3. SOLR-1710-readable.patch
        59 kB
        Chris Male
      4. SOLR-1710-readable.patch
        59 kB
        Chris Male

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          for the 'wdf is only modifying single word with punctuation', don't clearAttributes() if its the first token, even though its modified... unless preserveOriginal is on (in this case the preserved original contained the attributes already, and we must clear).

          this is a little confusing since the behavior for custom attributes depends on this preserveOriginal value, but i think it makes sense.

          Show
          Robert Muir added a comment - for the 'wdf is only modifying single word with punctuation', don't clearAttributes() if its the first token, even though its modified... unless preserveOriginal is on (in this case the preserved original contained the attributes already, and we must clear). this is a little confusing since the behavior for custom attributes depends on this preserveOriginal value, but i think it makes sense.
          Hide
          Yonik Seeley added a comment -

          For each random string, it compares output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters

          Whew... nice thorough work.

          Show
          Yonik Seeley added a comment - For each random string, it compares output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters Whew... nice thorough work.
          Hide
          Robert Muir added a comment -

          Yonik, thanks. Again i have a hesitation: the SOLR-1706 problem.

          If i could fix this bug in the original code, i would be able to enable the problematic combinations in backwards testing:

          • catenateNumbers != catenateWords
          • generateWordParts != generateNumberParts

          I was unable to figure this one out though, so excluding these from the test makes me a little nervous... what is there to do?

          Show
          Robert Muir added a comment - Yonik, thanks. Again i have a hesitation: the SOLR-1706 problem. If i could fix this bug in the original code, i would be able to enable the problematic combinations in backwards testing: catenateNumbers != catenateWords generateWordParts != generateNumberParts I was unable to figure this one out though, so excluding these from the test makes me a little nervous... what is there to do?
          Hide
          Chris Male added a comment -

          Hi,

          I notice in the patch that it references OriginalWordDelimiterFilter in TestWordDelimiterBWComp. Is this an error?

          Cheers

          Show
          Chris Male added a comment - Hi, I notice in the patch that it references OriginalWordDelimiterFilter in TestWordDelimiterBWComp. Is this an error? Cheers
          Hide
          Robert Muir added a comment -

          Chris, not really, if you see the description i say:
          before applying the patch, rename the existing WordDelimiterFilter to OriginalWordDelimiterFilter

          I guess this should say instead: make a copy of... I will fix.

          obviously OriginalWordDelimiterFilter should not be committed, nor this random test that compares results against it.

          but for now its convenient while working the issue to simply blast random strings against the old filter for testing.

          Show
          Robert Muir added a comment - Chris, not really, if you see the description i say: before applying the patch, rename the existing WordDelimiterFilter to OriginalWordDelimiterFilter I guess this should say instead: make a copy of... I will fix. obviously OriginalWordDelimiterFilter should not be committed, nor this random test that compares results against it. but for now its convenient while working the issue to simply blast random strings against the old filter for testing.
          Hide
          Chris Male added a comment -

          Ah right, sorry missed that description.

          Show
          Chris Male added a comment - Ah right, sorry missed that description.
          Hide
          Robert Muir added a comment -

          Chris, no problem, I created this confusion until the patch is OK'ed.

          once this happens, i can include some additional testcases that I had problems with.
          i have all 7 revisions i made of this filter locally so i can see which scenarios fail on each previous iteration, I think these are good tests.

          Show
          Robert Muir added a comment - Chris, no problem, I created this confusion until the patch is OK'ed. once this happens, i can include some additional testcases that I had problems with. i have all 7 revisions i made of this filter locally so i can see which scenarios fail on each previous iteration, I think these are good tests.
          Hide
          Chris Male added a comment -

          I am working with this patch with the goal of simplifying its logic and increasing readability. Seems great thus far though.

          Show
          Chris Male added a comment - I am working with this patch with the goal of simplifying its logic and increasing readability. Seems great thus far though.
          Hide
          Robert Muir added a comment -

          thanks in advance chris, I will help with testing and benchmarking anything you can do.
          I think i may have taken it as far as I can go, my head almost exploded.

          Show
          Robert Muir added a comment - thanks in advance chris, I will help with testing and benchmarking anything you can do. I think i may have taken it as far as I can go, my head almost exploded.
          Hide
          Chris Male added a comment -

          Just wondering what the return type of WordDelimiterIterator#next() supposed to indicate? I see that it either returns the end index, or DONE but this value never seems to be used by the filter. Does it have a role?

          Show
          Chris Male added a comment - Just wondering what the return type of WordDelimiterIterator#next() supposed to indicate? I see that it either returns the end index, or DONE but this value never seems to be used by the filter. Does it have a role?
          Hide
          Robert Muir added a comment -

          chris yeah, its supposed to be similar to http://java.sun.com/j2se/1.4.2/docs/api/java/text/BreakIterator.html#next%28%29

          i started by mimicing this api somewhat, i guess a future improvement would be if somehow this truly was a real BreakIterator.
          Then say, you could create a RuleBasedBreakIterator or DictionaryBasedBreakIterator (which are fast compiled DFAs), and customize how words are delimited.
          currently, you can only do this with by customizing the charTypeTable, which cannot take any context into account, so its rather limited.

          all of the above is really just theoretical and not anything we should worry about, for practical purposes i mimiced BreakIterator api (but diverged somewhat), just because I am used to working with it and found it was one way to separate a lot of the logic.

          Show
          Robert Muir added a comment - chris yeah, its supposed to be similar to http://java.sun.com/j2se/1.4.2/docs/api/java/text/BreakIterator.html#next%28%29 i started by mimicing this api somewhat, i guess a future improvement would be if somehow this truly was a real BreakIterator. Then say, you could create a RuleBasedBreakIterator or DictionaryBasedBreakIterator (which are fast compiled DFAs), and customize how words are delimited. currently, you can only do this with by customizing the charTypeTable, which cannot take any context into account, so its rather limited. all of the above is really just theoretical and not anything we should worry about, for practical purposes i mimiced BreakIterator api (but diverged somewhat), just because I am used to working with it and found it was one way to separate a lot of the logic.
          Hide
          Chris Male added a comment -

          Attaching a first pass at improving the readability of this code.

          Focused mostly on breaking up #incrementToken, extracting common behavior into helper methods, documenting each method, putting fields in a consistent place, trimming if else statement blocks etc etc.

          I imagine there might be a small performance improvement due to these improvements, but they could have all been done by the compiler too.

          Show
          Chris Male added a comment - Attaching a first pass at improving the readability of this code. Focused mostly on breaking up #incrementToken, extracting common behavior into helper methods, documenting each method, putting fields in a consistent place, trimming if else statement blocks etc etc. I imagine there might be a small performance improvement due to these improvements, but they could have all been done by the compiler too.
          Hide
          Chris Male added a comment -

          Updated patch with method name changes. doXYZ is now shouldXYZ and writeClear is now writeAndClear

          Show
          Chris Male added a comment - Updated patch with method name changes. doXYZ is now shouldXYZ and writeClear is now writeAndClear
          Hide
          Robert Muir added a comment -

          This was resolved in revision 922957.

          Show
          Robert Muir added a comment - This was resolved in revision 922957.
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1.0 release

            People

            • Assignee:
              Mark Miller
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development