Solr / SOLR-1657

Convert the rest of Solr to use the new TokenStream API

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      org.apache.solr.analysis:
      BufferedTokenStream
        - CommonGramsFilter
        - CommonGramsQueryFilter
        - RemoveDuplicatesTokenFilter
      CapitalizationFilterFactory
      HyphenatedWordsFilter
      LengthFilter (deprecated, remove)
      SynonymFilter
      SynonymFilterFactory
      WordDelimiterFilter

      org.apache.solr.handler:
      AnalysisRequestHandler
      AnalysisRequestHandlerBase

      org.apache.solr.handler.component:
      QueryElevationComponent
      SpellCheckComponent

      org.apache.solr.highlight:
      DefaultSolrHighlighter

      org.apache.solr.spelling:
      SpellingQueryConverter
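
      As a rough sketch of what "convert to the new TokenStream API" means for the classes above: the deprecated next()/next(Token) methods are replaced by incrementToken(), and a filter declares the attributes it reads and writes up front. The class below is illustrative only (a trivial lowercasing filter, using the 3.x-era TermAttribute rather than the later CharTermAttribute), not code from this issue.

      {code:java}
      import java.io.IOException;

      import org.apache.lucene.analysis.TokenFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.tokenattributes.TermAttribute;

      public final class ExampleLowerCaseFilter extends TokenFilter {
        // new-API style: declare the attributes this filter uses once, up front
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);

        public ExampleLowerCaseFilter(TokenStream input) {
          super(input);
        }

        // replaces the deprecated next()/next(Token): mutate the shared attribute
        // state in place and return false when the wrapped stream is exhausted
        @Override
        public boolean incrementToken() throws IOException {
          if (!input.incrementToken()) {
            return false;
          }
          final char[] buffer = termAtt.termBuffer();
          final int length = termAtt.termLength();
          for (int i = 0; i < length; i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
          }
          return true;
        }
      }
      {code}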

      1. SOLR-1657_part2.patch
        22 kB
        Robert Muir
      2. SOLR-1657_synonyms_ugly_slightly_less_slow.patch
        11 kB
        Robert Muir
      3. SOLR-1657_synonyms_ugly_slow.patch
        11 kB
        Robert Muir
      4. SOLR-1657.patch
        32 kB
        Robert Muir
      5. SOLR-1657.patch
        27 kB
        Chris Male
      6. SOLR-1657.patch
        22 kB
        Robert Muir
      7. SOLR-1657.patch
        18 kB
        Robert Muir


          Activity

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Robert Muir added a comment -

          This was resolved in revision 922957.

          Robert Muir added a comment -

          Attached is a less slow version of the above.
          It preserves the fast path from the previous code.
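
          Purely to illustrate the fast-path idea (this is not the attached patch), a toy single-word synonym filter in the new API might look like the sketch below: tokens with no mapping are returned immediately with no State allocation, and only matching tokens pay for captureState()/restoreState(). The single-word synonyms map is an assumption for the sketch; the real SynonymFilter handles multi-token matches and is far more involved.

          {code:java}
          import java.io.IOException;
          import java.util.Map;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.analysis.tokenattributes.TermAttribute;
          import org.apache.lucene.util.AttributeSource;

          public final class SimpleSynonymFilter extends TokenFilter {
            private final Map<String, String> synonyms;
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);
            private final PositionIncrementAttribute posIncAtt =
                addAttribute(PositionIncrementAttribute.class);

            private AttributeSource.State savedState; // original token, if a synonym is pending
            private String pendingSynonym;

            public SimpleSynonymFilter(TokenStream input, Map<String, String> synonyms) {
              super(input);
              this.synonyms = synonyms;
            }

            @Override
            public boolean incrementToken() throws IOException {
              if (pendingSynonym != null) {
                // slow path, part 2: emit the synonym stacked on the original token
                restoreState(savedState);
                termAtt.setTermBuffer(pendingSynonym);
                posIncAtt.setPositionIncrement(0);
                pendingSynonym = null;
                return true;
              }
              if (!input.incrementToken()) {
                return false;
              }
              String synonym = synonyms.get(termAtt.term());
              if (synonym == null) {
                return true; // fast path: no buffering, no State allocation
              }
              // slow path, part 1: remember the original token's full attribute state
              savedState = captureState();
              pendingSynonym = synonym;
              return true;
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              savedState = null;
              pendingSynonym = null;
            }
          }
          {code}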

          Robert Muir added a comment -

          A very very ugly, very slow, but simple and conservative conversion of SynonymFilter to the new TokenStream API.

          Robert Muir added a comment -

          Here's a separate patch (_part2.patch) for all the remaining tokenstreams.

          The only one remaining now is SynonymFilter.

          For several areas in this patch, I didn't properly change any APIs to fully
          support the new Attributes-based API; I just got them off deprecated methods,
          still working with Token, and left TODOs.

          I figure it would be better to hash this out later on separate issues, where
          we modify those APIs to really take advantage of an Attributes-based API.

          Uwe Schindler added a comment -

          This may be interesting for SynonymTokenFilter: LUCENE-2314

          Robert Muir added a comment -

          Chris's patch, except it also "implements" BufferedTokenStream. It's marked deprecated, its API cannot support custom attributes (so the six attributes carried by Token are simply copied into Tokens and back), and it's unused in Solr with this patch.
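
          For context, "copied into Tokens and back" refers to round-tripping through the legacy Token class, which only carries those six attributes (term, start/end offset, type, position increment, flags, payload). A hypothetical adapter in the 3.x-era API might look like the sketch below; this is illustrative only, not the patch code.

          {code:java}
          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.*;

          abstract class TokenCopyingFilter extends TokenFilter {
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);
            private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
            private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
            private final PositionIncrementAttribute posIncAtt =
                addAttribute(PositionIncrementAttribute.class);
            private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);
            private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

            protected TokenCopyingFilter(TokenStream input) {
              super(input);
            }

            /** Copy the current attribute state into a legacy Token; custom attributes are lost. */
            protected Token currentToken() {
              Token t = new Token(termAtt.term(), offsetAtt.startOffset(), offsetAtt.endOffset());
              t.setType(typeAtt.type());
              t.setPositionIncrement(posIncAtt.getPositionIncrement());
              t.setFlags(flagsAtt.getFlags());
              t.setPayload(payloadAtt.getPayload());
              return t;
            }

            /** Copy a legacy Token back into the shared attribute state. */
            protected void writeToken(Token t) {
              termAtt.setTermBuffer(t.termBuffer(), 0, t.termLength());
              offsetAtt.setOffset(t.startOffset(), t.endOffset());
              typeAtt.setType(t.type());
              posIncAtt.setPositionIncrement(t.getPositionIncrement());
              flagsAtt.setFlags(t.getFlags());
              payloadAtt.setPayload(t.getPayload());
            }
          }
          {code}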

          Chris Male added a comment -

          Attaching a small update to Robert's patch. Includes some small code readability improvements, much along the same lines that I've done for the WordDelimiterFilter in SOLR-1710.

          Robert Muir added a comment -

          Striking through WDF since I think it's at least close.

          Robert Muir added a comment -

          Not sure... I guess it depends on the attribute and what it does.

          Me neither! Well, there are two patches now on SOLR-1710, so if we don't want this we can just use the first one.
          I thought about this one a lot and came to the conclusion that if you really care about your custom attributes making sense, you will use preserveOriginal, but I think both versions work well with that line of reasoning.

          Yonik Seeley added a comment -

          But the odd thing about this will be, when 'WDF is just removing punctuation' && preserveOriginal == true, obviously the attributes will only apply to the original... does this make sense?

          Not sure... I guess it depends on the attribute and what it does.

          Robert Muir added a comment -

          Yonik, I agree, this is almost what the current patch does (take a look if you want, SOLR-1710).

          There is one difference I must change: the 'when WDF is just removing punctuation' case. The current patch does not preserve attributes for this case (you must use preserveOriginal=true).

          But the odd thing about this will be, when 'WDF is just removing punctuation' && preserveOriginal == true, obviously the attributes will only apply to the original... does this make sense?

          I will make the change to the SOLR-1710 patch.

          Yonik Seeley added a comment -

          What about preserving the attributes for just the first token? That makes a lot of sense in many cases (say when WDF is just removing punctuation).
          So if preserveOriginal==true, the first token would always be the original. This should also be the most performant since it's just a modification to the first token (offset and termText)?

          Robert Muir added a comment -

          Hello, I am working on WordDelimiterFilter and I have a question: how do we want custom attributes to work here?

          This affects performance of the filter under the new tokenstream API, as it will determine when/if we have to save/restore state.

          Here are two alternatives:

          Alternative #1 (most performant): custom attributes from the original term will only apply to words with no delimiters, or in the case of words with delimiters, only to the 'original' token output with the 'preserveOriginal' option. This is easiest to understand in my opinion, and would perform the best. It's arguable that if you split a term into 10 subwords, applying these attributes to all 10 subwords may no longer make sense.

          Alternative #2 (least performant): custom attributes from the original term will only apply to non-injected terms: this means if a word is split into 10 tokens, all 10 subword tokens, but not their concatenations, also have the attributes derived from the original term. If preserveOriginal is on, then the original token has the attributes as well.

          Alternative #3: ??? your ideas?

          In my opinion, we should shoot for maximum performance, as I view this as somewhat like a tokenizer, and custom attributes in general would be applied after WDF, because trying to apply them before WDF and expecting them to make sense afterwards would be confusing. But it does not really matter that much.
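
          To make the trade-off concrete, here is a toy sketch (emphatically not WordDelimiterFilter itself) of a filter that splits on '-' and follows alternative #2: it pays one captureState() per split token plus one restoreState() per extra subword so that the original term's custom attributes carry over to every subword. Alternative #1 would simply skip the capture/restore for the subwords.

          {code:java}
          import java.io.IOException;
          import java.util.LinkedList;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.analysis.tokenattributes.TermAttribute;
          import org.apache.lucene.util.AttributeSource;

          public final class HyphenSplitFilter extends TokenFilter {
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);
            private final PositionIncrementAttribute posIncAtt =
                addAttribute(PositionIncrementAttribute.class);

            private final LinkedList<String> pending = new LinkedList<String>();
            private AttributeSource.State savedState; // full attribute state of the original token

            public HyphenSplitFilter(TokenStream input) {
              super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
              if (!pending.isEmpty()) {
                // alternative #2 semantics: restore the original token's attributes
                // (including any custom ones) before overwriting the term
                restoreState(savedState);
                termAtt.setTermBuffer(pending.removeFirst());
                posIncAtt.setPositionIncrement(1);
                return true;
              }
              while (input.incrementToken()) {
                String term = termAtt.term();
                if (term.indexOf('-') < 0) {
                  return true; // no delimiters: pass through, no state capture needed
                }
                for (String part : term.split("-")) {
                  if (part.length() > 0) {
                    pending.add(part);
                  }
                }
                if (!pending.isEmpty()) {
                  savedState = captureState(); // the cost of carrying attributes to subwords
                  termAtt.setTermBuffer(pending.removeFirst());
                  return true;
                }
                // the term was all delimiters; skip it and pull the next token
              }
              return false;
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              pending.clear();
              savedState = null;
            }
          }
          {code}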

          Robert Muir added a comment -

          This one includes a converted HyphenatedWordsFilter too.

          Robert Muir added a comment -

          Converts CommonGramsFilter, CommonGramsQueryFilter, and RemoveDuplicatesFilter to the new tokenstream API.

          With this patch, no Solr tokenstreams extend BufferedTokenStream, so now it can be removed, deprecated, converted with a warning that it doesn't reuse (it never did), or whatever.

          Robert Muir added a comment -

          Hi Uwe, I was thinking about some ideas on how we could do this; here is what I think in general:

          • CommonGrams[Query]Filter and RemoveDuplicatesFilter should be implemented with the new API directly; I think they can be a lot more efficient.
          • BufferedTokenStream should be deprecated; it cannot properly support the attributes API, only the attributes in Token.
          • Non-final tokenstreams should be made final.

          So according to the above there are some breaks in API compatibility, but it is not possible to move to the new API without breaks. This is because you cannot properly support the attributes API and at the same time extend BufferedTokenStream (see CommonGrams for an example). On top of this, these are non-final! So it is impossible in my opinion.
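
          As a rough sketch of the first bullet ("implemented with the new API directly"), duplicate removal needs nothing beyond the term and position-increment attributes. Something along these lines, using the 3.x-era class names; illustrative only, not the committed code:

          {code:java}
          import java.io.IOException;

          import org.apache.lucene.analysis.CharArraySet;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.analysis.tokenattributes.TermAttribute;

          public final class SimpleRemoveDuplicatesFilter extends TokenFilter {
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);
            private final PositionIncrementAttribute posIncAtt =
                addAttribute(PositionIncrementAttribute.class);
            // terms already emitted at the current position
            private final CharArraySet previous = new CharArraySet(8, false);

            public SimpleRemoveDuplicatesFilter(TokenStream input) {
              super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
              while (input.incrementToken()) {
                if (posIncAtt.getPositionIncrement() > 0) {
                  previous.clear(); // new position: forget the terms seen at the old one
                }
                final char[] term = termAtt.termBuffer();
                final int length = termAtt.termLength();
                if (!previous.contains(term, 0, length)) {
                  // copy the term: the attribute's buffer is reused for the next token
                  char[] saved = new char[length];
                  System.arraycopy(term, 0, saved, 0, length);
                  previous.add(saved);
                  return true;
                }
                // duplicate term at the same position: drop it and keep pulling
              }
              return false;
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              previous.clear();
            }
          }
          {code}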

          Uwe Schindler added a comment -

          I would help work on this, but SOLR-1674 should be committed first. Otherwise I would get stuck with multiple patches.

          Robert Muir added a comment -

          + test classes that use old stuff for testing

          In fact I am working on the tests first. In my opinion, what we should do is take the functionality of Lucene's BaseTokenStreamTestCase and add it to BaseTokenTestCase.
          Perhaps this old stuff can be implemented on top of assertTokenStreamContents, assertAnalyzesTo, etc., or we can change the tests.

          Either way, my reasoning is that this logic is very careful about ensuring that tokenstreams behave properly: it inserts garbage data to make sure that attributes are cleared correctly, and so on.
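
          For reference, a test on top of those helpers could look roughly like the sketch below; the analyzer, class name, and expected tokens are placeholders, and the imports follow the 3.x-era package layout.

          {code:java}
          import org.apache.lucene.analysis.BaseTokenStreamTestCase;
          import org.apache.lucene.analysis.WhitespaceAnalyzer;

          public class TestExampleAnalysis extends BaseTokenStreamTestCase {
            public void testSimpleTokens() throws Exception {
              // assertAnalyzesTo drives the full consumer workflow (reset/incrementToken/end/close)
              // and pre-fills attributes with garbage to check they are cleared correctly
              assertAnalyzesTo(new WhitespaceAnalyzer(), "the quick brown fox",
                  new String[] { "the", "quick", "brown", "fox" });
            }
          }
          {code}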

          Mark Miller added a comment -

          + test classes that use old stuff for testing


            People

             • Assignee:
               Mark Miller
             • Reporter:
               Robert Muir
             • Votes:
               0
             • Watchers:
               2
