Solr / SOLR-2211

Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      The Lucene 3.x StandardTokenizer with UAX#29 support improves tokenization of non-English text. At present it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to get the improved Unicode processing without also including the IP address and email address processing of StandardAnalyzer. A FilterFactory that allowed the StandardTokenizer with UAX#29 support to be used on its own would be useful.
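      For readers unfamiliar with UAX#29, the following is a minimal sketch of word-boundary segmentation using only the JDK. It is not the Lucene tokenizer: it assumes that `java.text.BreakIterator`, which follows similar but not identical word-break rules, is a close enough stand-in to show the idea. The class name `WordBreakSketch` and the helper `words` are hypothetical names introduced for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /**
     * Splits text at the JDK's word boundaries and keeps only the segments
     * that contain at least one letter or digit (dropping spaces and punctuation).
     * Assumption: BreakIterator approximates UAX#29; it is not Lucene's implementation.
     */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Plain words: punctuation and whitespace segments are filtered out.
        System.out.println(words("Hello, world"));
        // Email- and IP-like input: print how the boundary rules alone segment it.
        System.out.println(words("user@example.com 192.168.0.1"));
    }
}
```

      Running `main` on email-like or IP-like input shows what plain word-break rules do with such strings when no dedicated URL or email token type is involved, which is the behavior this issue asks to make available on its own.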

      1. SOLR-2211.patch
        6 kB
        Tom Burton-West

          Activity

          Robert Muir added a comment -

          Tom, for this one we just want to wrap org.apache.lucene.analysis.standard.UAX29Tokenizer, care to make a patch?

          Tom Burton-West added a comment -

          Sure, I'll give it a try. I've got a large Monday morning backlog on my to-do list today, so it will probably be towards the middle of the week.

          Robert Muir added a comment -

          Sounds great, this one has no external dependencies, so it can just be with the rest of the factories.

          I'll look at starting on the ant/build-system-stuff for SOLR-2210.

          Tom Burton-West added a comment -

          Patch implements Solr UAX29TokenizerFactory and TestUAX29TokenizerFactory.

          Tom

          Robert Muir added a comment -

          Thanks Tom, looks great. I'll commit soon.

          Robert Muir added a comment -

          Committed revision 1032776, 1032779 (3x).

          Thanks Tom!

          Tom Burton-West added a comment -

          Thanks for all your help Robert. We will be testing this and the ICUTokenizer tomorrow against a few thousand documents to see how it impacts our unique term counts. I'll post results to the list once I have something interesting to report.

          Robert Muir added a comment -

          Great, I look forward to the results.

          By the way, on SOLR-2210 I also added the ICU filters. You could consider replacing LowerCaseFilterFactory with ICUNormalizer2FilterFactory (just use the defaults).
          In addition to better lowercasing (e.g. ß -> ss), this would also bring the advantages described in http://unicode.org/reports/tr15/

          Alternatively, if you are already using both LowerCaseFilterFactory and ASCIIFoldingFilterFactory, you can replace both with ICUFoldingFilterFactory,
          which goes further and also incorporates http://www.unicode.org/reports/tr30/tr30-4.html
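
          To illustrate the difference between plain lowercasing and normalization with case folding, here is a rough sketch using only JDK classes. It is not the ICU filter itself. It assumes the ICU normalizer defaults amount to NFKC plus case folding, and because the JDK has no case-folding API, it approximates folding by upper-casing and then lower-casing. The class name `NormalizeSketch` and both helper methods are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

public class NormalizeSketch {

    /** Plain lowercasing, roughly what a simple lowercase filter does. */
    static String simpleLower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    /**
     * Rough stand-in for NFKC normalization with case folding.
     * Assumption: upper- then lower-casing approximates full case folding;
     * it is not guaranteed to match ICU for every code point.
     */
    static String nfkcCaseFold(String s) {
        String nfkc = Normalizer.normalize(s, Normalizer.Form.NFKC);
        String folded = nfkc.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        // Normalize again in case the case mapping produced a non-normalized sequence.
        return Normalizer.normalize(folded, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String word = "Stra\u00DFe";      // "Straße"
        String ligature = "\uFB01le";     // "file" written with the fi ligature U+FB01

        System.out.println(simpleLower(word));      // keeps the sharp s
        System.out.println(nfkcCaseFold(word));     // sharp s becomes "ss"
        System.out.println(nfkcCaseFold(ligature)); // ligature is decomposed to "fi"
    }
}
```

          With plain lowercasing, "Straße" and "STRASSE" index as different terms; with folding they collapse to the same term, which is one of the ways these filters can reduce unique term counts.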

          Steve Rowe added a comment -

          LUCENE-2763 swapped UAX29Tokenizer and StandardTokenizer - in that issue, I renamed UAX29Tokenizer to UAX29URLEmailTokenizer, and UAX29TokenizerFactory to UAX29URLEmailTokenizerFactory.

          Bottom line: if you want UAX#29 word break rules without URL and e-mail tokenization, go with StandardTokenizer(Factory).

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release


            People

            • Assignee: Robert Muir
            • Reporter: Tom Burton-West
            • Votes: 0
            • Watchers: 1
