Solr / SOLR-2210

Provide Solr FilterFactory for Lucene ICUTokenizer

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The Lucene ICUTokenizer provides many benefits for multilingual tokenizing. There should be an ICUFilterFactory so that it can be used from Solr. Passing configuration parameters through the factory will probably raise some issues.

      1. SOLR-2210.patch
        47 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Thanks for opening this, Tom.

        I've got some barebones filters for some of this stuff on my computer.
        Because the ICU jar file is large, I was trying to see if I could solve LUCENE-2510 first, but that would only fix the problem for 4.0 anyway.
        I think we should just make an ICU contrib for now, and put the factories (Tokenizer, Normalizer, Folding, Transliterator, Collation) and the jar file in there.

        Robert Muir added a comment -

        Actually, another idea would be to just make an 'extraAnalyzers' contrib.
        Then we could also add factories for Smart Chinese, Polish, etc., without creating a ton of contribs.

        I think this would be a good solution to expose all the Lucene analyzers to Solr,
        since, to me, LUCENE-2510 seems tricky.

        Robert Muir added a comment -

        Here's a start: it makes an analysis-extras contrib with all the build logic, and factories for the ICU filters.

        Still to do: add support for custom normalization and custom tokenizer config, plus filters for Smart Chinese and Stempel.

        But I think it's OK to commit this as-is and improve it in svn.

        Robert Muir added a comment -

        OK, I committed the baseline code (rev 1030012, rev 1030022 in 3x).

        We can keep the issue open and just add patches against it for customization, etc.
        I just wanted to get all the build-system stuff working so this was easy.
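
        For readers who want to try the committed factories, a field type along the following lines might be a starting point. This is only a sketch: the thread does not name the factory classes, so `solr.ICUTokenizerFactory` and `solr.ICUFoldingFilterFactory` are assumed here from the names used in later Solr releases, and the field type name `text_icu` is purely illustrative. The analysis-extras contrib jars and the ICU jar would also need to be on Solr's classpath.

        ```xml
        <!-- Hypothetical field type; the name "text_icu" is illustrative only. -->
        <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <!-- ICU-based tokenizer (factory name assumed, not stated in this issue) -->
            <tokenizer class="solr.ICUTokenizerFactory"/>
            <!-- ICU folding filter (factory name assumed, not stated in this issue) -->
            <filter class="solr.ICUFoldingFilterFactory"/>
          </analyzer>
        </fieldType>
        ```

        Custom normalization and custom tokenizer configuration are listed above as still to do, so this sketch deliberately passes no extra attributes to either factory.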


          People

          • Assignee:
            Unassigned
            Reporter:
            Tom Burton-West
          • Votes:
            0
            Watchers:
            0
