Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.6.1, 6.0, 5.4.1, 6.0.1
    • Fix Version/s: 6.0.1
    • Component/s: modules/analysis
    • Labels:
    • Lucene Fields:
      New, Patch Available
    • Flags:
      Patch

      Description

      One of the challenges in search is recalling an item when the query uses a common typing variant. These variants can be as simple as lower/upper case in most languages or accented characters, or they can involve more complex morphological phenomena such as an omitted prefix or a character constructed with a combining mark. This component addresses cases that the ASCII folding component does not cover, or that are more complex to handle with other tools. The idea is that a linguist provides the mappings in a tab-delimited file, which Solr can then use directly.

      The mappings are maintained in a tab-delimited file, which can simply be copied and pasted from an Excel spreadsheet. This lets linguists create the mappings and developers include them in the Solr configuration. In a few cases, when the mappings grow complex, some additional debugging may be required. A mapping can take any sequence of characters to any other sequence of characters.
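
      As a rough illustration of the file format described above, here is a minimal sketch that reads "source<TAB>target" lines and rewrites a single token, preferring the longest source sequence at each position. The class and method names (MappingSketch, parse, apply) are hypothetical and are not taken from the attached patch; the longest-match strategy and the skipping of blank lines are assumptions made for the example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the API of the attached patch.
final class MappingSketch {

    // Parse "source<TAB>target" lines into an ordered map.
    // Blank lines are skipped (an assumed convention); an empty target deletes the source.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            String[] cols = line.split("\t", -1);
            if (cols.length != 2 || cols[0].isEmpty()) {
                throw new IllegalArgumentException("Expected source<TAB>target: " + line);
            }
            mappings.put(cols[0], cols[1]);
        }
        return mappings;
    }

    // Rewrite one token left to right, trying the longest source sequence first
    // at each position; characters with no mapping are copied unchanged.
    static String apply(Map<String, String> mappings, String token) {
        List<String> sources = new ArrayList<>(mappings.keySet());
        sources.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < token.length()) {
            boolean matched = false;
            for (String src : sources) {
                if (token.startsWith(src, i)) {
                    out.append(mappings.get(src));
                    i += src.length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(token.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One mapping line: Cyrillic letter io (U+0451) to Cyrillic letter ie (U+0435),
        // a common Russian typing substitution.
        Map<String, String> mappings = parse(Arrays.asList("\u0451\t\u0435"));
        System.out.println(apply(mappings, "\u0451\u043b\u043a\u0430"));
    }
}
```

      Because the mappings live in a plain map keyed by the source sequence, a spreadsheet export can be loaded without any intermediate format. A real analysis component would also need to deal with offsets and overlapping rules, which this sketch ignores.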

      Some of the cases I discuss in the detailed document are: handling the voiced vowels for Japanese; common typing substitutions for Korean, Russian, and Polish; transliteration for Polish and Arabic; prefix removal for Arabic; and suffix folding for Japanese. In the appendix, I give an example of implementing a lightweight Russian stemmer using this component.

      1. CharacterMappingComponent.pdf
        361 kB
        Ivan Provalov
      2. LUCENE-7321.patch
        58 kB
        Ivan Provalov

        Activity

        iprovalo Ivan Provalov added a comment -

        Initial patch.

        iprovalo Ivan Provalov added a comment -

        Detailed component description.

        koji Koji Sekiguchi added a comment -

        What is the advantage of this compared to MappingCharFilter?

        iprovalo Ivan Provalov added a comment -

        Koji, this one works at the token level, which allows things like prefix/suffix manipulations. The graph generator and collapser also make it user-friendly when dealing with a lot of mappings (please see the attached description file).
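
        To make the token-level point concrete, here is a minimal sketch of rules anchored to the start or end of a token, which a purely character-level mapping cannot express because it has no notion of token boundaries. The names (AffixSketch, stripPrefix, foldSuffix) are hypothetical and do not come from the patch, and the Latin "al" stands in for the Arabic definite article purely for readability.

```java
// Hypothetical illustration only; this is not the API of the attached patch.
final class AffixSketch {

    // Remove the prefix only when it starts the token and a non-empty stem remains.
    static String stripPrefix(String token, String prefix) {
        if (token.length() > prefix.length() && token.startsWith(prefix)) {
            return token.substring(prefix.length());
        }
        return token;
    }

    // Replace the suffix only when it ends the token and a non-empty stem remains.
    static String foldSuffix(String token, String suffix, String replacement) {
        if (token.length() > suffix.length() && token.endsWith(suffix)) {
            return token.substring(0, token.length() - suffix.length()) + replacement;
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("alkitab", "al"));      // prefix at token start is removed
        System.out.println(stripPrefix("halal", "al"));        // "al" elsewhere in the token is left alone
        System.out.println(foldSuffix("colour", "our", "or")); // suffix at token end is folded
    }
}
```

        An unanchored character mapping for "al" would also rewrite the occurrences inside "halal", which is the kind of over-matching that a token-level rule avoids.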


          People

          • Assignee:
            Unassigned
            Reporter:
            iprovalo Ivan Provalov
          • Votes:
            0
            Watchers:
            3