SOLR-3653: Custom bigramming filter to handle Smart Chinese edge cases

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels: None

    Description

      The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn mishandles some edge cases: it fails to split certain words that were not part of its dictionary or training corpus.

      This patch supplies a bigramming filter class to handle these occasional mistakes. The algorithm creates bigrams out of every "word" longer than two ideograms.
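
      The bigramming rule described above can be illustrated with a minimal sketch. This is not the code in SOLR-3653.patch: the class name BigramSketch and the method bigramLongTokens are hypothetical, the sketch works on plain strings rather than a Lucene TokenStream, and it assumes the bigrams overlap (each adjacent pair of ideograms becomes one bigram), which the description does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from SOLR-3653.patch.
final class BigramSketch {

    // Tokens of one or two code points pass through unchanged.
    // Longer tokens are replaced by their overlapping bigrams (an assumption).
    static List<String> bigramLongTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // Work on code points so supplementary-plane ideographs count as one.
            int[] cps = token.codePoints().toArray();
            if (cps.length <= 2) {
                out.add(token);
                continue;
            }
            for (int i = 0; i + 1 < cps.length; i++) {
                out.add(new String(cps, i, 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The second token stands in for a segmenter output that was left unsplit.
        System.out.println(bigramLongTokens(Arrays.asList("中国", "人民银行")));
        // prints [中国, 人民, 民银, 银行]
    }
}
```

      The sketch counts every code point toward a token's length and does not check that it is an ideogram. A real filter would presumably test the token type or script first and preserve offsets and position increments.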

      Attachments

        1. translations_first_500.trigrams.txt (10 kB, Lance Norskog)
        2. translations_first_500.quad.txt (9 kB, Lance Norskog)
        3. translations_450.five2thirteen.txt (11 kB, Lance Norskog)
        4. SOLR-3653.patch (19 kB, Lance Norskog)
        5. SmartChineseType.pdf (104 kB, Lance Norskog)

            People

              Assignee: Unassigned
              Reporter: Lance Norskog (lancenorskog)
              Votes: 0
              Watchers: 3
