  1. Solr
  2. SOLR-3653

Custom bigramming filter to handle Smart Chinese edge cases

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not work in some edge cases: it fails to split certain words that were not part of its dictionary or training corpus.

      This patch supplies a bigramming class to handle these occasional mistakes. The algorithm creates bigrams out of all "words" longer than two ideograms.
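
      The bigramming rule described above can be sketched in plain Java. This is a minimal illustration and not the code in the attached patch: the class name BigramSketch and the method bigramLongTokens are hypothetical, the input is assumed to be the list of "words" already produced by the segmenter, and the bigrams are assumed to overlap (the description does not say whether they do). Tokens are measured in Unicode code points so that ideograms outside the Basic Multilingual Plane count as one character.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    /**
     * Hypothetical helper: passes through tokens of one or two code points
     * and replaces every longer token with its overlapping bigrams.
     */
    static List<String> bigramLongTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // Work on code points, not chars, so surrogate pairs stay intact.
            int[] cps = token.codePoints().toArray();
            if (cps.length <= 2) {
                out.add(token); // short enough: keep the segmenter's word as-is
                continue;
            }
            // Assumption: overlapping bigrams (positions 0-1, 1-2, 2-3, ...).
            for (int i = 0; i + 1 < cps.length; i++) {
                out.add(new String(cps, i, 2));
            }
        }
        return out;
    }
}
```

      Under these assumptions, a four-ideogram token yields three bigrams, while one- and two-ideogram tokens are left untouched. In a real analysis chain this logic would live in a Lucene TokenFilter, which would also have to maintain offsets and position increments; that part is omitted here.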

        Attachments

        1. SOLR-3653.patch
          19 kB
          Lance Norskog
        2. SmartChineseType.pdf
          104 kB
          Lance Norskog
        3. translations_first_500.quad.txt
          9 kB
          Lance Norskog
        4. translations_first_500.trigrams.txt
          10 kB
          Lance Norskog
        5. translations_450.five2thirteen.txt
          11 kB
          Lance Norskog

              People

              • Assignee:
                Unassigned
                Reporter:
                lancenorskog Lance Norskog
              • Votes:
                0
                Watchers:
                3