  1. Solr
  2. SOLR-822

CharFilter - normalize characters before tokenizer

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels: None

    Description

      A new plugin, <charFilter/>, that can be placed in front of <tokenizer/> to normalize characters before tokenization:

      <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
        <analyzer>
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
          <tokenizer class="solr.MappingCJKTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
      

      Multiple <charFilter/> elements can be specified (chained). I'll post a JPEG file showing a character normalization sample soon.
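
      As an illustration of chaining, a minimal sketch with two <charFilter/> elements might look like the following. The field type name and the file name mapping_width.txt are hypothetical, and the sketch assumes the filters run in the order listed, each reading the output of the previous one, before the tokenizer sees the text.

      <fieldType name="textCharNormChained" class="solr.TextField" positionIncrementGap="100" >
        <analyzer>
          <!-- hypothetical mapping file, e.g. half-width to full-width katakana -->
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_width.txt" />
          <!-- assumed to be applied to the output of the first filter -->
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
          <tokenizer class="solr.MappingCJKTokenizerFactory"/>
        </analyzer>
      </fieldType>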

      MOTIVATION:
      In Japan, two types of tokenizers are in common use: N-gram (CJKTokenizer) and morphological analyzers.
      Because a morphological analyzer uses a Japanese dictionary to detect terms,
      characters need to be normalized before tokenization.

      I'll post a patch soon, too.

      Attachments

        1. character-normalization.JPG
          30 kB
          Koji Sekiguchi
        2. SOLR-822.patch
          52 kB
          Koji Sekiguchi
        3. SOLR-822.patch
          52 kB
          Koji Sekiguchi
        4. sample_mapping_ja.txt
          1 kB
          Koji Sekiguchi
        5. SOLR-822.patch
          48 kB
          Koji Sekiguchi
        6. SOLR-822.patch
          57 kB
          Koji Sekiguchi
        7. sample_mapping_ja.txt
          2 kB
          Koji Sekiguchi
        8. SOLR-822.patch
          60 kB
          Koji Sekiguchi
        9. SOLR-822-for-1.3.patch
          60 kB
          Koji Sekiguchi
        10. SOLR-822-renameMethod.patch
          7 kB
          Koji Sekiguchi
        11. japanese-h-to-k-mapping.txt
          3 kB
          Mark Bennett

          People

            Assignee: Koji Sekiguchi
            Reporter: Koji Sekiguchi
            Votes: 0
            Watchers: 2
