  1. Solr
  2. SOLR-822

CharFilter - normalize characters before tokenizer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      A new plugin, CharFilter, that can be placed in front of <tokenizer/> in an analyzer to normalize characters before tokenization.

      <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
        <analyzer>
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
          <tokenizer class="solr.MappingCJKTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
      

      Multiple <charFilter/> elements can be specified (chained). I'll post a JPEG file showing a character normalization sample soon.
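
      As a rough sketch of chaining, an analyzer could declare two mapping filters ahead of the tokenizer. The field type name and the mapping file names below (mapping_width.txt, mapping_kana.txt) are hypothetical placeholders rather than files from the patch, and the assumption that filters run in declaration order should be checked against the patch itself:

      ```xml
      <!-- Sketch only: field type name and mapping file names are hypothetical -->
      <fieldType name="textCharNormChained" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <!-- assumed to run first: e.g. fold full-width characters to half-width -->
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_width.txt"/>
          <!-- assumed to run second, on the output of the first filter -->
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_kana.txt"/>
          <!-- the tokenizer then sees only the normalized character stream -->
          <tokenizer class="solr.MappingCJKTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
      ```

      Splitting the rules across separate files is only one option; a single mapping file holding all rules would likely work as well.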

      MOTIVATION:
      For Japanese text, two types of tokenizers are in use: N-gram (CJKTokenizer) and morphological analyzers.
      A morphological analyzer relies on a Japanese dictionary to detect terms,
      so characters need to be normalized before tokenization.

      I'll post a patch soon, too.

        Attachments

        1. SOLR-822-renameMethod.patch
          7 kB
          Koji Sekiguchi
        2. SOLR-822-for-1.3.patch
          60 kB
          Koji Sekiguchi
        3. SOLR-822.patch
          52 kB
          Koji Sekiguchi
        4. SOLR-822.patch
          52 kB
          Koji Sekiguchi
        5. SOLR-822.patch
          48 kB
          Koji Sekiguchi
        6. SOLR-822.patch
          57 kB
          Koji Sekiguchi
        7. SOLR-822.patch
          60 kB
          Koji Sekiguchi
        8. sample_mapping_ja.txt
          1 kB
          Koji Sekiguchi
        9. sample_mapping_ja.txt
          2 kB
          Koji Sekiguchi
        10. japanese-h-to-k-mapping.txt
          3 kB
          Mark Bennett
        11. character-normalization.JPG
          30 kB
          Koji Sekiguchi

            People

            • Assignee: Koji Sekiguchi
            • Reporter: Koji Sekiguchi
            • Votes: 0
            • Watchers: 2
