Solr / SOLR-822

CharFilter - normalize characters before tokenizer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      A new plugin, <charFilter/>, which can be placed in front of <tokenizer/> to normalize characters before tokenization.

      <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
        <analyzer>
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
          <tokenizer class="solr.MappingCJKTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
      

      Multiple <charFilter/> elements can be specified (chained). I'll post a JPEG file showing a character normalization sample soon.
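
      To illustrate what chaining means, here is a minimal Java sketch of two mapping stages applied in order before any tokenizer sees the text. It is not the code from the attached patches: the class name CharFilterChainSketch, its methods, the longest-match rule, and the three mapping rules are all assumptions made for this example.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; these are not the classes from the SOLR-822 patch. */
final class CharFilterChainSketch {

    /** One stage: rewrite the input with a source-to-target mapping, preferring the longest source at each position. */
    static String applyMapping(String input, Map<String, String> mapping) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            String best = null;
            for (String source : mapping.keySet()) {
                boolean longer = best == null || source.length() > best.length();
                if (!source.isEmpty() && longer && input.startsWith(source, i)) {
                    best = source;
                }
            }
            if (best == null) {
                out.append(input.charAt(i));   // no rule applies: copy the character through
                i++;
            } else {
                out.append(mapping.get(best)); // emit the replacement and skip the matched source
                i += best.length();
            }
        }
        return out.toString();
    }

    /** Runs every stage in order, the way several charFilter elements would be chained. */
    static String normalize(String input, List<Map<String, String>> stages) {
        String text = input;
        for (Map<String, String> stage : stages) {
            text = applyMapping(text, stage);
        }
        return text;
    }

    public static void main(String[] args) {
        // Stage 1 (assumed rules): half-width katakana to full-width katakana.
        Map<String, String> katakana = Map.of(
                "\uFF76", "\u30AB",         // half-width KA -> full-width KA
                "\uFF76\uFF9E", "\u30AC");  // half-width KA + voiced mark -> full-width GA
        // Stage 2 (assumed rule): full-width digit to ASCII digit.
        Map<String, String> digits = Map.of("\uFF11", "1");

        String normalized = normalize("\uFF76\uFF9E\uFF11", List.of(katakana, digits));
        System.out.println(normalized.equals("\u30AC" + "1")); // prints true
    }
}
```

      Each stage receives the output of the previous one, so the order of the stages can change the result. The sketch works on whole strings and ignores the offset bookkeeping a real character filter may need.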

      MOTIVATION:
      For Japanese text, there are two types of tokenizers: N-gram (CJKTokenizer) and morphological analyzers.
      A morphological analyzer uses a Japanese dictionary to detect terms,
      so characters need to be normalized before tokenization.
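
      The dictionary problem can be shown with a small Java sketch. It uses java.text.Normalizer with NFKC as a stand-in for a mapping file, which is a swapped-in technique and not what the patch does. The class name DictionaryLookupSketch and its one-entry dictionary are hypothetical.

```java
import java.text.Normalizer;
import java.util.Set;

/** Hypothetical illustration; a real morphological analyzer is far more involved. */
final class DictionaryLookupSketch {

    // Assumed one-entry dictionary: the word "gaido" (guide) in full-width katakana.
    static final Set<String> DICTIONARY = Set.of("\u30AC\u30A4\u30C9");

    /** Stand-in normalizer: NFKC folds half-width katakana into full-width forms. */
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /** Exact-match lookup, as a dictionary-driven analyzer would need for term detection. */
    static boolean isKnownTerm(String term) {
        return DICTIONARY.contains(term);
    }

    public static void main(String[] args) {
        // The same word typed in half-width katakana (five code units).
        String halfWidth = "\uFF76\uFF9E\uFF72\uFF84\uFF9E";
        System.out.println(isKnownTerm(halfWidth));            // prints false
        System.out.println(isKnownTerm(normalize(halfWidth))); // prints true
    }
}
```

      Without normalization the half-width spelling never matches the dictionary entry, which is why the normalization step has to run before the tokenizer.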

      I'll post a patch soon, too.

      Attachments

      1. japanese-h-to-k-mapping.txt
        3 kB
        Mark Bennett
      2. SOLR-822-renameMethod.patch
        7 kB
        Koji Sekiguchi
      3. SOLR-822-for-1.3.patch
        60 kB
        Koji Sekiguchi
      4. SOLR-822.patch
        60 kB
        Koji Sekiguchi
      5. sample_mapping_ja.txt
        2 kB
        Koji Sekiguchi
      6. SOLR-822.patch
        57 kB
        Koji Sekiguchi
      7. SOLR-822.patch
        48 kB
        Koji Sekiguchi
      8. sample_mapping_ja.txt
        1 kB
        Koji Sekiguchi
      9. SOLR-822.patch
        52 kB
        Koji Sekiguchi
      10. SOLR-822.patch
        52 kB
        Koji Sekiguchi
      11. character-normalization.JPG
        30 kB
        Koji Sekiguchi

        Activity

        Koji Sekiguchi created issue
        Koji Sekiguchi made changes
          Field | Original Value | New Value
          Attachment | | character-normalization.JPG [ 12392639 ]
        Koji Sekiguchi made changes
          Attachment | | SOLR-822.patch [ 12392641 ]
        Koji Sekiguchi made changes
          Attachment | | SOLR-822.patch [ 12392730 ]
        Koji Sekiguchi made changes
          Attachment | | sample_mapping_ja.txt [ 12392733 ]
        Koji Sekiguchi made changes
          Attachment | | SOLR-822.patch [ 12392977 ]
        Koji Sekiguchi made changes
          Attachment | | SOLR-822.patch [ 12393648 ]
        Koji Sekiguchi made changes
          Attachment | | sample_mapping_ja.txt [ 12393649 ]
        Koji Sekiguchi made changes
          Assignee | | Koji Sekiguchi [ koji ]
          Affects Version/s | | 1.3 [ 12312486 ]
          Fix Version/s | | 1.4 [ 12313351 ]
        Koji Sekiguchi made changes
          Attachment | | SOLR-822.patch [ 12393713 ]
        Koji Sekiguchi made changes
          Status | Open [ 1 ] | Resolved [ 5 ]
          Resolution | | Fixed [ 1 ]
        Koji Sekiguchi made changes
          Attachment | | SOLR-822-for-1.3.patch [ 12394228 ]
        Koji Sekiguchi made changes
          Resolution | Fixed [ 1 ] |
          Status | Resolved [ 5 ] | Reopened [ 4 ]
        Koji Sekiguchi made changes
          Attachment | | SOLR-822-renameMethod.patch [ 12402551 ]
        Koji Sekiguchi made changes
          Status | Reopened [ 4 ] | Resolved [ 5 ]
          Resolution | | Fixed [ 1 ]
        Mark Bennett made changes
          Attachment | | japanese-h-to-k-mapping.txt [ 12408724 ]
        Grant Ingersoll made changes
          Status | Resolved [ 5 ] | Closed [ 6 ]

          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Koji Sekiguchi
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:
