Lucene - Core / LUCENE-7273

New kuromoji TokenFilter to keep tokens by part-of-speech tags

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      Kuromoji has JapanesePartOfSpeechStopFilter, which drops tokens by their part-of-speech (POS) tags. In some cases it would be more convenient to do the opposite: keep only the tokens whose POS tags appear in a "keep" list.

      Example usage:

      // keep only proper nouns that are location names
      Set<String> keeptags = new HashSet<>(Arrays.asList("名詞-固有名詞-地域-一般"));
      // arguments: no user dictionary, do not discard punctuation, SEARCH mode
      JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
      // JapanesePartOfSpeechKeepFilter is the new filter proposed in this issue
      JapanesePartOfSpeechKeepFilter stream = new JapanesePartOfSpeechKeepFilter(tokenizer, keeptags);
      
      <!-- (Solr) analyzer definition -->
      <fieldType name="text_ja_propernoun" class="solr.TextField" positionIncrementGap="100" 
                 autoGeneratePhraseQueries="false">
          <analyzer>
              <tokenizer class="solr.JapaneseTokenizerFactory" mode="normal"/>
              <filter class="solr.CJKWidthFilterFactory"/>
              <filter class="solr.JapanesePartOfSpeechKeepFilterFactory" tags="lang/keeptags_ja.txt" />
              <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
              <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
      </fieldType>
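
      The tags attribute above points at lang/keeptags_ja.txt. A minimal sketch of what that file might contain, assuming the proposed factory reads it the same way the existing stop filter reads stoptags_ja.txt (one tag per line, lines starting with # treated as comments). That format is an assumption, since the patch is not attached yet:

```
# lang/keeptags_ja.txt (hypothetical contents)
# proper noun, place name, general
名詞-固有名詞-地域-一般
```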
      

      Of course, the same result can be achieved with JapanesePartOfSpeechStopFilter. However, because there are about 70 part-of-speech tags, listing every tag to stop just to keep tokens with the few POS tags of interest is cumbersome.
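
      To illustrate the workaround, here is a minimal sketch in plain Java (no Lucene dependency) that derives the stop list as the complement of a keep list. The class name, the complement helper, and the six-tag subset are assumptions made for illustration; the real tag set is much larger.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: shows how many stop tags
// must be listed in order to keep tokens with a single POS tag.
class StopTagsFromKeepTags {

    /** Returns every tag in allTags that is not in keepTags, preserving order. */
    static Set<String> complement(Set<String> allTags, Set<String> keepTags) {
        Set<String> stopTags = new LinkedHashSet<>(allTags); // copy, inputs stay untouched
        stopTags.removeAll(keepTags);
        return stopTags;
    }

    public static void main(String[] args) {
        // Illustrative subset only; the full tag set has about 70 entries.
        Set<String> allTags = new LinkedHashSet<>(Arrays.asList(
                "名詞-一般",
                "名詞-固有名詞-地域-一般",
                "名詞-固有名詞-人名-姓",
                "動詞-自立",
                "助詞-格助詞-一般",
                "助動詞"));
        Set<String> keepTags = new LinkedHashSet<>(Arrays.asList("名詞-固有名詞-地域-一般"));

        Set<String> stopTags = complement(allTags, keepTags);
        System.out.println("stop tags needed: " + stopTags.size());
        for (String tag : stopTags) {
            System.out.println(tag);
        }
    }
}
```

      With a keep filter, only the single entry in keepTags would need to be maintained; with the stop filter, every other tag has to be listed and kept in sync with the dictionary's tag set.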

      I'll add a patch soon.

            People

            • Assignee: Unassigned
            • Reporter: Tomoko Uchida (tomoko)
