[SOLR-3056] Introduce Japanese field type in schema.xml - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.6, 4.0-ALPHA
Fix Version/s: 3.6, 4.0-ALPHA
Component/s: Schema and Analysis
Labels: None

Description

Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again Robert, Uwe and Simon). It would be very good to get a default field type defined for Japanese in schema.xml so we can get good out-of-the-box Japanese support in Solr.

I've been playing with the configuration below today, which I think is a reasonable starting point for Japanese. There's a lot to be said about the various considerations necessary when searching Japanese, but perhaps a wiki page is more suitable for covering the wider topic?

To make the text_ja field type below work, Kuromoji itself and its analyzers need to be visible to the Solr classloader. However, these are currently in contrib and I'm wondering if we should consider moving them to core to make them directly available. If there are concerns about additional memory usage, etc. for non-Japanese users, we can make sure resources are loaded lazily and only when needed in factory-land.

Any thoughts?

<!-- Text field type suitable for Japanese text using morphological analysis

     NOTE: Please copy files
       contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
       dist/apache-solr-analysis-extras-x.y.z.jar
     to your Solr lib directory (e.g. example/solr/lib) before starting Solr.
     (x.y.z refers to a version number)

     If you would like to optimize for precision, set the default operator to AND with
       <solrQueryParser defaultOperator="AND"/>
     below (in this file).  Use "OR" (the default) if you would like to optimize for recall.
-->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Kuromoji Japanese morphological analyzer/tokenizer

         Use search-mode to get a noun-decompounding effect useful for search.

         Example:
           関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際 (International) 空港 (airport)
           so we get a match for 空港 (airport) as we would expect from a good search engine

         Valid values for mode are:
            normal: default segmentation
            search: segmentation useful for search (extra compound splitting)
          extended: search mode with unigramming of unknown words (experimental)

         NOTE: Search mode improves segmentation for search at the expense of part-of-speech accuracy
    -->
    <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
    <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
    <filter class="solr.KuromojiBaseFormFilterFactory"/>
    <!-- Optionally remove tokens with certain parts of speech
    <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/> -->
    <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lower-case romaji characters -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
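
For illustration, a field using this type could be declared in the fields section of schema.xml roughly as follows. This is only a minimal sketch: the field name "body_ja" is a hypothetical example and is not part of the proposed configuration or the attached patches.

```xml
<!-- Hypothetical example field; the name "body_ja" is an assumption made for illustration -->
<field name="body_ja" type="text_ja" indexed="true" stored="true"/>
```

With a declaration along these lines, text indexed into the field would pass through the analyzer chain above, so a query for 空港 would be expected to match a document containing 関西国際空港 when search mode is enabled.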

Attachments

SOLR-3056.patch | 08/Feb/12 13:13 | 4 kB | Robert Muir
SOLR-3056.patch | 09/Feb/12 10:42 | 5 kB | Christian Moen
SOLR-3056_typo.patch | 09/Feb/12 22:58 | 1.0 kB | Robert Muir
SOLR-3056_schema40.patch | 01/Feb/12 14:36 | 2 kB | Christian Moen
SOLR-3056_schema40.patch | 01/Feb/12 15:11 | 2 kB | Christian Moen
SOLR-3056_schema40.patch | 05/Feb/12 08:06 | 3 kB | Christian Moen
SOLR-3056_move.patch | 01/Feb/12 12:55 | 7 kB | Robert Muir

Issue Links

requires

LUCENE-3751 Align default Japanese configurations for Lucene and Solr (Closed)

SOLR-3097 Introduce default Japanese stoptags and stopwords to Solr's example configuration (Closed)

People

Assignee: Unassigned

Reporter: Christian Moen

Votes: 0

Watchers: 2

Dates

Created: 22/Jan/12 18:22

Updated: 10/May/13 10:39

Resolved: 09/Feb/12 22:46