[SOLR-3056] Introduce Japanese field type in schema.xml - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.6, 4.0-ALPHA
Fix Version/s: 3.6, 4.0-ALPHA
Component/s: Schema and Analysis
Labels: None

Description

Kuromoji (~~LUCENE-3305~~) is now on both trunk and branch_3x (thanks again Robert, Uwe and Simon). It would be very good to get a default field type defined for Japanese in schema.xml so we can get good Japanese out-of-the-box support in Solr.

I've been playing with the below configuration today, which I think is a reasonable starting point for Japanese. There's a lot to be said about the various considerations necessary when searching Japanese, but perhaps a wiki page is more suitable to cover the wider topic?

In order to make the below text_ja field type work, Kuromoji itself and its analyzers need to be seen by the Solr classloader. However, these are currently in contrib and I'm wondering if we should consider moving them to core to make them directly available. If there are concerns with additional memory usage, etc. for non-Japanese users, we can make sure resources are loaded lazily and only when needed in factory-land.

Any thoughts?

<!-- Text field type is suitable for Japanese text using morphological analysis

     NOTE: Please copy files
       contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
       dist/apache-solr-analysis-extras-x.y.z.jar
     to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
     (x.y.z refers to a version number)

     If you would like to optimize for precision, set the default operator to AND with
       <solrQueryParser defaultOperator="AND"/>
     below (in this file).  Use "OR" if you would like to optimize for recall (default).
-->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Kuromoji Japanese morphological analyzer/tokenizer

         Use search-mode to get a noun-decompounding effect useful for search.

         Example:
           関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際 (International) 空港 (airport)
           so we get a match for 空港 (airport) as we would expect from a good search engine

         Valid values for mode are:
            normal: default segmentation
            search: segmentation useful for search (extra compound splitting)
          extended: search mode with unigramming of unknown words (experimental)

         NOTE: Search mode improves segmentation for search at the expense of part-of-speech accuracy
    -->
    <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
    <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
    <filter class="solr.KuromojiBaseFormFilterFactory"/>
    <!-- Optionally remove tokens with certain part-of-speech tags
    <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/> -->
    <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lower-case romaji characters -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
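
As an alternative to copying the jars by hand as the NOTE in the comment above describes, the jars can probably be made visible to the Solr classloader with `<lib>` directives in solrconfig.xml. This is only a sketch: the two directory paths and the regular expressions are assumptions based on the file names given in the NOTE, and they need to match your installation layout.

```xml
<!-- solrconfig.xml (sketch): load the Kuromoji jars without copying them.
     Both dir paths are assumptions; relative paths are resolved against the
     Solr instance directory, so adjust them to your installation. -->
<lib dir="../../contrib/analysis-extras/lucene-libs" regex="lucene-kuromoji-.*\.jar"/>
<lib dir="../../dist" regex="apache-solr-analysis-extras-.*\.jar"/>
```

Either approach has the same goal: the tokenizer and filter factories referenced by text_ja must be loadable when the schema is parsed.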

Attachments

- SOLR-3056_move.patch (01/Feb/12 12:55, 7 kB, Robert Muir)
- SOLR-3056_schema40.patch (05/Feb/12 08:06, 3 kB, Christian Moen)
- SOLR-3056_schema40.patch (01/Feb/12 15:11, 2 kB, Christian Moen)
- SOLR-3056_schema40.patch (01/Feb/12 14:36, 2 kB, Christian Moen)
- SOLR-3056_typo.patch (09/Feb/12 22:58, 1.0 kB, Robert Muir)
- SOLR-3056.patch (09/Feb/12 10:42, 5 kB, Christian Moen)
- SOLR-3056.patch (08/Feb/12 13:13, 4 kB, Robert Muir)

Issue Links

requires:

- LUCENE-3751 Align default Japanese configurations for Lucene and Solr (Closed)
- SOLR-3097 Introduce default Japanese stoptags and stopwords to Solr's example configuration (Closed)

Activity

Robert Muir added a comment - 22/Jan/12 19:35

> It would be very good to get a default field type defined for Japanese in schema.xml so we can get good Japanese out-of-the-box support in Solr.

I agree, we really need this for all languages, including stopwords_xx files and fieldtypes actually,
but let's start with Japanese because it's complicated.

> I've been playing with the below configuration today, which I think is a reasonable starting point for Japanese. There's a lot to be said about the various considerations necessary when searching Japanese, but perhaps a wiki page is more suitable to cover the wider topic?

I think the ideal situation would be to have a single reasonable default (like the configuration you have), but then also a
full wiki page on Kuromoji explaining the different options, maybe even with alternative configurations or examples. We could
link to this page from the other wiki pages about the analyzers.

> In order to make the below text_ja field type work, Kuromoji itself and its analyzers need to be seen by the Solr classloader. However, these are currently in contrib and I'm wondering if we should consider moving them to core to make them directly available. If there are concerns with additional memory usage, etc. for non-Japanese users, we can make sure resources are loaded lazily and only when needed in factory-land.

Yeah, I don't think having Kuromoji in contrib is ideal. I think instead we should have examples for all supported languages
so it's easy to get started. Currently someone has to jump through serious hoops to segment Chinese or Japanese into words,
but as I mentioned before, all non-English languages currently are 'hard' in that there are no fieldtypes set up for them.

But anyway, my vote is to move these analyzers to core and nuke this contrib totally. It would be great for some
people to speak up so we get consensus on this, because it would only be more confusing to go back and forth between
contrib and core.

As far as the default configuration goes, Christian, maybe if you have some time you could look at/review the stopTags.txt we have in the analyzer right now?

http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/kuromoji/src/resources/org/apache/lucene/analysis/kuromoji/stoptags.txt?view=markup

I created this file from the IPADIC manual (there are likely silly errors too), in an attempt to also document the POS tagset.
But we should also see if the uncommented POS tags in that file are appropriate for a 'good stop set'. I think I just arbitrarily
picked a few, trying to be conservative.

Christian Moen added a comment - 23/Jan/12 08:18

Robert,

Thanks for the feedback. I completely agree with the idea of providing extensive language support out-of-the-box. My primary goal with donating Kuromoji was to do exactly that for Japanese.

I have to admit that I don't know all that much about the general state of the other items in contrib, but I at least think Kuromoji should be moved to core if we'd like to provide a good out-of-the-box Japanese experience. Perhaps moving other parts of contrib to core is a reasonable longer term goal?

I believe Kuromoji ended up in contrib because that seemed like the most reasonable starting point for integrating it at the time, rather than through careful consideration of where it should reside long term. Please feel free to chime in, Simon.

I'm proposing that we move Kuromoji to core to make Japanese supported out-of-the-box with a useful field type and corresponding documentation on the wiki. Longer term, I think we should do the same for other languages as well, but as you say, we can start with Japanese because it's complicated.

I've also had a look at your stop POS tags. I haven't reviewed them for completeness against IPADIC, but the defaults you have chosen as stop POS tags look fine to me. Good job.

Chris Male added a comment - 23/Jan/12 08:37

> But anyway, my vote is to move these analyzers to core and nuke this contrib totally. It would be great for some
> people to speak up so we get consensus on this, because it would only be more confusing to go back and forth between
> contrib and core.

+1 to moving them to core and removing the contrib. We support the Analyzers, they are of good quality with tests and obviously active development so should be treated as core.

Uwe Schindler added a comment - 23/Jan/12 08:48

+1, I don't see a reason why having a damn-stupid factory in contrib makes any sense; it can also reside in core. The actual analyzer is in the official Lucene distribution, so there is nothing wrong with moving to core, as it's just "glue" code.

Robert Muir added a comment - 27/Jan/12 03:36

As a first step, let's adjust the analyzer defaults so that the Lucene analyzer supports search mode by default.
I have a few questions about this mode I want to throw out there, so I'll create a new issue.

Robert Muir added a comment - 27/Jan/12 03:48

I opened ~~LUCENE-3726~~ for the search mode discussion.

Christian Moen added a comment - 30/Jan/12 07:06

Robert, I've improved the search mode heuristic (see ~~LUCENE-3730~~ with patch) and I've also provided some feedback on ~~LUCENE-3726~~. Before providing a patch to use search mode as our default, I'd like to do some corpus-based testing to make sure overall segmentation quality is where I'd like it to be.

As for this JIRA, I guess it has branched out into the following topics:

- Introduce field type for Japanese in schema.xml
- Move Kuromoji to core to make it generally available in Solr
- Get rid of contrib altogether

There seems to be consensus to move Kuromoji to core from at least three people (excluding myself).

Do you prefer that we conclude on ~~LUCENE-3726~~ before we follow up on getting Japanese support for Solr and Lucene working out-of-the-box – or can we conclude on default search mode separately?

I'm happy to start JIRAs for moving Kuromoji to get Japanese support in place if that's the best next course of action. Please advise. Many thanks.

Robert Muir added a comment - 30/Jan/12 11:55

> Get rid of contrib altogether

I still want to do this eventually, but let's do Kuromoji first.

> Do you prefer that we conclude on ~~LUCENE-3726~~ before we follow up on getting Japanese support for Solr and Lucene working out-of-the-box – or can we conclude on default search mode separately?

No reason to avoid working on issues in parallel; I don't think any of these block each other.

Robert Muir added a comment - 01/Feb/12 12:48

I'll do the svn moves portion of this, so that we can iterate on the default configuration etc.

I think what Christian defined in the description is the way to go as a start...

Robert Muir added a comment - 01/Feb/12 12:55

Here's the patch showing differences of the move:

The three factories and three tests are svn move'd, but StringMockSolrResourceLoader is also moved to test-framework (I find myself using this in analysis tests; we should probably look for more embedded duplicate copies of this thing).

I'll commit shortly.

Christian Moen added a comment - 01/Feb/12 14:39

Robert, I've built the latest trunk and I can confirm that the move is good. Thanks!

Attached a patch to introduce the text_ja field type to the Solr example schema (example/solr/conf/schema.xml) for trunk. Will look at branch_3x in due time.

Robert Muir added a comment - 01/Feb/12 14:44

Don't worry, I can just merge whatever we do here to branch_3x (I think it should be exactly the same anyway), so
we don't need a separate 3.x patch.

This patch looks good to me, though for good performance/relevance I think we should enable the stoptags by default?
This would be consistent with the Lucene analyzer (which, by the way, also has a small stopwords file: do we need those? Should it be improved?).
I can just put the stoptags file in the configuration directory if you think this works.

Robert Muir added a comment - 01/Feb/12 14:50

A couple of spelling nitpicks too:

dectionary -> dictionary
part-of-speeches -> part-of-speech tags
half-with -> half-width

Christian Moen added a comment - 01/Feb/12 15:12

Thanks for catching these and saving me the embarrassment of having them included in a release! Very sorry. The patch has been updated.

Christian Moen added a comment - 01/Feb/12 16:58 - edited

Robert, let's enable stop-words and stop-tags by default.

The stopwords list in the Lucene analyzer looks too small unless it's always used in combination with a stoptags filter. I'll look into both of these.

Also, if we're using search mode, part-of-speech F will decrease so we might want to rely more on stopwords rather than stoptags if it goes down by a whole lot. However, since tokens agree in 99.7% of the cases based on the tests I did earlier – and the part-of-speech tags we'd typically use as stop tags aren't involved with token-splits done by search mode, I don't expect this to be an issue, but it's something to keep in mind.

I'll run some tests to verify this and follow up by suggesting configuration.

I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations.

Robert Muir added a comment - 01/Feb/12 17:50

> I'll run some tests to verify this and follow up by suggesting configuration.
>
> I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations.

Sounds great: the existing files were not created with tests at all and are really arbitrary,
so I think this would be a big win.

Christian Moen added a comment - 02/Feb/12 08:04

I've opened up ~~LUCENE-3745~~ for stopwords and stoptags.

Christian Moen added a comment - 05/Feb/12 08:09

Stopwords and stoptags for Solr are now tracked in ~~SOLR-3097~~ and a patch is available.

Christian Moen added a comment - 05/Feb/12 08:12

Updated patch for schema.xml on trunk.

The field type text_ja now uses a KuromojiPartOfSpeechStopFilter and StopFilter for stopping and their configuration uses the stop sets in the ~~SOLR-3097~~ patch. Hence, ~~SOLR-3097~~ should be applied before or at the same time as this patch.
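
A field type along those lines could look roughly like the sketch below. The factory class names are the ones used elsewhere in this issue. The file names stoptags_ja.txt and stopwords_ja.txt, the position of the StopFilter in the chain, and the field name content_ja are assumptions for illustration, so check them against the attached patch and ~~SOLR-3097~~.

```xml
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
    <filter class="solr.KuromojiBaseFormFilterFactory"/>
    <!-- Removes tokens by part-of-speech tag; the file name is an assumption -->
    <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Removes common words; file name and position in the chain are assumptions -->
    <filter class="solr.StopFilterFactory" words="stopwords_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the type above -->
<field name="content_ja" type="text_ja" indexed="true" stored="true"/>
```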

Robert Muir added a comment - 08/Feb/12 13:13

Attached is Christian's patch, synced up to trunk.

Additionally, I modified the factory to be more lazy, such that you pay no RAM unless you then go and use text_ja.

Segmenter itself is very lightweight (except the first time called, where the classloader ensures the singletons are loaded). In fact the Lucene tokenizer even has a no-arg ctor with "new Segmenter()".

Because tokenstreams are reused anyway via threadlocal, we only call create() once per thread... and again it's just a lightweight Segmenter, which is likely cheaper than even all the attributesource stuff already needed for the tokenstream.

So this has no impact on kuromoji's performance, just defers the initialization so that if you don't use text_ja the resources are not loaded.

I reviewed the fieldtype, and only have one last question! (I didn't change anything from your configuration.)

I noticed the order of the tokenfilters is different from the order defined in KuromojiAnalyzer. This order can be important in some situations, so I think we should correct one or the other to be consistent?

Christian Moen added a comment - 08/Feb/12 13:59 - edited

Thanks a lot, Robert.

> I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations.

I created ~~LUCENE-3751~~ with a patch earlier to make sure the default Lucene and Solr configurations are aligned. Sorry for not pointing this out clearly by linking the JIRAs.

Robert Muir added a comment - 08/Feb/12 14:08

Ugh, sorry Christian... I totally missed that issue!

Let's take care of that one first...

Robert Muir added a comment - 08/Feb/12 14:24

OK, ~~LUCENE-3751~~ is good, but to totally match that I think we should adjust the stopfilter here
to ignore case (ignoreCase="true") by default.

It's a negligible cost, and, since the stopfilter comes before the lowercasefilter, it would prevent confusion if someone
were to also add English stopwords (The, etc.) to their stopset.

Someone could always change to ignoreCase=false, but I think that's more expert, and only good
as a default for languages like Turkish that have alternate casing behavior.

Robert Muir added a comment - 08/Feb/12 14:34

Actually, sorry, this is consistent. I missed the fact KuromojiAnalyzer actually
explicitly loads the stopset with ignoreCase=false.

I still am unsure if we should do that, if the lowercasefilter is going to be
after the stopwordfilter, just for the same reasons I mentioned above.

Christian Moen added a comment - 08/Feb/12 15:24 - edited

Thanks, Robert.

I was thinking of leaving the StopFilter case-sensitive, as I thought not having it normalized would give us flexibility, but it's also prone to error and surprises. I think it's reasonable to make the default ignore case to support adding English or other romaji terms to the stopset with ease.

However, if we follow this path, we might also want to do width-normalization for the Japanese stopset to make sure there's no confusion with that, either. I suggest that we resolve that as a separate issue and just document this clearly in the stopset file.

I think it's still reasonable to leave the LowerCaseFilter last as-is, though, so that users won't need to reorder the chain in case they want case-sensitive stopping.

I'll update the configuration in both KuromojiAnalyzer and the text_ja field type to ignore case in their StopFilter tomorrow.
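
As a sketch of that change on the Solr side, the StopFilter entry in text_ja would gain ignoreCase="true" while LowerCaseFilter stays last in the chain. The words file name is an assumption here; ignoreCase itself is a standard StopFilterFactory attribute.

```xml
<!-- Case-insensitive stopping: "The" in the stopset also matches "the" and "THE".
     The file name stopwords_ja.txt is an assumption. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" enablePositionIncrements="true"/>
<!-- Lower-casing stays last, so switching to ignoreCase="false" above
     gives case-sensitive stopping without reordering the chain -->
<filter class="solr.LowerCaseFilterFactory"/>
```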

Robert Muir added a comment - 08/Feb/12 15:48

> However, if we follow this path, we might also want to do width-normalization for the Japanese stopset to make sure there's no confusion with that, either. I suggest that we resolve that as a separate issue.

Well, I think in general we could probably solve the width issue with documentation.
The reason is that supporting a lot of different 'casing' schemes (especially ones that aren't 1:1, like normalizing width of kana),
in CharArrayMap/Set could become confusing and tricky.

For example, because GreekAnalyzer's stopword list expects sigma to always be 'σ' and never 'ς' (even in final position), we document
that the stopword list should also be configured this way:

   * <b>NOTE:</b> The stopwords set should be pre-processed with the logic of 
   * {@link GreekLowerCaseFilter} for best results.

But I think we should also document any expectations in the example file itself, now that we are also using them as example configurations
for Solr users (who, we might expect, would never read the javadocs of the corresponding Analyzer).

I'll redundantly add comments to the stoplists where appropriate for the other languages, but I think it's a good way to solve the width issue too.

Christian Moen added a comment - 09/Feb/12 10:48

I agree, Robert. I'll add suitable documentation to stopwords.txt to clarify case- and width-handling.

Find attached a patch that includes your latest changes and a StopFilter ignoring case for text_ja. I've also revised the comments some and made sure the morphological field type text_ja references text_cjk to make users aware of the bigram alternative as well.
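
For comparison, a bigram-based field type in the spirit of text_cjk might look like the sketch below. This is an assumption about its shape built from the stock CJK filter factories, not a copy of what the patch contains.

```xml
<!-- Sketch of a bigram alternative: no dictionary or morphological analysis,
     CJK text is indexed as overlapping character bigrams -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Same width normalization as used in text_ja -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Forms bigrams from adjacent CJK characters -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Bigrams avoid segmentation errors at the cost of a larger index and less precise matching, which is why the two field types are worth presenting side by side.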

Christian Moen added a comment - 09/Feb/12 11:02

KuromojiAnalyzer (~~LUCENE-3751~~) has also been updated to ignore case in its StopFilter.

Christian Moen added a comment - 09/Feb/12 11:38

An improved description of stopwords.txt/stopwords_ja.txt with patch to clarify case- and width-handling is tracked by ~~SOLR-3115~~.

Robert Muir added a comment - 09/Feb/12 22:46

Thanks for the hard work here Christian!

Robert Muir added a comment - 09/Feb/12 22:58

Found a tiny nitpick:

just an unpaired XML comment... for some reason everything worked fine with this (I have no clue...)

People

Assignee: Unassigned

Reporter: Christian Moen

Dates

Created: 22/Jan/12 18:22

Updated: 10/May/13 10:39

Resolved: 09/Feb/12 22:46