SOLR-3056: Introduce Japanese field type in schema.xml

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again Robert, Uwe and Simon). It would be very good to get a default field type defined for Japanese in schema.xml so we can get good Japanese out-of-the-box support in Solr.

      I've been playing with the below configuration today, which I think is a reasonable starting point for Japanese. There's a lot to be said about the various considerations necessary when searching Japanese, but perhaps a wiki page is more suitable for covering the wider topic?

      In order to make the below text_ja field type work, Kuromoji itself and its analyzers need to be seen by the Solr classloader. However, these are currently in contrib and I'm wondering if we should consider moving them to core to make them directly available. If there are concerns with additional memory usage, etc. for non-Japanese users, we can make sure resources are loaded lazily and only when needed in factory-land.

      Any thoughts?

      <!-- Text field type is suitable for Japanese text using morphological analysis
      
           NOTE: Please copy files
             contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
             dist/apache-solr-analysis-extras-x.y.z.jar
           to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
           (x.y.z refers to a version number)
      
           If you would like to optimize for precision, default operator AND with
             <solrQueryParser defaultOperator="AND"/>
           below (this file).  Use "OR" if you would like to optimize for recall (default).
      -->
      <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
        <analyzer>
          <!-- Kuromoji Japanese morphological analyzer/tokenizer
      
               Use search-mode to get a noun-decompounding effect useful for search.
      
               Example:
                 関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際 (International) 空港 (airport)
                 so we get a match for 空港 (airport) as we would expect from a good search engine
      
               Valid values for mode are:
                  normal: default segmentation
                  search: segmentation useful for search (extra compound splitting)
                extended: search mode with unigramming of unknown words (experimental)
      
               NOTE: Search mode improves segmentation for search at the expense of part-of-speech accuracy
          -->
          <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
          <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
          <filter class="solr.KuromojiBaseFormFilterFactory"/>
          <!-- Optionally remove tokens with certain part-of-speech tags
          <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/> -->
          <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
          <filter class="solr.CJKWidthFilterFactory"/>
          <!-- Lower-case romaji characters -->
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
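
      For context, a field using this type would then be declared elsewhere in schema.xml. A minimal sketch (the field name here is illustrative, not part of the patch):

      ```xml
      <!-- Example only: a stored, indexed field using the Japanese text type above -->
      <field name="title_ja" type="text_ja" indexed="true" stored="true"/>
      ```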
      
      1. SOLR-3056.patch
        4 kB
        Robert Muir
      2. SOLR-3056.patch
        5 kB
        Christian Moen
      3. SOLR-3056_typo.patch
        1.0 kB
        Robert Muir
      4. SOLR-3056_schema40.patch
        2 kB
        Christian Moen
      5. SOLR-3056_schema40.patch
        2 kB
        Christian Moen
      6. SOLR-3056_schema40.patch
        3 kB
        Christian Moen
      7. SOLR-3056_move.patch
        7 kB
        Robert Muir

        Issue Links

          Activity

          Robert Muir added a comment -

          It would be very good to get a default field type defined for Japanese in schema.xml so we can get good Japanese out-of-the-box support in Solr.

          I agree, we really need this for all languages, including stopwords_xx files and field types actually,
          but let's start with Japanese because it's complicated.

          I've been playing with the below configuration today, which I think is a reasonable starting point for Japanese. There's a lot to be said about the various considerations necessary when searching Japanese, but perhaps a wiki page is more suitable for covering the wider topic?

          I think the ideal situation would be to have a single reasonable default (like the configuration you have), but then also a
          full wiki page on Kuromoji explaining the different options, maybe even with alternative configurations or examples. we could
          link to this page from the other wikipages about the analyzers.

          In order to make the below text_ja field type work, Kuromoji itself and its analyzers need to be seen by the Solr classloader. However, these are currently in contrib and I'm wondering if we should consider moving them to core to make them directly available. If there are concerns with additional memory usage, etc. for non-Japanese users, we can make sure resources are loaded lazily and only when needed in factory-land.

          Yeah, I don't think having Kuromoji in contrib is ideal. I think instead we should have examples for all supported languages
          so it's easy to get started. Currently someone has to jump through serious hoops to segment Chinese or Japanese into words,
          but as I mentioned before, all non-English languages are currently 'hard' in that there are no field types set up for them.

          But anyway, my vote is to move these analyzers to core and nuke this contrib totally. It would be great for some
          people to speak up and get consensus on this, because it would only be more confusing to go back and forth between
          contrib and core.

          As far as the default configuration,

          Christian maybe if you have some time you could look at/review the stopTags.txt we have in the analyzer right now?

          http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/kuromoji/src/resources/org/apache/lucene/analysis/kuromoji/stoptags.txt?view=markup

          I created this file from the IPADIC manual (there could well be silly errors in it too), in an attempt to also document the POS tagset.
          But we should also see if the uncommented POS tags in that file are appropriate for a 'good stop set'. I think I just arbitrarily
          picked a few, trying to be conservative.

          Christian Moen added a comment -

          Robert,

          Thanks for the feedback. I completely agree with the idea of providing extensive language support out-of-the-box. My primary goal with donating Kuromoji was to do exactly that for Japanese.

          I have to admit that I don't know all that much about the general state of the other items in contrib, but I at least think Kuromoji should be moved to core if we'd like to provide a good out-of-the-box Japanese experience. Perhaps moving other parts of contrib to core is a reasonable longer term goal?

          I believe Kuromoji ended up in contrib since that seemed like the most reasonable starting point for integrating it at the time, rather than from careful consideration of where it should reside long term. Please feel free to chime in, Simon.

          I'm proposing that we move Kuromoji to core to make Japanese supported out-of-the-box with a useful field type and corresponding documentation on the wiki. Longer term, I think we should do the same for other languages as well, but as you say, we can start with Japanese because it's complicated.

          I've also had a look at your stop POS tags. I haven't reviewed them for completeness against IPADIC, but the defaults you have chosen as stop POS tags look fine to me. Good job.

          Chris Male added a comment -

          but anyway, my vote is to move these analyzers to core and nuke this contrib totally. But it would be great for some
          people to speak up and get consensus on this because it would only be more confusing to go back and forth between
          contrib and core.

          +1 to moving them to core and removing the contrib. We support the analyzers; they are of good quality, with tests and obviously active development, so they should be treated as core.

          Uwe Schindler added a comment -

          +1. I don't see a reason why having a damn-stupid factory in contrib makes any sense; it can just as well reside in core. The actual analyzer is in the official Lucene distribution, so there is nothing wrong with moving to core, as it's just "glue" code.

          Robert Muir added a comment -

          As a first step, let's adjust the analyzer defaults so that the Lucene analyzer supports search mode by default.
          I have a few questions about this mode I want to throw out there, so I'll create a new issue.

          Robert Muir added a comment -

          I opened LUCENE-3726 for the search mode discussion.

          Christian Moen added a comment -

          Robert, I've improved the search mode heuristic (see LUCENE-3730 with patch) and I've also provided some feedback on LUCENE-3726. Before providing a patch to use search mode as our default, I'd like to do some corpus-based testing to make sure overall segmentation quality is where I'd like it to be.

          As for this JIRA, I guess it has branched out into the following topics:

          1. Introduce field type for Japanese in schema.xml
          2. Move Kuromoji to core to make it generally available in Solr
          3. Get rid of contrib altogether

          There seems to be consensus to move Kuromoji to core from at least three people (excluding myself).

          Do you prefer that we conclude on LUCENE-3726 before we follow up on getting Japanese support for Solr and Lucene working out-of-the-box – or can we conclude on default search mode separately?

          I'm happy to start JIRAs for moving Kuromoji to get Japanese support in place if that's the best next course of action. Please advise. Many thanks.

          Robert Muir added a comment -

          Get rid of contrib altogether

          I still want to do this eventually, but let's do Kuromoji first.

          Do you prefer that we conclude on LUCENE-3726 before we follow up on getting Japanese support for Solr and Lucene working out-of-the-box – or can we conclude on default search mode separately?

          No reason to avoid working issues in parallel, I don't think any of these block each other.

          Robert Muir added a comment -

          I'll do the svn moves portion of this, so that we can iterate on the default configuration etc.

          I think what Christian defined in the description is the way to go as a start...

          Robert Muir added a comment -

          Here's the patch showing differences of the move:

          The three factories and 3 tests are svn move'd, but StringMockSolrResourceLoader is also moved to test-framework (I find myself using this in analysis tests; we should probably look for more embedded duplicated copies of this thing).

          I'll commit shortly.

          Christian Moen added a comment -

          Robert, I've built the latest trunk and I can confirm that the move is good. Thanks!

          Attached is a patch to introduce the text_ja field type to the Solr example schema (example/solr/conf/schema.xml) for trunk. I will look at branch_3x in due time.

          Robert Muir added a comment -

          Don't worry, I can just merge whatever we do here to branch_3x (I think it should be exactly the same anyway), so
          we don't need a separate 3.x patch.

          This patch looks good to me, though for good performance/relevance I think we should enable the stoptags by default.
          This would be consistent with the Lucene analyzer (which, by the way, also has a small stopwords file; do we need those? should it be improved?).
          I can just put the stoptags file in the configuration directory if you think this works.
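
          Concretely, enabling stop tags by default would amount to uncommenting the part-of-speech filter from the issue description and pointing it at a file in the conf directory; a sketch, with the file name assumed for illustration:

          ```xml
          <!-- Remove tokens whose part-of-speech tag is listed in the tags file
               (the file name stoptags_ja.txt is an assumption in this sketch) -->
          <filter class="solr.KuromojiPartOfSpeechStopFilterFactory"
                  tags="stoptags_ja.txt"
                  enablePositionIncrements="true"/>
          ```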

          Robert Muir added a comment -

          A couple of spelling nitpicks too:

          • dectionary -> dictionary
          • part-of-speeches -> part-of-speech tags
          • half-with -> half-width
          Christian Moen added a comment -

          Thanks for catching these and saving me the embarrassment of having them included in a release! Very sorry. The patch has been updated.

          Christian Moen added a comment - - edited

          Robert, let's enable stop-words and stop-tags by default.

          The stopwords list in the Lucene analyzer looks too small unless it's always used in combination with a stoptags filter. I'll look into both of these.

          Also, if we're using search mode, part-of-speech F will decrease, so we might want to rely more on stopwords than stoptags if it goes down by a whole lot. However, since tokens agree in 99.7% of the cases based on the tests I did earlier, and the part-of-speech tags we'd typically use as stop tags aren't involved in the token splits done by search mode, I don't expect this to be an issue, but it's something to keep in mind.

          I'll run some tests to verify this and follow up by suggesting configuration.

          I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations.

          Robert Muir added a comment -

          I'll run some tests to verify this and follow up by suggesting configuration.

          I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations.

          Sounds great: the existing files were not created with tests at all and are really arbitrary,
          so I think this would be a big win.

          Christian Moen added a comment -

          I've opened up LUCENE-3745 for stopwords and stoptags.

          Christian Moen added a comment -

          Stopwords and stoptags for Solr are now tracked in SOLR-3097 and a patch is available.

          Christian Moen added a comment -

          Updated patch for schema.xml on trunk.

          The field type text_ja now uses a KuromojiPartOfSpeechStopFilter and StopFilter for stopping and their configuration uses the stop sets in the SOLR-3097 patch. Hence, SOLR-3097 should be applied before or at the same time as this patch.

          Robert Muir added a comment -

          Attached is Christian's patch, synced up to trunk.

          Additionally, I modified the factory to be lazier, so that you pay no RAM unless you actually use text_ja.

          Segmenter itself is very lightweight (except the first time it is called, when the classloader ensures the singletons are loaded). In fact, the Lucene tokenizer even has a no-arg ctor with "new Segmenter()".

          Because tokenstreams are reused anyway via a ThreadLocal, we only call create() once per thread, and again it's just a lightweight Segmenter, which is likely cheaper than even all the AttributeSource machinery already needed for the tokenstream.

          So this has no impact on kuromoji's performance, just defers the initialization so that if you don't use text_ja the resources are not loaded.

          I reviewed the field type, and only have one last question! (I didn't change anything in your configuration.)

          I noticed the order of the token filters is different from the order defined in KuromojiAnalyzer. This order can be important in some situations, so I think we should correct one or the other to be consistent?
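
          One way to resolve this would be to give the field type the same filter order as the analyzer. The sketch below shows one plausible consistent ordering; the exact KuromojiAnalyzer order and the stop-file names should be checked against the Lucene source rather than taken from this example:

          ```xml
          <analyzer>
            <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
            <filter class="solr.KuromojiBaseFormFilterFactory"/>
            <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stoptags_ja.txt"
                    enablePositionIncrements="true"/>
            <filter class="solr.CJKWidthFilterFactory"/>
            <filter class="solr.StopFilterFactory" words="stopwords_ja.txt"
                    enablePositionIncrements="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
          ```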

          Christian Moen added a comment - - edited

          Thanks a lot, Robert.

          I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations.

          I created LUCENE-3751 with a patch earlier to make sure the default Lucene and Solr configurations are aligned. Sorry for not pointing this out clearly by linking the JIRAs.

          Robert Muir added a comment -

          Ugh, sorry Christian... I totally missed that issue!

          Lets take care of that one first...

          Robert Muir added a comment -

          OK, LUCENE-3751 is good, but to match it completely I think we should adjust the stop filter here
          to ignore case (ignoreCase="true") by default.

          It's a negligible cost and, since it comes before the stop filter, it would prevent confusion if someone
          were to also add English stopwords (The, etc.) to their stopset.

          Someone could always change to ignoreCase=false, but I think that's more expert, and only good
          as a default for languages like Turkish that have alternate casing behavior.
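
          The proposed change is a single attribute on the stop filter; a sketch (the stopwords file name is an assumption here):

          ```xml
          <filter class="solr.StopFilterFactory"
                  words="stopwords_ja.txt"
                  ignoreCase="true"
                  enablePositionIncrements="true"/>
          ```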

          Robert Muir added a comment -

          Actually, sorry, this is consistent. I missed the fact that KuromojiAnalyzer actually
          explicitly loads the stopset with ignoreCase=false.

          I'm still unsure if we should do that, if the lowercase filter is going to be
          after the stopword filter, for the same reasons I mentioned above.

          Christian Moen added a comment - - edited

          Thanks, Robert.

          I was thinking of leaving the StopFilter case-sensitive, as I thought not normalizing would give us flexibility, but it's also prone to error and surprises. I think it's reasonable to make the default ignore case, to support adding English or other romaji terms to the stopset with ease.

          However, if we follow down this path, we might also want to do width-normalization for the Japanese stopset to make sure there's no confusion with that, either. I suggest that we resolve that as a separate issue and just document this clearly in the stopset file.

          I think it's still reasonable to leave the LowerCaseFilter last as-is, though, so that users won't need to reorder the chain in case they want case-sensitive stopping.

          I'll update the configuration in both KuromojiAnalyzer and the text_ja field type to ignore case in their StopFilter tomorrow.

          Robert Muir added a comment -

          However, if we follow down this path, we might also want to do width normalization for the Japanese stopset to make sure there's no confusion with that, either. I suggest that we resolve that as a separate issue.

          Well, I think in general we could probably solve the width issue with documentation. The reason is that supporting a lot of different 'casing' schemes (especially ones that aren't 1:1, like normalizing the width of kana) in CharArrayMap/Set could become confusing and tricky.

          For example, because GreekAnalyzer's stopword list expects sigma to always be 'σ' and never 'ς' (even in final position), we document
          that the stopword list should also be configured this way:

             * <b>NOTE:</b> The stopwords set should be pre-processed with the logic of 
             * {@link GreekLowerCaseFilter} for best results.
          

          But I think we should also document any expectations in the example file itself, now that we are also using them as example configurations
          for Solr users (who, we might expect, would never read the javadocs of the corresponding Analyzer).

          I'll redundantly add comments to the stoplists where appropriate for the other languages, but I think it's a good way to solve the width issue too.
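          A stoplist header along these lines could capture the expectations being discussed; the wording below is hypothetical (the actual documentation patch is what matters), and the sample entries are ordinary Japanese particles shown only for illustration.

              # Note: this stopword set is matched case-insensitively by the StopFilter
              # in the text_ja example configuration, so romaji entries may be written
              # in any case. No width normalization is applied to the set itself, so
              # entries should use the character widths the tokenizer emits.
              は
              が
              の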

          Hide
          Christian Moen added a comment -

          I agree, Robert. I'll add suitable documentation to stopwords.txt to clarify case- and width-handling.

          Attached is a patch that includes your latest changes and a StopFilter ignoring case for text_ja. I've also revised the comments somewhat and made sure the morphological field type text_ja references text_cjk to make users aware of the bigram alternative as well.

          Hide
          Christian Moen added a comment -

          KuromojiAnalyzer (LUCENE-3751) has also been updated to ignore case in its StopFilter.

          Hide
          Christian Moen added a comment -

          An improved description of stopwords.txt/stopwords_ja.txt with patch to clarify case- and width-handling is tracked by SOLR-3115.

          Hide
          Robert Muir added a comment -

          Thanks for the hard work here Christian!

          Hide
          Robert Muir added a comment -

          Found a tiny nitpick:

          just an unpaired XML comment... for some reason everything worked fine with this (I have no clue why...)


            People

            • Assignee: Unassigned
            • Reporter: Christian Moen
