Lucene - Core
LUCENE-3726

Default KuromojiAnalyzer to use search mode

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Kuromoji supports an option to segment text in a way more suitable for search,
      by avoiding long compound nouns as indexing terms.

      In general, 'how you segment' can be important depending on the application
      (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this in Chinese).

      The current algorithm penalizes the path cost for long runs of kanji, based on some
      parameters (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc.).
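
      As a rough illustration only (this is not the actual Kuromoji code, and the values
      and formula below are made up), the penalty works along these lines:

      // Illustrative sketch, not Kuromoji's implementation; values are hypothetical.
      static final int SEARCH_MODE_LENGTH = 3;       // kanji runs longer than this get penalized
      static final int SEARCH_MODE_PENALTY = 10000;  // extra cost per character beyond that length

      int penalizedCost(int cost, int kanjiRunLength) {
        if (kanjiRunLength > SEARCH_MODE_LENGTH) {
          // make long compound candidates more expensive so the best Viterbi path
          // prefers splitting them into shorter tokens
          cost += (kanjiRunLength - SEARCH_MODE_LENGTH) * SEARCH_MODE_PENALTY;
        }
        return cost;
      }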

      Some questions (these can be separate future issues if any useful ideas come out):

      • should these parameters continue to be static-final, or configurable?
      • should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?
      • is the Tokenizer the best place to do this, or should we do it in a tokenfilter? or both?
        With a tokenfilter, one idea would be to also preserve the original indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0) (see the sketch after this list).
        From my understanding this tends to help with noun compounds in other languages, because the IDF of the original term boosts 'exact' compound matches.
        But does a tokenfilter provide the segmenter enough 'context' to do this properly?
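
      A minimal sketch of the overlapping-token idea from the last bullet, assuming a
      hypothetical splitParts() hook for the decompounding (neither splitParts() nor this
      filter exist in Kuromoji or Lucene; only the position-increment bookkeeping matters here):

      import java.io.IOException;
      import java.util.Collections;
      import java.util.LinkedList;
      import java.util.List;
      import org.apache.lucene.analysis.TokenFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

      /** Sketch: emit decompounded parts, then replay the original compound at posInc=0. */
      public final class OverlappingCompoundFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
        private final LinkedList<String> pendingParts = new LinkedList<String>();
        private String pendingCompound;

        public OverlappingCompoundFilter(TokenStream input) {
          super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
          if (!pendingParts.isEmpty()) {
            termAtt.setEmpty().append(pendingParts.removeFirst());
            posIncAtt.setPositionIncrement(1);   // each remaining part starts a new position
            return true;
          }
          if (pendingCompound != null) {
            termAtt.setEmpty().append(pendingCompound);
            posIncAtt.setPositionIncrement(0);   // the original compound overlaps the last part
            pendingCompound = null;
            return true;
          }
          if (!input.incrementToken()) {
            return false;
          }
          String term = termAtt.toString();
          List<String> parts = splitParts(term);      // hypothetical decompounding hook
          if (parts.size() > 1) {
            termAtt.setEmpty().append(parts.get(0));               // emit AB now
            pendingParts.addAll(parts.subList(1, parts.size()));   // queue CD, ...
            pendingCompound = term;                                // queue ABCD to replay at posInc=0
          }
          return true;
        }

        @Override
        public void reset() throws IOException {
          super.reset();
          pendingParts.clear();
          pendingCompound = null;
        }

        private List<String> splitParts(String term) {
          // placeholder: real candidates would have to come from the segmenter/lattice
          return Collections.singletonList(term);
        }
      }

      A real filter would also have to handle offsets and other attributes properly; the sketch
      only shows how ABCD -> AB, CD, ABCD(posInc=0) could be produced.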

      Either way, I think as a start we should turn on what we have by default: it's likely a very easy win.

      Attachments

      1. LUCENE-3726.patch
        2 kB
        Christian Moen
      2. LUCENE-3726.patch
        3 kB
        Christian Moen
      3. LUCENE-3726.patch
        4 kB
        Christian Moen
      4. kuromojieval.tar.gz
        2.01 MB
        Christian Moen

        Activity

        Christian Moen added a comment -

        These are very interesting questions, Robert. Please find my comments below.

        should these parameters continue to be static-final, or configurable?

        It's perhaps possible to make these configurable, but I think we'd be exposing configuration that is most likely to confuse most users rather than help them.

        The values currently used were found through some analysis and experimentation, and they can probably be improved both in terms of tuning and with added heuristics – in particular for katakana compounds (more below).

        However, changing and improving this requires quite detailed analysis and testing. I think the major case for exposing them is as a means of easily tuning them, rather than these parameters being generally useful to users.

        should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?

        Very good question and an interesting idea.

        In the case of long kanji words such as 関西国際空港 (Kansai International Airport), which is a known noun, we can possibly use POS info as a hint for applying the Viterbi penalty. In the case of unknown kanji, Kuromoji unigrams them. (関西国際空港 becomes 関西 国際 空港 (Kansai International Airport) using search mode.)

        Katakana compounds such as シニアソフトウェアエンジニア (senior software engineer) become one token without search mode, but when search mode is used, we get the three tokens シニア ソフトウェア エンジニア as you would expect. It's also the case that シニアソフトウェアエンジニア is an unknown word, but its constituents become known and get the correct POS after search mode.

        In general, unknown words get a noun-POS (名詞) so the idea of using POS here should be fine.

        There are some problems with the katakana decompounding in search mode. For example, コニカミノルタホールディングス (Konica Minolta Holdings) becomes コニカ ミノルタ ホール ディングス (Konica Minolta horu dingusu), where we get the extra token ホール (which also means 'hall' in Japanese).

        To sum up, I think we can potentially use the noun-POS as a hint when doing the decompounding in search mode. I'm not sure how much we would benefit from it, but I like the idea. I think we'll benefit most from an improved heuristic for non-kanji to improve katakana decompounding.

        Let me have a tinker and see how I can improve this.

        is the Tokenizer the best place to do this, or should we do it in a tokenfilter? or both?

        Interesting idea and good point regarding IDF.

        In order to do the decompounding, we'll need access to the lattice and add entries to it before we run the Viterbi. If we do normal segmentation first and then run a decompounding filter, I think we'll need to run the Viterbi twice in order to get the desired results. (Optimizations are possible, though.)

        I'm thinking a possibility could be to expose possible decompounds as part of Kuromoji's Token interface. We can potentially have something like

        Token.java
        
        /**
         * Returns a list of possible decompounds for this token found by a heuristic
         *
         * @return a list of candidate decompounds, or null if none are found
         */
        public List<Token> getDecompounds() {
          // ...
        }
        

        In the case of シニアソフトウェアエンジニア, the current token would have the surface form シニアソフトウェアエンジニア, with the tokens シニア, ソフトウェア and エンジニア accessible using getDecompounds().
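
        For illustration, a consumer of this proposed API might look roughly like the following
        (getDecompounds() does not exist yet, and the exact Segmenter/Token usage shown here is
        an assumption):

        // Sketch against the proposed API above, not the current one.
        Segmenter segmenter = new Segmenter(Mode.SEARCH);
        for (Token token : segmenter.tokenize("シニアソフトウェアエンジニア")) {
          List<Token> parts = token.getDecompounds();
          if (parts != null) {
            for (Token part : parts) {
              // シニア, ソフトウェア and エンジニア would show up here
              System.out.println(part);
            }
          }
        }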

        As a general note, I should point out that how well the heuristics perform depends on the dictionary/statistical model used (i.e. IPADIC), and we might want to make different heuristics for each of the models we support, as needed.

        Robert Muir added a comment (edited) -

        I'm thinking a possibility could be to expose possible decompounds as part of Kuromoji's Token interface.

        I like this idea: I think it would give the most flexibility. We would populate some attribute from
        Token just like we do today for other attributes, and then the actual indexing of compounds can be
        controlled with a configurable tokenfilter.

        Long term, this lets the tokenizer stay a tokenizer and prevents it from growing too complex.

        Christian Moen added a comment -

        Thanks for the feedback.

        I'm working on tuning the heuristics to improve accuracy of katakana segmentation in search mode.

        I'll keep you posted on results and a patch. Unit tests will document the cases.

        Christian Moen added a comment -

        I've improved the heuristic and submitted a patch to LUCENE-3730, which covers the issue.

        We can now deal with cases such as コニカミノルタホールディングス and many others just fine. The former becomes コニカ ミノルタ ホールディングス as we'd like.

        I think we should apply LUCENE-3730 before changing any defaults – and also independently of changing any defaults. I think we should also make sure that the default we use for Lucene is consistent with Solr's default in schema.xml for text_ja.

        I'll do additional tests on a Japanese corpus and provide feedback, and we can use this as a basis for how to follow up. Hopefully, we'll have sufficient and good data to conclude on this.

        Christian Moen added a comment -

        I've segmented some Japanese Wikipedia text into sentences (using a naive sentence segmenter) and then segmented each sentence using both normal and search mode with the Kuromoji on GitHub that has LUCENE-3730 applied. Segmentation with Kuromoji in Lucene should be similar overall (modulo some differences in punctuation handling).

        Search mode and normal mode segmentation match completely in 90.7% of the sentences segmented and there's a 99.6% match at the token level (when counting normal mode tokens).

        Find attached some HTML files with a total of 10,000 sentences that demonstrate the differences in segmentation.

        Overall, I think search mode does a decent job. I've written to someone else doing Japanese NLP to get a second opinion, in particular on whether the kanji splitting should be made somewhat less eager to split three-letter words.

        Christian Moen added a comment -

        The latest attached patch introduces a default mode in Segmenter, which is now Mode.SEARCH.

        This mode is used by KuromojiAnalyzer in Lucene without further code changes. The Solr factory used to duplicate the default mode, but now retrieves it from Segmenter. This way, we set the default mode for both Solr and Lucene in a single place (in Segmenter), which I find cleaner.
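
        As a minimal sketch of what that amounts to (the field and constructor shapes below are
        assumptions for illustration, not copied from the patch):

        // Sketch of the described change; exact names/shape may differ from the patch.
        public class Segmenter {
          public static final Mode DEFAULT_MODE = Mode.SEARCH;

          public Segmenter() {
            this(DEFAULT_MODE);
          }

          public Segmenter(Mode mode) {
            // ... set up segmentation for the given mode
          }
        }

        // KuromojiAnalyzer and the Solr factory both read the default from Segmenter
        // instead of hard-coding a mode themselves.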

        I've also moved some constructors around in Segmenter and made some minor formatting/style changes.

        Robert Muir added a comment -

        Thanks Christian: I committed this.


          People

          • Assignee:
            Robert Muir
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            4
