Lucene - Core
LUCENE-3767

Explore streaming Viterbi search in Kuromoji

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've been playing with the idea of changing the Kuromoji viterbi
      search to be 2 passes (intersect, backtrace) instead of 4 passes
      (break into sentences, intersect, score, backtrace)... this is very
      much a work in progress, so I'm just getting my current state up.
      It's got tons of nocommits, doesn't properly handle the user dict nor
      extended modes yet, etc.

      One thing I'm playing with is to add a double backtrace for the long
      compound tokens, ie, instead of penalizing these tokens so that
      shorter tokens are picked, leave the scores unchanged but on backtrace
      take that penalty and use it as a threshold for a 2nd best
      segmentation...
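
      Below is a minimal, self-contained sketch (not the patch code) of the "penalty as
      threshold" idea in the paragraph above: the Viterbi runs with unmodified scores, and
      only at backtrace time does the length penalty decide whether a second-best
      segmentation of a long compound is also emitted. The lengthPenalty() helper and its
      constants are made-up stand-ins, not the tokenizer's real tuning values.

      // Hypothetical stand-in for the search-mode length penalty; constants are illustrative only.
      public class SecondBestThresholdSketch {

        static int lengthPenalty(int numChars) {
          return numChars > 3 ? (numChars - 3) * 2000 : 0;
        }

        /**
         * @param compoundChars      length of the compound token, in chars
         * @param bestPathCost       Viterbi cost of the (unpenalized) best path containing the compound
         * @param secondBestPathCost cost of the best path that splits the compound instead
         */
        static boolean emitSecondSegmentation(int compoundChars, int bestPathCost, int secondBestPathCost) {
          int threshold = lengthPenalty(compoundChars);
          if (threshold == 0) {
            return false; // token is not "too long"; nothing extra to output
          }
          // The penalty never changed the primary search; it only acts here, as a budget
          // by which the alternative segmentation may exceed the unpenalized best path.
          return secondBestPathCost - bestPathCost <= threshold;
        }

        public static void main(String[] args) {
          // Toy numbers: a 6-char compound whose split-up path costs 1500 more.
          System.out.println(emitSecondSegmentation(6, 10000, 11500)); // prints true
        }
      }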

      1. compound_diffs.txt
        48 kB
        Michael McCandless
      2. LUCENE-3767_branch_3x.patch
        187 kB
        Christian Moen
      3. LUCENE-3767.patch
        186 kB
        Michael McCandless
      4. LUCENE-3767.patch
        184 kB
        Michael McCandless
      5. LUCENE-3767.patch
        188 kB
        Michael McCandless
      6. LUCENE-3767.patch
        68 kB
        Michael McCandless
      7. LUCENE-3767.patch
        59 kB
        Michael McCandless
      8. SolrXml-5498.xml
        31 kB
        Christian Moen


          Activity

          Michael McCandless added a comment -

          New patch.

          I cleaned up some silly nocommits, got extended mode and user dict working, added a few more nocommits.

          All tests now pass if you set KuromojiTokenizer2.DO_OUTPUT_COMPOUND to false, which does the same score penalty logic as the current tokenizer.

           Michael McCandless added a comment -

           I created a branch for this issue: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3767

          Michael McCandless added a comment -

          I think the branch is ready to land... I'll post an applyable patch
          soon.

          In Mode.SEARCH the tokenizer produces the same tokens as current
          trunk.

          The only real end-user visible change is the addition of
          Mode.SEARCH_WITH_COMPOUNDS, which can produce two paths (compound
          token + its segmentation). This mode uses the new
          PositionLengthAttribute to record how "long" the compound token is.

          In this mode, the Viterbi search first runs without penalties, but
          then, if a too-long token (a token where the penalty would have been >
          0) is in the best path, we effectively re-run the Viterbi under that
          compound token, this time with penalties included. If this results in
          a different backtrace, we add that into the output tokens as well.
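
           A minimal sketch (not from the patch) of how a consumer might read this graph
           output; the KuromojiTokenizer constructor shown and the location of the Mode
           enum are assumptions based on this description, and the mode name reflects this
           intermediate patch (later comments fold the graph output into Mode.SEARCH):

           import java.io.StringReader;
           import org.apache.lucene.analysis.TokenStream;
           import org.apache.lucene.analysis.kuromoji.KuromojiTokenizer;
           import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
           import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
           import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

           public class CompoundGraphDump {
             public static void main(String[] args) throws Exception {
               // Constructor arguments and mode name are assumptions from this patch.
               TokenStream ts = new KuromojiTokenizer(
                   new StringReader("シニアソフトウェアエンジニア"),
                   null /* no user dictionary */,
                   true /* discard punctuation */,
                   KuromojiTokenizer.Mode.SEARCH_WITH_COMPOUNDS);
               CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
               PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
               PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
               ts.reset();
               int pos = -1;
               while (ts.incrementToken()) {
                 pos += posIncr.getPositionIncrement();
                 // The compound token reports posLen > 1 (how many positions it spans);
                 // its decompounded parts are stacked from the same position with posLen == 1.
                 System.out.println(term + " pos=" + pos + " posLen=" + posLen.getPositionLength());
               }
               ts.end();
               ts.close();
             }
           }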

           Note that this will not produce the same results as Mode.SEARCH,
          because the 2nd segmentation runs "in context" of the best path,
          meaning the chosen best wordID before and after the compound token are
          "enforced" in the 2nd segmentation. Sometimes this results in still
          picking only the compound token where trunk today would have split it
          up. From TestQuality, the total number of edits was 4418 vs trunk's
          4828.

          I didn't explore this, but, we may want to use harsher penalties in
          SEARCH_WITH_COMPOUNDS mode, ie, since we're going to output the
          compound as well we may as well "try harder" to produce the 2nd best
          segmentation.

          I left the default mode as Mode.SEARCH... maybe if we can somehow
          run some relevance tests we can make the default SEARCH_WITH_COMPOUNDS.
          But it'd also be tricky at query time...

          It looks like the rolling Viterbi is a bit faster (~16%: 1460
          bytes/msec vs 1700 bytes/msec on TestQuality.testSingleText).

          Michael McCandless added a comment -

          Applyable patch.

          Michael McCandless added a comment -

          Attaching a file showing the segmentation change when you run with
          SEARCH_WITH_COMPOUNDS vs SEARCH.

          Left is SEARCH and right is SEARCH_WITH_COMPOUNDS.

           I believe these diffs are due to the forced "best path context" that
           we re-tokenize a given compound word in. I.e., the left/right wordIDs
           were picked because they formed the best path together with this compound
           word, but then we force the second segmentation to also use these words.

          Christian Moen added a comment -

          Mike,

          Thanks a lot for this. I'd meant to comment on this earlier and I'd like to look further into the details, but I really like your idea of running the Viterbi in a streaming fashion.

          Kuromoji originally split input using two punctuation characters as this would be an articulation point in the lattice/graph in practice, but your idea is much more elegant and also faithful to the statistical model.

           As for dealing with compounds, the penalization is a crude hack as you know, but it turns out to work quite well in practice as many of the "decompounds" are known to the statistical model. However, in cases where not all of them are known, we sometimes get wrong decompounds. I've done some analysis of these cases and it's possible to add more heuristics to deal with some that are obviously wrong, such as a word starting with a long vowel sound in katakana. This is a slippery slope that I'm reluctant to pursue...

           Robert mentioned earlier that he believes IPADIC could have been annotated with compounds as the documentation mentions them, but they're not part of the IPADIC model we are using. If it is possible to get the decompounds from the training data (Kyoto Corpus), a better overall approach is then to do regular segmentation (normal mode) and then provide the decompounds directly from the token info for the compounds. We might need to retrain the model and preserve the decompounds in order for this to work, but I think it is worth investigating.

          Robert Muir added a comment -

           Robert mentioned earlier that he believes IPADIC could have been annotated with compounds as the documentation mentions them, but they're not part of the IPADIC model we are using. If it is possible to get the decompounds from the training data (Kyoto Corpus), a better overall approach is then to do regular segmentation (normal mode) and then provide the decompounds directly from the token info for the compounds. We might need to retrain the model and preserve the decompounds in order for this to work, but I think it is worth investigating.

          The dictionary documentation for the original ipadic has the ability to hold compound data (not in mecab-ipadic though, so maybe it was never implemented?!),
          but I don't actually see it in any implementations. So yeah, we would need to find a corpus containing compound information (and of course extend the file format
          and add support to kuromoji) to support that.

          However, would this really solve the total issue? Wouldn't that really only help for known kanji compounds... whereas most katakana compounds
          (e.g. the software engineer example) are expected to be OOV anyway? So it seems like, even if we ensured the dictionary was annotated for
          long kanji such that we always used decompounded forms, we need a 'heuristical' decomposition like search-mode either way, at least for
          the unknown katakana case?

          And I tend to like Mike's improvements from a relevance perspective for these reasons:

          1. keeping the original compound term for improved precision
          2. preventing compound decomposition from having any unrelated negative impact on the rest of the tokenization

          So I think we should pursue this change, even if we want to separately train a dictionary in the future, because in that case,
          we would just disable the kanji decomposition heuristic but keep the heuristic (obviously re-tuned!) for katakana?

          Robert Muir added a comment -

           I've done some analysis of these cases and it's possible to add more heuristics to deal with some that are obviously wrong, such as a word starting with a long vowel sound in katakana. This is a slippery slope that I'm reluctant to pursue...

           I've wondered about that too, at least for the unknown katakana case (though I don't know all the rules that could be applied).

           Adding such heuristics isn't really unprecedented; in a way it's very similar to the compounds/ package (geared towards German, etc.)
           using TeX hyphenation rules to restrict word splits to hyphenation breaks; and similar to the DictionaryBasedBreakIterators in
           ICU/your JRE that use orthographic rules in combination with a dictionary to segment Southeast Asian languages like Thai; and
           not too far from simple rules like "don't separate a base character from any combining characters that follow it" or "don't
           separate a lead surrogate from a trail surrogate" that you would generally use across all languages.

          Christian Moen added a comment -

          I agree completely; we should definitely proceed with Mike's improvements. A big +1 from me.

          I'm sorry if this wasn't clear. My comments on search mode are unrelated and I didn't mean to confuse these with the improvements made. None of this is necessary for Mike's improvements to be used, of course.

          Robert Muir added a comment -

          Well my comments are only questions based on your previous JIRA comments,
          I don't really know anything about decompounding Japanese and my suggestions
          are only intuition, mostly based on patterns that have worked for other
          languages... so just consider it thinking out loud.

           I think your points on search mode are actually related here: though we can't
           really have a plan, it's good to think about possible paths we might take in the
           future so that we don't lock ourselves out.

          Christian Moen added a comment -

          Robert, some comments are below.

          And I tend to like Mike's improvements from a relevance perspective for these reasons:

          1. keeping the original compound term for improved precision
          2. preventing compound decomposition from having any unrelated negative impact on the rest of the tokenization

          Very good points.

          So I think we should pursue this change, even if we want to separately train a dictionary in the future, because in that case, we would just disable the kanji decomposition heuristic but keep the heuristic (obviously re-tuned!) for katakana?

          I agree completely.

          The dictionary documentation for the original ipadic has the ability to hold compound data (not in mecab-ipadic though, so maybe it was never implemented?!), but I don't actually see it in any implementations. So yeah, we would need to find a corpus containing compound information (and of course extend the file format and add support to kuromoji) to support that.

          However, would this really solve the total issue? Wouldn't that really only help for known kanji compounds... whereas most katakana compounds (e.g. the software engineer example) are expected to be OOV anyway? So it seems like, even if we ensured the dictionary was annotated for long kanji such that we always used decompounded forms, we need a 'heuristical' decomposition like search-mode either way, at least for the unknown katakana case?

           I've made an inquiry to a friend who did his PhD work at Prof. Matsumoto's lab at NAIST (where ChaSen was made) regarding compound information and the Kyoto Corpus.

           You are perfectly right that this doesn't solve the complete problem as unknown words can actually be compounds – unknown compounds. The approach used today is basically adding all the potential decompounds the model knows about to the lattice and seeing if a short path can be found, often in combination with an unknown word.

           We get errors such as クイーンズコミックス (Queen's Comics) becoming クイーン ズコミックス (Queen + Zukomikkusu) because クイーン (Queen) is known.

          I'll open up a separate JIRA for discussing search-mode improvements.

          Christian Moen added a comment -

          I left the default mode as Mode.SEARCH... maybe if we can somehow run some relevance tests we can make the default SEARCH_WITH_COMPOUNDS.
          But it'd also be tricky at query time...

          Mike, could you share some details on any query-time trickiness here? Are you thinking about composing a query with both the compound and its parts/decompounds? Thanks.

          Robert Muir added a comment -

          Mike, could you share some details on any query-time trickiness here? Are you thinking about composing a query with both the compound and its parts/decompounds? Thanks.

          I don't understand Mike's comment there either. positionIncrement=0 is really going to be treated like synonyms by the QP,
          so it shouldn't involve any trickiness.

          Michael McCandless added a comment -

          I left the default mode as Mode.SEARCH... maybe if we can somehow run some relevance tests we can make the default SEARCH_WITH_COMPOUNDS. But it'd also be tricky at query time...

          Mike, could you share some details on any query-time trickiness here? Are you thinking about composing a query with both the compound and its parts/decompounds? Thanks.

          Sorry, I was wrong about this! Apparently QueryParser works fine if
          the analyzer produces a graph (it just creates a
          somewhat-scary-yet-works MultiPhraseQuery).
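
           A minimal sketch of what this looks like from the query side; the 3.x
           QueryParser package and the KuromojiAnalyzer constructor taking only a
           Version are assumptions here, not taken from the patch:

           import org.apache.lucene.analysis.kuromoji.KuromojiAnalyzer;
           import org.apache.lucene.queryParser.QueryParser;
           import org.apache.lucene.search.Query;
           import org.apache.lucene.util.Version;

           public class GraphQueryDemo {
             public static void main(String[] args) throws Exception {
               // Hypothetical field name "body"; when decompounding kicks in, the analyzer
               // stacks the compound token on top of its parts at the same positions.
               QueryParser qp = new QueryParser(Version.LUCENE_36, "body",
                   new KuromojiAnalyzer(Version.LUCENE_36));
               Query q = qp.parse("シニアソフトウェアエンジニア");
               // With a graph token stream the parser builds a MultiPhraseQuery rather than failing.
               System.out.println(q.getClass().getSimpleName() + ": " + q);
             }
           }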

          So I think we have no issue at query time... and I guess we should
          default to SEARCH_WITH_COMPOUNDS.

          I agree we can explore better tweaking the decompounding in a new
          issue.

          Michael McCandless added a comment -

          New patch, making Mode.SEARCH always produce a graph output (ie no
          more separate SEARCH_WITH_COMPOUNDS). Mode.NORMAL and Mode.EXTENDED
          are as they were before...

          I think it's ready!

          Christian Moen added a comment -

          Mike,

          Thanks a lot for this. I've been taking this patch for a spin and I'm seeing some exceptions being thrown when indexing some Wikipedia test documents.

          I haven't had a chance to analyze this in detail, but when indexing the attached SolrXml-5498.xml I'm getting:

          java.lang.ArrayIndexOutOfBoundsException: -69
          	at org.apache.lucene.util.RollingCharBuffer.get(RollingCharBuffer.java:98)
          	at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.computePenalty(KuromojiTokenizer.java:236)
          	at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.computeSecondBestThreshold(KuromojiTokenizer.java:227)
          	at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.backtrace(KuromojiTokenizer.java:910)
          	at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.parse(KuromojiTokenizer.java:567)
          	at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.incrementToken(KuromojiTokenizer.java:394)
          	at org.apache.lucene.analysis.kuromoji.TestKuromojiTokenizer.doTestBocchan(TestKuromojiTokenizer.java:567)
          	at org.apache.lucene.analysis.kuromoji.TestKuromojiTokenizer.testBocchan(TestKuromojiTokenizer.java:528)
          	...
          

          I believe you can reproduce this by changing TestKuromojiTokenizer.doTestBocchan() to read the attached SolrXml-5498.xml instead of bocchan.utf-8.

          Michael McCandless added a comment -

          Egads, I'll dig... thanks Christian.

          Michael McCandless added a comment -

          New patch, fixing the exc Christian found: the 2nd best search was corrupting the bestIDX on the backtrace in the case where a compound wasn't selected.

          I also set a limit on the max UNK word length, and pulled the limits into static final private constants.

          I was able to parse the 2012/02/20 jaenwiki export with this (see commented out test case in TestKuromojiTokenizer). I think it's ready (again!).

          Christian Moen added a comment -

          Thanks, Mike.

          I've tried the latest patch and things look fine here now. I haven't had a chance to review the code changes yet, but I'm happy to do that over the next couple of days and commit this – unless you'd like to do that yourself. (This would be my first commit.)

          Michael McCandless added a comment -

          Hi Christian, sure feel free to commit this! Thanks

          Christian Moen added a comment -

          Thanks, Mike. I'll commit this to trunk tomorrow and backport to branch_3x as well.

          Michael McCandless added a comment -

          Thanks Christian!

          Robert Muir added a comment -

          +1

          Christian Moen added a comment -

          Committed to trunk with revision 1296805.

          Will backport to branch_3x and then mark as fixed.

          Christian Moen added a comment -

          I've attached a patch for branch_3x and I'd be very thankful if Robert or Mike could have a quick look before I commit.

          These are some comments and questions:

          1. I haven't backported the test for PositionLengthAttribute since the structure seems very different in branch_3x. Perhaps this is something we could look to as we start using PositionLength in more of the other analyzers?
           2. There will be some minor discrepancies between trunk and branch_3x with regards to Set<?> and CharArraySet types. Perhaps we should use CharArraySet across the board here in the future?
          3. Does the svn:mergeinfo look okay in the patch?

          Tests pass on branch_3x and I've verified that the new feature seems fine in a Solr build as well.

          Michael McCandless added a comment -

          I haven't backported the test for PositionLengthAttribute since the structure seems very different in branch_3x. Perhaps this is something we could look to as we start using PositionLength in more of the other analyzers?

          Hmm you mean TestSimpleAttributeImpl? I think that's fine. We'll
          improve testing of pos length as we make more tokenizers/filters
          graphs...

          Does the svn:mergeinfo look okay in the patch?

          Hmm looks spooky... there was a trick for this... (and I know it's
          hard to merge back since the path changed in trunk). Maybe do svn
          propdel svn:mergeinfo on all affected files, and then at the top level
          you can just do an "svn merge --record-only" to update toplevel
          mergeinfo?

           There will be some minor discrepancies between trunk and branch_3x with regards to Set<?> and CharArraySet types. Perhaps we should use CharArraySet across the board here in the future?

           This was from LUCENE-3765; I think it's OK to require
           CharArraySet for Kuromoji...

          Looks like the added files are not included in the patch (eg
          RollingCharBuffer)... if you're using svn >= 1.7, you can run "svn
          diff --show-copies-as-adds" maybe? Or, use
          dev-tools/scripts/diffSources.py... but I don't think it's
          really necessary before committing...

          Otherwise patch looks great!

          Robert Muir added a comment -

           Patch looks good: KuromojiAnalyzer can take CharArraySet because it's not yet released; the only reason
           it's kinda funky in LUCENE-3765 is to avoid backwards compatibility breaks in 3.x.

          Christian Moen added a comment -

          Thanks for the feedback.

          Mike, you are right that the added files aren't included in the patch, but I believe they will be added on check-in. (I've verified that they're marked for addition, including RollingCharBuffer. Sorry for the confusion.)

          I believe the patch is correct with regards to svn:mergeinfo – only the root directory, lucene and solr should have them.

          I'll commit to branch_3x shortly.

          Christian Moen added a comment -

          Committed revision 1299213 on branch_3x

          Christian Moen added a comment -

          Confirmed this working in a branch_3x build.

          Thanks, Mike and Robert!


            People

            • Assignee:
              Christian Moen
              Reporter:
              Michael McCandless
             • Votes:
               0
               Watchers:
               2
