Lucene - Core
LUCENE-3940

When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, 3.6.1
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I modified BaseTokenStreamTestCase to assert that the start/end
      offsets match for graph (posLen > 1) tokens, and this caught a bug in
      Kuromoji when the decompounding of a compound token has a punctuation
      token that's dropped.

      In this case we should leave hole(s) so that the graph is intact, i.e.,
      the graph should look the same as if the punctuation tokens were not
      initially removed, but then a StopFilter had removed them.

      This also affects tokens that have no compound over them, i.e. we fail
      to leave a hole today when we remove the punctuation tokens.

      I'm not sure this is serious enough to warrant fixing in 3.6 at the
      last minute...
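
      To illustrate what "leaving a hole" means here, below is a minimal
      sketch (not the actual patch; the filter and its isPunctuation test are
      illustrative only) of a TokenFilter that drops punctuation tokens but
      preserves their positions by carrying the skipped position increments
      over to the next kept token, the way StopFilter does:

        // Minimal sketch, not the actual patch: drop punctuation tokens but
        // leave holes by adding the skipped position increments to the next
        // kept token. (A complete version would also handle trailing skipped
        // positions in end(); omitted for brevity.)
        import java.io.IOException;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

        public final class DropPunctuationLeavingHoleFilter extends TokenFilter {
          private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
          private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

          public DropPunctuationLeavingHoleFilter(TokenStream in) {
            super(in);
          }

          @Override
          public boolean incrementToken() throws IOException {
            int skippedPositions = 0;
            while (input.incrementToken()) {
              if (isPunctuation(termAtt)) {
                // drop the token, but remember the positions it occupied
                skippedPositions += posIncAtt.getPositionIncrement();
              } else {
                // leave a hole: the kept token's increment includes the skipped positions
                posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
                return true;
              }
            }
            return false;
          }

          // crude placeholder test; Kuromoji's real isPunctuation() is character-class based
          private static boolean isPunctuation(CharSequence term) {
            return term.length() == 1 && !Character.isLetterOrDigit(term.charAt(0));
          }
        }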

      1. LUCENE-3940.patch
        67 kB
        Michael McCandless
      2. LUCENE-3940.patch
        70 kB
        Michael McCandless
      3. LUCENE-3940.patch
        67 kB
        Michael McCandless
      4. LUCENE-3940.patch
        56 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch; I think it's ready, except I need to make a simpler test case than that curious new string....

        Michael McCandless added a comment -

        New patch, fixing a bug in the last one and adding a few more test cases. I also made the "print curious string on exception" output from BTSTC more ASCII-friendly.

        I think it's ready.

        Robert Muir added a comment -

        I don't think we should do this. StandardTokenizer doesn't leave holes when it drops punctuation,
        I think holes should only be left for real 'words' for the most part.

        Michael McCandless added a comment -

        StandardTokenizer doesn't leave holes when it drops punctuation,

        But is that really good?

        This means a PhraseQuery will match across end-of-sentence (.),
        semicolon, colon, comma, etc. (English examples..).

        I think tokenizers should throw away as little information as
        possible... we can always filter out such tokens in a later stage?

        For example, if a tokenizer created punct tokens (instead of silently
        discarding them), other token filters could make use of them in the
        meantime, e.g. a synonym rule for "u.s.a. -> usa" or maybe a dedicated
        English "acronyms" filter. We could then later filter them out, even
        not leaving holes, and have the same behavior that we have now?

        Are there non-English examples where you would want the PhraseQuery to
        match over punctuation...? E.g., for Japanese, I assume we don't want
        PhraseQuery applying across periods/commas, like it will now? (Not
        sure about middle dot...? Others...?).
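
        To make the PhraseQuery point concrete, here is a small illustration
        (hypothetical positions and field name, 3.x/4.x-era PhraseQuery API):

          // Hypothetical positions for "See Spot. Run fast." (illustrative only):
          //   no hole:   see=0  spot=1  run=2  fast=3  -> "spot run" matches exactly
          //   with hole: see=0  spot=1  run=3  fast=4  -> "spot run" needs slop >= 1
          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.PhraseQuery;

          static PhraseQuery spotRunQuery() {
            PhraseQuery pq = new PhraseQuery();   // default slop is 0 (exact phrase)
            pq.add(new Term("body", "spot"), 0);
            pq.add(new Term("body", "run"), 1);   // matches only if "run" is exactly one position after "spot"
            return pq;
          }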

        Michael McCandless added a comment -

        New test-only patch, breaking out the non-controversial (I think!)
        part of the patch.

        With this new patch, Kuromoji still silently discards punctuation
        (just like StandardAnalyzer), but at least we get better test coverage
        in BTSTC to verify graph tokens are not messing up their offsets.

        I had to turn it off when testing Kuromoji w/ punctuation
        removal... but it's still tested w/o punctuation removal, so I think
        it'd likely catch any bugs in how Kuromoji sets offsets of the
        compound tokens... at least it's better than not checking at all
        (ie, today).

        The only non-tests-only change is I uncommented an assert in Kuromoji;
        I think it's a valid assert.
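
        As a rough idea of the kind of invariant being checked (a simplified
        sketch, not the actual BaseTokenStreamTestCase code): tokens that start
        at the same graph position must share a start offset, and tokens that
        end at the same position must share an end offset:

          import java.io.IOException;
          import java.util.HashMap;
          import java.util.Map;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

          static void checkGraphOffsets(TokenStream ts) throws IOException {
            PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
            PositionLengthAttribute posLenAtt = ts.addAttribute(PositionLengthAttribute.class);
            OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
            Map<Integer,Integer> startOffsets = new HashMap<Integer,Integer>();
            Map<Integer,Integer> endOffsets = new HashMap<Integer,Integer>();
            ts.reset();
            int pos = -1;
            while (ts.incrementToken()) {
              pos += posIncAtt.getPositionIncrement();
              int endPos = pos + posLenAtt.getPositionLength();
              Integer prevStart = startOffsets.put(pos, offsetAtt.startOffset());
              assert prevStart == null || prevStart.intValue() == offsetAtt.startOffset();
              Integer prevEnd = endOffsets.put(endPos, offsetAtt.endOffset());
              assert prevEnd == null || prevEnd.intValue() == offsetAtt.endOffset();
            }
            ts.end();
            ts.close();
          }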

        Robert Muir added a comment -

        I think Kuromoji has the problem: here it creates 'fake' intermediate 'tokens' and then deletes them, and this somehow screws up the decompounding graph???

        It should never create these tokens in the first place! I think it's well accepted that words carry the information content of a doc; punctuation really has no information content here, it doesn't tell me what the doc is about, and I don't think this is controversial, I just think your view on this is extreme...

        Michael McCandless added a comment -

        Here's an example where we create a compound token with punctuation.

        I got this from the Japanese Wikipedia export, with our MockCharFilter
        sometimes doubling characters: we are at a position that has the
        characters 〇〇'''、''' after it... that 〇 is this Unicode character
        http://www.fileformat.info/info/unicode/char/3007/index.htm

        When Kuromoji extends from this position, both 〇 and 〇〇 are KNOWN,
        but then we also extend by unknown 〇〇'''、''' (i.e., 〇〇 plus only
        punctuation):

        Note that 〇 is not considered punctuation by Kuromoji's isPunctuation
        method...

          + UNKNOWN word 〇〇'''、''' toPos=41 cost=21223 penalty=3400 toPos.idx=0
          + KNOWN word 〇〇 toPos=34 cost=9895 penalty=0 toPos.idx=0
          + KNOWN word 〇 toPos=33 cost=2766 penalty=0 toPos.idx=0
          + KNOWN word 〇 toPos=33 cost=5256 penalty=0 toPos.idx=1
        

        And then on backtrace we make a compound token (UNKNOWN) for all of
        "〇〇'''、'''", while the decompounded path keeps two separate "〇"
        tokens but drops the '''、''' since it's all punctuation, thus
        creating inconsistent offsets.

        Steve Rowe added a comment -

        I think it's well accepted that words carry the information content of a doc; punctuation really has no information content here, it doesn't tell me what the doc is about, and I don't think this is controversial, I just think your view on this is extreme...

        I disagree with you, Robert. (If punctuation has no information content, why does it exist?) IMHO Mike's examples are not at all extreme, e.g. some punctuation tokens could be used to trigger position increment gaps.

        StandardTokenizer doesn't leave holes when it drops punctuation, I think holes should only be left for real 'words' for the most part

        "Standard"Tokenizer is drawn from Unicode UAX#29, which only describes word boundaries. Lucene has grafted onto these boundary rules an assumption that only alphanumeric "words" should be tokens - this assumption does not exist in the standard itself.

        My opinion is that we should have both types of things: a tokenizer that discards non-alphanumeric characters between word boundaries; and a different type of analysis component that discards nothing. I think of the discard-nothing process as segmentation rather than tokenization, and I've argued for it previously.

        Michael McCandless added a comment -

        OK here's one possible fix...

        Right now, when we are glomming up an UNKNOWN token, we glom only as long as the character class of each character is the same as the first character.

        What if we also require that isPunct-ness is the same? That way we would never create an UNKNOWN token mixing punct and non-punct...

        I implemented that and the tests seem to pass w/ offset checking fully turned on again...
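
        An illustrative sketch of that condition (the helper names below are
        assumed stand-ins for Kuromoji's real character-class and punctuation
        checks; this is not the actual patch):

          /** Length of the UNKNOWN run starting at pos, never mixing
           *  punctuation with non-punctuation. */
          static int unknownWordLength(char[] text, int pos, int end) {
            final int firstClass = getCharacterClass(text[pos]);    // assumed helper
            final boolean firstIsPunct = isPunctuation(text[pos]);  // assumed helper
            int length = 1;
            while (pos + length < end) {
              final char ch = text[pos + length];
              if (getCharacterClass(ch) != firstClass || isPunctuation(ch) != firstIsPunct) {
                break;  // character class or punct-ness changed: stop glomming
              }
              length++;
            }
            return length;
          }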

        Michael McCandless added a comment -

        New patch, implementing the idea above. I think it's ready...

        It also means we can unconditionally check for correct offsets in
        graph tokens, which is nice.

        Robert Muir added a comment -

        I disagree with you, Robert. (If punctuation has no information content, why does it exist?) IMHO Mike's examples are not at all extreme, e.g. some punctuation tokens could be used to trigger position increment gaps.

        Punctuation simply doesn't tell you anything about the document: this is fact. If we start indexing punctuation we just create useless terms that go to every document.

        Because of this, nobody wastes their time trying to figure out how to index "punctuation tokens". Mike's problem is basically the fact that he is creating a compound token of '??'.

        Furthermore, the idea that 'if we don't leave a hole for anything removed, we are losing information' is totally arbitrary, confusing, and inconsistent anyway. How come we leave holes for definiteness in English but not for plurals in English, but in Arabic or Bulgarian we don't leave holes for definiteness, because it happens to be attached to the word and stemmed?

        Christian Moen added a comment -

        I'm not familiar with the various considerations that were made with StandardTokenizer, but please allow me to share some comments anyway.

        Perhaps it's useful to distinguish between analysis for information retrieval and analysis for information extraction here?

        I like Michael's and Steven's idea of doing tokenization that doesn't discard any information. This is certainly useful in the case of information extraction. For example, if we'd like to extract noun-phrases based on part-of-speech tags, we don't want to conjoin tokens in case there's a punctuation character between two nouns (unless that punctuation character is a middle dot).

        Robert is of course correct that we generally don't want to index punctuation characters that occur in every document, so from an information retrieval point of view, we'd like punctuation characters removed.

        If there's an established convention that Tokenizer variants discard punctuation and produce the terms that are meant to be directly searchable, it sounds like a good idea that we stick to the convention here as well.

        If there's no established convention, it seems useful for a Tokenizer to provide as much detail as possible about the input text and leave downstream Filters/Analyzers to remove whatever is suitable for a particular processing purpose. We can provide common ready-to-use Analyzers with reasonable defaults that users can turn to, e.g. to process a specific language or do another common high-level task with text.

        Hence, perhaps each Tokenizer can decide what makes the most sense to do based on that particular tokenizer's scope of processing?

        To Robert's point, this would leave processing totally arbitrary and inconsistent, but this would be by design, as it wouldn't be the Tokenizer's role to enforce any overall consistency (e.g. with regard to punctuation); higher-level Analyzers would provide that.

        Thoughts?

        Robert Muir added a comment -

        Perhaps it's useful to distinguish between analysis for information retrieval and analysis for information extraction here?

        Yes; since we are an information retrieval library, there is no sense in adding traps and complexity to our analysis API.

        I like Michael's and Steven's idea of doing tokenization that doesn't discard any information.

        For IR, this is definitely not information... calling it data is a stretch.

        If there's an established convention that Tokenizer variants discard punctuation and produce the terms that are meant to be directly searchable, it sounds like a good idea that we stick to the convention here as well.

        That's what the tokenizers do today: they find tokens (in the IR sense). So yes, there is an established
        convention already. Changing this would be a monster trap, because suddenly tons of people would be
        indexing tons of useless punctuation. I would strongly oppose such a change.

        Robert Muir added a comment -

        This is certainly useful in the case of information extraction. For example, if we'd like to extract noun-phrases based on part-of-speech tags, we don't want to conjoin tokens in case there's a punctuation character between two nouns (unless that punctuation character is a middle dot).

        The option still exists in Kuromoji (discardPunctuation=false) if you want to use it for this.
        I added this option because it originally kept the punctuation (for downstream filters to remove).

        lucene-gosen worked the same way, and after some time I saw numerous examples across the internet
        where people simply configured the tokenizer without any filters, which meant huge amounts of
        punctuation being indexed by default. People don't pay attention to documentation or details,
        so all these people were getting bad performance.

        Based on this experience, I didn't want keeping punctuation to be the default, nor even achievable
        via things like Solr factories here. But I added the (expert) option to Kuromoji because it's really
        a more general-purpose thing for Japanese analysis, because it's already being used for other things,
        and because allowing a boolean was not expensive or complex to support.

        But I don't consider this a bona fide option of the Lucene APIs; I would be strongly against adding
        this to the Solr factories, or as an option to KuromojiAnalyzer. And I don't think we should add such
        a thing to other tokenizers either.

        Our general mission is search; if we want to decide that we are expanding our analysis API to be generally
        useful outside of information retrieval, that's a much bigger decision with more complex tradeoffs that
        everyone should vote on (e.g. moving analyzers outside of lucene.apache.org and everything).
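
        For reference, a usage sketch of that expert option (this assumes the
        4.x JapaneseTokenizer constructor taking a discardPunctuation flag; in
        3.6 the class is named KuromojiTokenizer, and the sample text is just
        an illustration):

          import java.io.IOException;
          import java.io.StringReader;
          import org.apache.lucene.analysis.Tokenizer;
          import org.apache.lucene.analysis.ja.JapaneseTokenizer;
          import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          static void printTokensKeepingPunctuation() throws IOException {
            Tokenizer tok = new JapaneseTokenizer(
                new StringReader("今日、学校に行きました。"),
                null,           // no user dictionary
                false,          // discardPunctuation=false: punctuation tokens are kept
                Mode.SEARCH);
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
              System.out.println(term.toString());  // punctuation such as 、 and 。 comes through
            }
            tok.end();
            tok.close();
          }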

        Uwe Schindler added a comment -

        Bulk close for 3.6.1


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes:
            0
          • Watchers:
            2
