SOLR-3524

Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.6
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve punctuation in Japanese text, although it has a parameter to change this behavior. JapaneseTokenizerFactory always sets the third parameter, which controls this behavior, to true so that punctuation is removed.
      I would like an option to configure this behavior in the fieldtype definition in schema.xml.
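
      As a rough illustration of the requested option, a fieldtype definition might look like the sketch below. It assumes the attribute is named discardPunctuation, as in released versions of JapaneseTokenizerFactory; the thread itself does not spell out the name. The fieldType name text_ja_punct and the mode value are hypothetical choices for this example.

```xml
<!-- Sketch only: the fieldType name "text_ja_punct" is a hypothetical example. -->
<fieldType name="text_ja_punct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- discardPunctuation is assumed to default to true (punctuation removed);
         setting it to false asks the tokenizer to keep punctuation tokens. -->
    <tokenizer class="solr.JapaneseTokenizerFactory"
               mode="search"
               discardPunctuation="false"/>
  </analyzer>
</fieldType>
```

      Leaving the attribute out should keep the existing behavior, so fields that already use the tokenizer would not need to change.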

      1. kuromoji_discard_punctuation.patch.txt (1 kB, Jun Ohtani)
      2. SOLR-3524.patch (5 kB, Christian Moen)
      3. SOLR-3524.patch (5 kB, Christian Moen)

        Activity

        Jun Ohtani added a comment -

        I created a patch, but it has no tests yet.

        Christian Moen added a comment -

        Hiraga-san, there are different views on how punctuation characters are best handled by tokenizers. Punctuation characters generally don't convey much meaning useful for text search, so they are generally removed in Lucene. (A different point of view is that tokenizers shouldn't remove punctuation and that filters should do this.)

        The ability to keep punctuation was left as an expert feature in JapaneseTokenizer, and I think we can expose this as an expert feature in Solr as well. Could you share some details on your use case so that I get a better idea of the background and importance of this?

        Christian Moen added a comment -

        Ohtani-san, thanks for the patch!

        I've tried it on trunk, and applying it fails because an InitializationException is thrown instead of a SolrException. I'll correct this shortly.

        We also need some tests here...

        Jun Ohtani added a comment -

        Hi Christian,

        Sorry, I created the patch against version 3.6.0.

        Christian Moen added a comment -

        No trouble. I'll provide a new patch shortly for trunk and branch_4x with a test as well.

        Christian Moen added a comment -

        New patch with tests and documentation changes attached.

        Kazuaki Hiraga added a comment -

        Thank you guys!
        Christian, some documents have keywords that consist of letters and punctuation, such as c++ and c#, and we want to match those keywords in their unchanged form. Of course, we will discard punctuation in many cases, but in some cases, especially with short text, we want to preserve it. Therefore, I want an option that lets me control this behaviour.

        Ohtani-san, thank you for your early reply and patch!

        Christian Moen added a comment -

        I'll commit this to trunk and branch_4x soon.

        Christian Moen added a comment -

        Patch updated due to recent configuration changes.

        Christian Moen added a comment -

        Committed revision 1360592 on trunk

        Christian Moen added a comment -

        Committed revision 1360613 on branch_4x

        Christian Moen added a comment -

        Thanks, Kazu and Ohtani-san!

        Christian Moen added a comment -

        CHANGES.txt for some reason didn't make it into branch_4x. Fixed this in revision 1360622.


          People

          • Assignee: Christian Moen
          • Reporter: Kazuaki Hiraga
          • Votes: 0
          • Watchers: 3
