Lucene - Core
LUCENE-4201

Add Japanese character filter to normalize iteration marks

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA, 6.0
    • Fix Version/s: 4.0-BETA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      For some applications it might be useful to normalize kanji and kana iteration marks such as 々, ゞ, ゝ, ヽ and ヾ to make sure they are treated uniformly.
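      A minimal usage sketch (assuming the char filter added here is org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter and that it simply wraps a Reader; this example is illustrative, not taken from the patch):

{code:java}
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter;

public class IterationMarkDemo {
  public static void main(String[] args) throws Exception {
    // 時々 uses the kanji iteration mark; こゝろ uses the hiragana mark
    Reader input = new StringReader("時々、こゝろ");
    Reader filtered = new JapaneseIterationMarkCharFilter(input);

    char[] buf = new char[64];
    int n = filtered.read(buf);
    System.out.println(new String(buf, 0, n)); // expected: 時時、こころ
  }
}
{code}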

      1. LUCENE-4201.patch
        36 kB
        Christian Moen
      2. LUCENE-4201.patch
        35 kB
        Christian Moen
      3. LUCENE-4201.patch
        33 kB
        Christian Moen
      4. LUCENE-4201.patch
        30 kB
        Robert Muir
      5. LUCENE-4201.patch
        29 kB
        Christian Moen

        Activity

        Christian Moen added a comment -

        Patch attached.

        Christian Moen added a comment -

        Sequences of iteration marks are supported. If an illegal sequence of iteration marks is encountered, the implementation emits the offending source character as-is without considering its script. For example, with input "?ゝ", we get "??" even though "?" isn't hiragana.

        Note that the full stop punctuation character "。" (U+3002) cannot be iterated (see below). Iteration marks themselves can be emitted as-is when they are illegal, i.e. when they refer back past the beginning of the character stream.

        The implementation buffers input only until a full stop punctuation character (U+3002) or EOF is reached, so that it never needs to keep a copy of the entire character stream in memory. Vertical iteration marks, which are even rarer than horizontal iteration marks in contemporary Japanese, are unsupported.
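        To illustrate the kana voicing rule (a minimal sketch, not the committed implementation; the real filter also handles kanji and katakana marks, mark sequences, and kana without voiced forms):

{code:java}
public class KanaIterationSketch {
  // Voiced hiragana whose unvoiced form is one code point lower (e.g. が = か + 1)
  private static final String VOICED = "がぎぐげござじずぜぞだぢづでどばびぶべぼ";

  // Repeat the previous kana, adding or stripping the dakuten as the mark requires
  static char repeatKana(char prev, boolean voicedMark) {
    boolean prevVoiced = VOICED.indexOf(prev) >= 0;
    if (voicedMark) {
      return prevVoiced ? prev : (char) (prev + 1); // approximate: not every kana has a voiced form
    }
    return prevVoiced ? (char) (prev - 1) : prev;
  }

  public static void main(String[] args) {
    System.out.println(repeatKana('こ', false)); // こゝろ -> こころ, prints こ
    System.out.println(repeatKana('す', true));  // みすゞ -> みすず, prints ず
  }
}
{code}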

        Christian Moen added a comment -

        I've indexed the Japanese Wikipedia using this filter and things look okay. I'm seeing a ~8% performance overhead (versus no filter).

        My thinking is that this filter should be available for applications that need it, but it should not be part of our default Japanese configuration.
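        One way an application could opt in is to wrap the reader in a custom Analyzer (a sketch against the 4.x Analyzer API; the JapaneseTokenizer constructor arguments here are assumptions, not taken from this patch):

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class IterationMarkAnalyzer extends Analyzer {
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Normalize iteration marks before tokenization
    return new JapaneseIterationMarkCharFilter(reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer =
        new JapaneseTokenizer(reader, null, true, JapaneseTokenizer.Mode.SEARCH);
    return new TokenStreamComponents(tokenizer);
  }
}
{code}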

        Robert Muir added a comment -

        Updated patch with some additional tests. For CharFilters it's useful to use MockTokenizer + checkRandomData because MockTokenizer has a lot of asserts.

        This fails sometimes: the first failure I hit was the valid-Unicode assert in MockTokenizer. I think sometimes we might be doubling a high or low surrogate? A simple workaround would be to never double surrogates.
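        A sketch of what such a test might look like (the analyzer wiring is illustrative; checkRandomData, MockTokenizer, and RANDOM_MULTIPLIER are from Lucene's test framework):

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter;

public class TestIterationMarkRandom extends BaseTokenStreamTestCase {
  public void testRandomStrings() throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        return new JapaneseIterationMarkCharFilter(reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // MockTokenizer asserts consumer correctness, e.g. that output is valid Unicode
        Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
        return new TokenStreamComponents(tokenizer);
      }
    };
    // Feeds random text (including surrogate pairs) through the filter chain
    checkRandomData(random(), analyzer, 1000 * RANDOM_MULTIPLIER);
  }
}
{code}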

        Robert Muir added a comment -

        I've indexed the Japanese Wikipedia using this filter and things look okay. I'm seeing a ~8% performance overhead (versus no filter).

        Beware of LUCENE-4185 here; it might not be so bad (I just put a patch up there).

        Robert Muir added a comment -

        Also, do we need to worry about offset tests? Does this filter need to do any offset corrections? (It seems it does not, which would be nice.)

        Christian Moen added a comment -

        Thanks a lot, Robert.

        I'll look into the random checks and a couple of other things as well.

        Christian Moen added a comment -

        We shouldn't need any offset corrections since we never add or remove characters (we just replace them).
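        In CharFilter terms (a sketch, assuming the standard correct(int) hook), a one-for-one replacement means the offset correction is the identity:

{code:java}
// Inside the CharFilter subclass: input and output offsets coincide because
// characters are only ever replaced one-for-one, never inserted or deleted.
@Override
protected int correct(int currentOff) {
  return currentOff;
}
{code}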

        Robert Muir added a comment -

        I thought that might be the case: when I first wrote the tests I used JapaneseAnalyzer
        and they always passed... So I think this is just the one corner case that MockTokenizer finds.

        Not correcting offsets keeps things simple: if possible, I think we should just not do
        anything with iteration marks that follow surrogates and leave them as-is; actually
        replacing the iteration mark with a surrogate pair would require offset corrections.
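        A sketch of the "never double surrogates" guard (illustrative; canIterate is a hypothetical helper, not from the patch):

{code:java}
// Hypothetical guard (not from the patch): an iteration mark is only expanded
// when the preceding char is a full (non-surrogate) code unit; otherwise the
// mark is emitted unchanged, so no surrogate is ever doubled and offsets stay 1:1.
static boolean canIterate(char prev) {
  return !Character.isHighSurrogate(prev) && !Character.isLowSurrogate(prev);
}
{code}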

        Christian Moen added a comment -

        Thanks, Robert.

        I've attached a new patch that deals with surrogates, and I've also fixed a couple of other issues found by further testing.

        Christian Moen added a comment -

        Added additional Solr factory tests to test parameters. I think it's ready.
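        For reference, a sketch of the two parameters being exercised (assuming they correspond to a (Reader, normalizeKanji, normalizeKana) constructor, which the Solr factory presumably exposes as options):

{code:java}
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter;

public class IterationMarkParams {
  public static void main(String[] args) throws Exception {
    // Normalize only kana iteration marks; leave the kanji mark 々 untouched
    Reader filtered = new JapaneseIterationMarkCharFilter(
        new StringReader("時々、こゝろ"),
        false,  // normalizeKanji
        true);  // normalizeKana

    char[] buf = new char[64];
    int n = filtered.read(buf);
    System.out.println(new String(buf, 0, n)); // expected: 時々、こころ
  }
}
{code}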

        Robert Muir added a comment -

        Patch looks great. +1 to commit.

        Christian Moen added a comment -

        Thanks, Robert. Attached final patch with CHANGES.txt details.

        Christian Moen added a comment -

        Committed revision 1359613 on trunk

        Christian Moen added a comment -

        Added svn:eol-style native to trunk with revision 1359632.

        Christian Moen added a comment -

        Committed revision 1359645 on branch_4x

        Hoss Man added a comment -

        hoss20120711-manual-post-40alpha-change

        Hoss Man added a comment -

        bah .. wrong textbox


          People

          • Assignee:
            Christian Moen
          • Reporter:
            Christian Moen
          • Votes:
            0
          • Watchers:
            1
