Lucene - Core / LUCENE-1528

Add support for Ideographic Space to the queryparser - also known as fullwidth space and wide space

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The Ideographic Space is a space character that is as wide as a normal CJK character cell.
      It is also known as wide space or fullwidth space. This type of space is used in CJK languages.

      This patch adds support for the wide space, making the queryparser component more friendly
      to queries that contain CJK text.

      Reference:
      'http://en.wikipedia.org/wiki/Space_(punctuation)' - see Table of spaces, char U+3000.

      I also added a new test case that fails before the patch.
      After the patch is applied, all JUnit tests pass.
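
To illustrate the behavior described above, here is a minimal Java sketch of splitting a query on whitespace that includes U+3000. It is not the patch itself and does not use the Lucene query parser; the class name `WideSpaceDemo` and the helpers `isQueryWhitespace` and `splitTerms` are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WideSpaceDemo {
    // U+3000 IDEOGRAPHIC SPACE, written as an escape so it stays visible in source.
    static final char IDEOGRAPHIC_SPACE = '\u3000';

    // Hypothetical helper: the characters treated as separators between query terms.
    static boolean isQueryWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == IDEOGRAPHIC_SPACE;
    }

    // Hypothetical helper: splits a raw query string into terms on the separators above.
    static List<String> splitTerms(String query) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (isQueryWhitespace(c)) {
                // A separator ends the current term, if there is one.
                if (current.length() > 0) {
                    terms.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Two CJK words separated by an ideographic space, given as Unicode escapes.
        String query = "\u6771\u4EAC\u3000\u5927\u962A";
        System.out.println(splitTerms(query).size() + " terms: " + splitTerms(query));
    }
}
```

If U+3000 were left out of `isQueryWhitespace`, the same input would come back as a single term, which is the kind of difference the new test case is meant to catch.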

        Activity

        Luis Alves added a comment -

        LUCENE-1528 - Add support for Ideographic Space to the queryparser

        Michael Busch added a comment -

        Looks good, Luis!

        I was just wondering if you can do something like the following to avoid defining the whitespace chars in two places:

        | <#_WHITESPACE: ( " " | "\t" | "\n" | "\r") >
        | <#_TERM_START_CHAR: ( ~( <_WHITESPACE> | [ "+", "-", "!", "(", ")", ":", "^",
                                   "[", "]", "\"", "{", "}", "~", "*", "?", "\\" ])
                               | <_ESCAPED_CHAR> ) >
        

        This does not compile... is there another way to achieve this in javacc?
        If not, it's not a big deal and I can commit this patch as is.

        Luis Alves added a comment -

        Hi Michael,

        I checked the book "Generating Parsers with JavaCC" and the grammar documentation on the JavaCC website (https://javacc.dev.java.net/doc/javaccgrm.html).
        Here is the syntax for a character list:

        character_list ::= [ "~" ] "[" [ character_descriptor ( "," character_descriptor )* ] "]"
        character_descriptor ::= java_string_literal [ "-" java_string_literal ]

        Also, the '|' character in JavaCC syntax is used like an XOR, and as far as I'm aware there is no OR or AND operator in the JavaCC syntax.
        So the expression <_WHITESPACE> | [ " ", ... ] would have to look like ~(<_WHITESPACE> & [ " ", ... ]), but this is not possible in JavaCC grammar.

        So I think the best option for now is to keep the current syntax.

        If you like, I can change

        <#_WHITESPACE: ( " " | "\t" | "\n" | "\r") >

        to a character_list to make it more consistent, but that would not help to remove the duplicated list of characters.

        <#_WHITESPACE: [ " ", "\t", "\n", "\r" ] >
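
As a rough illustration of what the negated character list expresses, the following Java sketch writes the same membership test with the whitespace characters defined in one place. It is only an analogy in plain Java, not a JavaCC workaround. The class name `TermStartCharDemo` and its fields are hypothetical, including U+3000 in the whitespace set is an assumption based on the issue description, and the `<_ESCAPED_CHAR>` alternative is deliberately left out.

```java
public class TermStartCharDemo {
    // Hypothetical single definition of the whitespace characters.
    // Including U+3000 here is an assumption based on the issue description.
    static final String WHITESPACE = " \t\n\r\u3000";

    // Special characters excluded from the start of a term,
    // copied from the grammar fragment quoted in this thread.
    static final String SPECIAL = "+-!():^[]\"{}~*?\\";

    // Mirrors the negated list: any character that is neither whitespace nor special.
    // The <_ESCAPED_CHAR> alternative of the grammar is not modeled here.
    static boolean isTermStartChar(char c) {
        return WHITESPACE.indexOf(c) < 0 && SPECIAL.indexOf(c) < 0;
    }

    public static void main(String[] args) {
        System.out.println("'a' can start a term: " + isTermStartChar('a'));
        System.out.println("'+' can start a term: " + isTermStartChar('+'));
        System.out.println("U+3000 can start a term: " + isTermStartChar('\u3000'));
    }
}
```

In Java the two sets can be combined freely because they are ordinary values. In the grammar, a character list only accepts string literals and ranges, which is why the whitespace characters end up listed twice.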

        Michael Busch added a comment -

        Committed revision 738592.

        Thanks, Luis.


          People

          • Assignee: Michael Busch
          • Reporter: Luis Alves
          • Votes: 0
          • Watchers: 1


              Time Tracking

              • Original Estimate: 4h
              • Remaining Estimate: 4h
              • Time Spent: Not Specified
