Solr
  1. Solr
  2. SOLR-1204

Enhance SpellingQueryConverter to handle UTF-8 instead of ASCII only

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: spellchecker
    • Labels:
      None

      Description

      Solr - User - SpellCheckComponent: queryAnalyzerFieldType
      http://www.nabble.com/SpellCheckComponent%3A-queryAnalyzerFieldType-td23870668.html

      In the above thread, it was suggested to extend the SpellingQueryConverter to cover the full UTF-8 range instead of handling US-ASCII only. This might be as simple as changing the regular expression used to tokenize the input string to accept a sequence of one or more Unicode letters ( \p

      {L}

      + ) instead of a sequence of one or more word characters ( \w+ ).

      See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html for Java regular expression reference.

      1. SpellingQueryConverter.java.diff
        0.5 kB
        Michael Ludwig
      2. SpellingQueryConverter.java.diff
        0.5 kB
        Michael Ludwig

        Issue Links

          Activity

          Hide
          Michael Ludwig added a comment -

          A patch replacing the regular expression used to tokenize the input string as per the description.

          Show
          Michael Ludwig added a comment - A patch replacing the regular expression used to tokenize the input string as per the description.
          Hide
          Shalin Shekhar Mangar added a comment -

          I know people don't usually use non-ASCII characters in field names but shouldn't we replace the
          w, before the colon, too for completeness?

          Show
          Shalin Shekhar Mangar added a comment - I know people don't usually use non-ASCII characters in field names but shouldn't we replace the w, before the colon, too for completeness?
          Hide
          Michael Ludwig added a comment -

          What is a legal field name? I assumed it was ASCII only, starting with a letter or underscore, i.e. a typical symbol name.

          Show
          Michael Ludwig added a comment - What is a legal field name? I assumed it was ASCII only, starting with a letter or underscore, i.e. a typical symbol name.
          Hide
          Shalin Shekhar Mangar added a comment -

          What is a legal field name?

          Any UTF-8 string will do actually. There were discussions on limiting valid field names to valid Java identifiers which are also allowed to have UTF-8 characters.

          Show
          Shalin Shekhar Mangar added a comment - What is a legal field name? Any UTF-8 string will do actually. There were discussions on limiting valid field names to valid Java identifiers which are also allowed to have UTF-8 characters.
          Hide
          Michael Ludwig added a comment -

          Update of the previous patch to also allow any valid UTF-8 character in the field name, not only in the query string.

          Show
          Michael Ludwig added a comment - Update of the previous patch to also allow any valid UTF-8 character in the field name, not only in the query string.
          Hide
          Shalin Shekhar Mangar added a comment -

          Committed revision 781975.

          Thanks Michael!

          Show
          Shalin Shekhar Mangar added a comment - Committed revision 781975. Thanks Michael!
          Hide
          Michael Ludwig added a comment -

          Sorry for being inexact in my last comment. I wrote "any valid UTF-8 character" when I meant "any valid UTF-8 letter", which is what the patch contains, and which isn't equivalent to what I wrote.

          In order to produce a correct patch, I need to know what are legal field names. It can hardly be "any UTF-8 string" as that will also contain the colon, which is already used to delimit field names from query strings. What about digits? Asterisk? Dash (minus)? Underscore? Space? Tabulator?

          My original idea was to allow letters, digits and underscore. But this may not be sufficient.

          Show
          Michael Ludwig added a comment - Sorry for being inexact in my last comment. I wrote "any valid UTF-8 character" when I meant "any valid UTF-8 letter", which is what the patch contains, and which isn't equivalent to what I wrote. In order to produce a correct patch, I need to know what are legal field names. It can hardly be "any UTF-8 string" as that will also contain the colon, which is already used to delimit field names from query strings. What about digits? Asterisk? Dash (minus)? Underscore? Space? Tabulator? My original idea was to allow letters, digits and underscore. But this may not be sufficient.
          Hide
          Robert Muir added a comment -

          hi, michael. Its not for some languages. I recommend using this definition of 'identifier' from UTR #8:

          <identifier> ::= <identifier_start > ( <identifier_start> | <identifier_extend> )*
          <identifier_start> Lu,Ll,Lt,Lm,Lo,Nl Uppercase letter, Lowercase letter, Titlecase letter, Modifier letter, Other letter, Letter number
          <identifier_extend> Mn,Mc,Nd,Pc,Cf Non-spacing mark, Spacing combining mark, Decimal number, Connector punctuation, Formatting code
          <ident_ignorable_char> Cf Formatting code

          Thanks!

          Show
          Robert Muir added a comment - hi, michael. Its not for some languages. I recommend using this definition of 'identifier' from UTR #8: <identifier> ::= <identifier_start > ( <identifier_start> | <identifier_extend> )* <identifier_start> Lu,Ll,Lt,Lm,Lo,Nl Uppercase letter, Lowercase letter, Titlecase letter, Modifier letter, Other letter, Letter number <identifier_extend> Mn,Mc,Nd,Pc,Cf Non-spacing mark, Spacing combining mark, Decimal number, Connector punctuation, Formatting code <ident_ignorable_char> Cf Formatting code Thanks!
          Hide
          Michael Ludwig added a comment -

          Hi Robert, it looks like you're talking about this report:
          http://unicode.org/reports/tr8/

          The XML recommendation might also serve as a point of reference, although both a NAME and a NMTOKEN are probably too liberal in what they accept (the colon among others):
          http://www.w3.org/TR/REC-xml/#NT-Name

          Shalin mentioned Java identifiers, probably without intending to ban names such as "null" and "class":
          http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#40625

          I'm not Unicode-savvy enough to know what's behind all these classes, and I'm not Solr-savvy enough to know which of the punctuation characters that have special meaning in the Solr query language should or must be excluded. So first the spec, then the implementation (regular expression, or some schema technology to validate the field names in schema.xml, or whatever).

          Show
          Michael Ludwig added a comment - Hi Robert, it looks like you're talking about this report: http://unicode.org/reports/tr8/ The XML recommendation might also serve as a point of reference, although both a NAME and a NMTOKEN are probably too liberal in what they accept (the colon among others): http://www.w3.org/TR/REC-xml/#NT-Name Shalin mentioned Java identifiers, probably without intending to ban names such as "null" and "class": http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#40625 I'm not Unicode-savvy enough to know what's behind all these classes, and I'm not Solr-savvy enough to know which of the punctuation characters that have special meaning in the Solr query language should or must be excluded. So first the spec, then the implementation (regular expression, or some schema technology to validate the field names in schema.xml, or whatever).
          Hide
          Robert Muir added a comment -

          those others you mentioned also look good...

          in fact the last one you mentioned (the java one) might be the easiest to implement

          http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierStart(char)
          http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierPart(char)

          i'm not worried about punctuation just real "letters" in languages where .isLetter() returns false!

          Show
          Robert Muir added a comment - those others you mentioned also look good... in fact the last one you mentioned (the java one) might be the easiest to implement http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierStart(char ) http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierPart(char ) i'm not worried about punctuation just real "letters" in languages where .isLetter() returns false!
          Hide
          Shalin Shekhar Mangar added a comment - - edited

          In order to produce a correct patch, I need to know what are legal field names. It can hardly be "any UTF-8 string" as that will also contain the colon, which is already used to delimit field names from query strings. What about digits? Asterisk? Dash (minus)? Underscore? Space? Tabulator?

          Lucene does not limit the field names. Those special characters are actually limitations of our query parser syntax. However, you are right, we need to view them from Solr's point of view. Let us try to limit this to valid Java identifiers or the closest that we can get to them.

          Show
          Shalin Shekhar Mangar added a comment - - edited In order to produce a correct patch, I need to know what are legal field names. It can hardly be "any UTF-8 string" as that will also contain the colon, which is already used to delimit field names from query strings. What about digits? Asterisk? Dash (minus)? Underscore? Space? Tabulator? Lucene does not limit the field names. Those special characters are actually limitations of our query parser syntax. However, you are right, we need to view them from Solr's point of view. Let us try to limit this to valid Java identifiers or the closest that we can get to them.
          Hide
          Michael Ludwig added a comment -

          FYI, there is a nice Unicode web tool here: http://rishida.net/scripts/uniview/

          Java identifiers exclude dash (minus) and dot ( - and . ); they allow $, € and other currencies.

          An XML NMTOKEN excludes currency symbols, but allows dash, dot, middle dot, underscore, and colon. It also allows Arabic numerals [0-9] at the beginning.

          Colons must be excluded for Solr purposes. But I wouldn't exclude dash and dot.

          Fields are entered in XML (schema.xml), so why not base the type on an XML type? Validation would be easy:

          <!DOCTYPE solr-test[
          <!ELEMENT solr-test EMPTY >
          <!ATTLIST solr-test field NMTOKEN #REQUIRED>
          ]>
          <solr-test field=" 123.Käse-A_Z "/>

          Note the leading and trailing spaces around the attribute value; the XML parser strips these when validating using an NMTOKEN type, so this user error can be excluded fairly simple. The absence of any colon, however, would have to be guaranteed by some other means. Still, I think there are advantages.

          If ensuring the uniqueness of a field name in a schema.xml matters, one could also consider using the NAME type and defining field/@name as ID in the DTD. This would exclude dash, dot, middle dot and Arabic numerals as start characters.

          I think I could supply a patch for NMTOKEN or NAME if this is found desirable.

          Show
          Michael Ludwig added a comment - FYI, there is a nice Unicode web tool here: http://rishida.net/scripts/uniview/ Java identifiers exclude dash (minus) and dot ( - and . ); they allow $, € and other currencies. An XML NMTOKEN excludes currency symbols, but allows dash, dot, middle dot, underscore, and colon. It also allows Arabic numerals [0-9] at the beginning. Colons must be excluded for Solr purposes. But I wouldn't exclude dash and dot. Fields are entered in XML (schema.xml), so why not base the type on an XML type? Validation would be easy: <!DOCTYPE solr-test[ <!ELEMENT solr-test EMPTY > <!ATTLIST solr-test field NMTOKEN #REQUIRED> ]> <solr-test field=" 123.Käse-A_Z "/> Note the leading and trailing spaces around the attribute value; the XML parser strips these when validating using an NMTOKEN type, so this user error can be excluded fairly simple. The absence of any colon, however, would have to be guaranteed by some other means. Still, I think there are advantages. If ensuring the uniqueness of a field name in a schema.xml matters, one could also consider using the NAME type and defining field/@name as ID in the DTD. This would exclude dash, dot, middle dot and Arabic numerals as start characters. I think I could supply a patch for NMTOKEN or NAME if this is found desirable.
          Hide
          Shalin Shekhar Mangar added a comment -

          I think I could supply a patch for NMTOKEN or NAME if this is found desirable.

          Please do!

          Show
          Shalin Shekhar Mangar added a comment - I think I could supply a patch for NMTOKEN or NAME if this is found desirable. Please do!
          Hide
          Michael Ludwig added a comment -

          In the original regular expression - (??!(\w+:|\d+)))\w+ - what is the rationale behind treating a sequence of digits - \d+ - as a separate token?

          Show
          Michael Ludwig added a comment - In the original regular expression - (? ?!(\w+:|\d+)))\w+ - what is the rationale behind treating a sequence of digits - \d+ - as a separate token?
          Hide
          Michael Ludwig added a comment -

          A smiley destroyed my regular expression ...

          Show
          Michael Ludwig added a comment - A smiley destroyed my regular expression ...
          Hide
          Michael Ludwig added a comment -

          The patch has been ready for a week now, but before supplying it, I'd like to understand what is the rationale behind treating a sequence of digits - \d+ - as a separate token in the original regular expression - (??!(\w+:|\d+)))\w+ to be found in SpellingQueryConverter.java?

          Show
          Michael Ludwig added a comment - The patch has been ready for a week now, but before supplying it, I'd like to understand what is the rationale behind treating a sequence of digits - \d+ - as a separate token in the original regular expression - (? ?!(\w+:|\d+)))\w+ to be found in SpellingQueryConverter.java?
          Hide
          Shalin Shekhar Mangar added a comment -

          The patch has been ready for a week now, but before supplying it, I'd like to understand what is the rationale behind treating a sequence of digits - \d+ - as a separate token in the original regular expression

          I'm sorry I forgot about this issue completely. I'm not sure about the rationale. Grant wrote it originally. Grant, can you share your thoughts?

          Show
          Shalin Shekhar Mangar added a comment - The patch has been ready for a week now, but before supplying it, I'd like to understand what is the rationale behind treating a sequence of digits - \d+ - as a separate token in the original regular expression I'm sorry I forgot about this issue completely. I'm not sure about the rationale. Grant wrote it originally. Grant, can you share your thoughts?
          Hide
          David Bowen added a comment -

          Since this ticket is marked resolved, I filed SOLR-1407 to point out some closely related problems.

          Show
          David Bowen added a comment - Since this ticket is marked resolved, I filed SOLR-1407 to point out some closely related problems.
          Hide
          Shalin Shekhar Mangar added a comment -

          Since this ticket is marked resolved, I filed SOLR-1407 to point out some closely related problems.

          Yes, that is how I remembered this one

          Show
          Shalin Shekhar Mangar added a comment - Since this ticket is marked resolved, I filed SOLR-1407 to point out some closely related problems. Yes, that is how I remembered this one
          Hide
          Michael Ludwig added a comment -

          Shalin, thanks for reporting back If I marked the issue resolved, that wasn't my intention. Must have overlooked it. I'm going to post the NMTOKEN version I've come up with to SOLR-1407 so it's not lost. Up to you to consider whether it's reasonable or not. The question about the sequence of digits is still entirely unclear to me, though.

          Show
          Michael Ludwig added a comment - Shalin, thanks for reporting back If I marked the issue resolved, that wasn't my intention. Must have overlooked it. I'm going to post the NMTOKEN version I've come up with to SOLR-1407 so it's not lost. Up to you to consider whether it's reasonable or not. The question about the sequence of digits is still entirely unclear to me, though.
          Hide
          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4

          Show
          Grant Ingersoll added a comment - Bulk close for Solr 1.4

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Michael Ludwig
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development