Lucene - Core
  1. Lucene - Core
  2. LUCENE-682

QueryParser with Locale Based Operators (French included)

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      Here is a version of the QueryParser that can "understand" the AND, OR and NOT keyword in other languages.

      If activated,

      • "a ET b" should return the same query as "a AND b", namely: "+a +b"
      • "a OU b" should return the same query as "a OR b", namely: "a b"
      • "a SAUF b" should return the same query as "a NOT b", namely: "a -b"

      Here are its main points :

      1) Patched from revision 454774 of lucene 2.1dev (trunk) (probably could be used with other versions)
      2) The "ant test" target is still successful when the modified QueryParser is used
      3) It doesn't break actual code
      4) The default behavior is the same as before
      5) It has to be deliberately activated
      6) It use ResourceBundle to find the keywords translation
      7) Comes with FRENCH translation
      8) Comes with JUnit testCases
      9) Adds 1 public method to QueryParser
      10) Expands the TOKEN <TERM>
      11) Use TOKEN_MGR_DECLS to set some field for the TokenManager

      1. LocalizedQueryParser.patch
        21 kB
        Hoss Man
      2. LocalizedQueryParser.patch
        19 kB
        Hoss Man
      3. LocalizedQueryParser.zip
        12 kB
        Patrick Turcotte
      4. LocalizedQueryParserDemo.java
        0.6 kB
        Patrick Turcotte
      5. LocalizedQueryParserOperatorsMicroBench.java
        3 kB
        Hoss Man
      6. QueryParser_fr.properties
        0.0 kB
        Patrick Turcotte
      7. QueryParser.jj
        30 kB
        Patrick Turcotte
      8. QueryParser.jj.patch
        4 kB
        Patrick Turcotte
      9. QueryParser.properties
        0.0 kB
        Patrick Turcotte
      10. TestQueryParserLocaleOperators.java
        3 kB
        Patrick Turcotte

        Activity

        Hide
        Patrick Turcotte added a comment -

        Sorry, didn't see the attach multiple files before.

        Show
        Patrick Turcotte added a comment - Sorry, didn't see the attach multiple files before.
        Hide
        Otis Gospodnetic added a comment -

        I like this and have a question. The createLocalizedTokenMap() method is called from that new setter method.
        Since QueryParser is not thread safe, one has to instantiate a new QP, set the Locale and call that setter before each parse(....) call. Unless ResourceBundle does some internal caching, doesn't this mean each parsed query will execute that createLocalizedTokenMap() method? Since the resource files are not likely to change, shouldn't we cache things?

        Show
        Otis Gospodnetic added a comment - I like this and have a question. The createLocalizedTokenMap() method is called from that new setter method. Since QueryParser is not thread safe, one has to instantiate a new QP, set the Locale and call that setter before each parse(....) call. Unless ResourceBundle does some internal caching, doesn't this mean each parsed query will execute that createLocalizedTokenMap() method? Since the resource files are not likely to change, shouldn't we cache things?
        Hide
        Patrick Turcotte added a comment -

        From what I read in the Javadoc of ResouceBundle, it does cache the values.

        Show
        Patrick Turcotte added a comment - From what I read in the Javadoc of ResouceBundle, it does cache the values.
        Hide
        Patrick Turcotte added a comment -

        New versions. All in the zip file. Improvements are:

        • By default, if used, using a localized version disable English tokens (AND, NOT, OR)
        • More than one operator may be define in the bundle (separated by ';')
        • &&, || and ! operators are always active.

        A note on ResourceBundle, as I had to do some test to understand the documentation:

        1) getStringArray() is really just a wrapper on (String[]) getObject() and throws exception on PropertiesResourceBundle and ListResourceBundle.

        2) The order to get a bundle is a little tricky. Javadoc should be read :

        • baseName + "" + selectedLanguage + "" + country1 + "_" + variant1
        • baseName + "" + selectedLanguage + "" + country1
        • baseName + "_" + selectedLanguage
        • baseName + "" + defaultLocaleLanguage + "" + country2 + "_" + variant2
        • baseName + "" + defaultLocaleLanguage + "" + country2
        • baseName + "_" + defaultLocaleLanguage
        • baseName
          Which means that if a bundle exists by that baseName for your defaultLocale (Locale.getDefault()), you'll get values from it instead of from the baseName bundle.

        Thanks to Chris Hostetter and Otis Gospodnetic for your comments and questions.

        Show
        Patrick Turcotte added a comment - New versions. All in the zip file. Improvements are: By default, if used, using a localized version disable English tokens (AND, NOT, OR) More than one operator may be define in the bundle (separated by ';') &&, || and ! operators are always active. A note on ResourceBundle, as I had to do some test to understand the documentation: 1) getStringArray() is really just a wrapper on (String[]) getObject() and throws exception on PropertiesResourceBundle and ListResourceBundle. 2) The order to get a bundle is a little tricky. Javadoc should be read : baseName + " " + selectedLanguage + " " + country1 + "_" + variant1 baseName + " " + selectedLanguage + " " + country1 baseName + "_" + selectedLanguage baseName + " " + defaultLocaleLanguage + " " + country2 + "_" + variant2 baseName + " " + defaultLocaleLanguage + " " + country2 baseName + "_" + defaultLocaleLanguage baseName Which means that if a bundle exists by that baseName for your defaultLocale (Locale.getDefault()), you'll get values from it instead of from the baseName bundle. Thanks to Chris Hostetter and Otis Gospodnetic for your comments and questions.
        Hide
        Hoss Man added a comment -

        LocalizedQueryParser.patch contains everything in the most recent LocalizedQueryParser.zip, except in patch form, with a few minor changes...
        1) moved files into tree as appropraite
        2) reformated (\t to 2 spaces, license header in test)
        3) included javacc generated QueryParser*.java since they
        need to be commited too.
        4) added <copy> directive to build.xml so property files
        would make it into jar (and classpath for tests)

        The code looks great and the unit tests provide nice coverage.

        Like Otis, I'm also a little worried about the createLocalizedTokenMap() calls ... the ResourceBundle lookups may be cached, but there's also the splitting, and the fact that createLocalizedTokenMap() is called in both setLocale and setUseLocalizedOperators ... setLocale might be used just for the date parsing, so that could result in wasted cycles for some people ... not to mention setBundleBaseName doesn't call createLocalizedTokenMap so people who do...

        QueryParser qp = new QueryParser
        qp.setLocale(...)
        qp.setUseLocalizedOperators(true)
        qp.setBundleBaseName(...)

        ...will wind up triggering createLocalizedTokenMap twice, and still not use the Bundle the wanted to.

        Even if we don't want to worry about caching the post-split arrays per Bundle, I think it might make more sense if setUseLocalizedOperators was the only method that called createLocalizedTokenMap, and it was documented that if should be called after setLocale and setBundleBaseName.

        ...any other opinions on this?

        Other things that should probably be done before commiting:

        a) have at least one test using a Bundle that excercises
        the splitting.
        b) add some javadoc verbage to QueryParser.setLocale
        clarifying how it affects the operators.

        ...I may be able to find some time to play with this some more later this week and make those changes myself.

        Show
        Hoss Man added a comment - LocalizedQueryParser.patch contains everything in the most recent LocalizedQueryParser.zip, except in patch form, with a few minor changes... 1) moved files into tree as appropraite 2) reformated (\t to 2 spaces, license header in test) 3) included javacc generated QueryParser*.java since they need to be commited too. 4) added <copy> directive to build.xml so property files would make it into jar (and classpath for tests) The code looks great and the unit tests provide nice coverage. Like Otis, I'm also a little worried about the createLocalizedTokenMap() calls ... the ResourceBundle lookups may be cached, but there's also the splitting, and the fact that createLocalizedTokenMap() is called in both setLocale and setUseLocalizedOperators ... setLocale might be used just for the date parsing, so that could result in wasted cycles for some people ... not to mention setBundleBaseName doesn't call createLocalizedTokenMap so people who do... QueryParser qp = new QueryParser qp.setLocale(...) qp.setUseLocalizedOperators(true) qp.setBundleBaseName(...) ...will wind up triggering createLocalizedTokenMap twice, and still not use the Bundle the wanted to. Even if we don't want to worry about caching the post-split arrays per Bundle, I think it might make more sense if setUseLocalizedOperators was the only method that called createLocalizedTokenMap, and it was documented that if should be called after setLocale and setBundleBaseName. ...any other opinions on this? Other things that should probably be done before commiting: a) have at least one test using a Bundle that excercises the splitting. b) add some javadoc verbage to QueryParser.setLocale clarifying how it affects the operators. ...I may be able to find some time to play with this some more later this week and make those changes myself.
        Hide
        Yonik Seeley added a comment -

        Does anyone know what is the likely performance impact is when not using this feature? It's not easy for me to tell at a glance.

        Show
        Yonik Seeley added a comment - Does anyone know what is the likely performance impact is when not using this feature? It's not easy for me to tell at a glance.
        Hide
        Hoss Man added a comment -

        Revised version of the patch – includes the changes so that only one method creates the lists; a test of the splitting logic; more javadocs clarifing the interaction of the methods.

        Show
        Hoss Man added a comment - Revised version of the patch – includes the changes so that only one method creates the lists; a test of the splitting logic; more javadocs clarifing the interaction of the methods.
        Hide
        Hoss Man added a comment -

        microbenchmark of the "default" (no ResourceBundle) usage, run against the current trunk, and with this change (to determine the performance costs of the added work in the Javacc generated code)

        two tests, one of a new instance for each parse call, one of reusing the same instance for 5 parse calls; 3 runs each, 10000 iterations each, time in seconds...

        1 1 1 5 5 5
        trunk: 1.897 1.904 1.913 7.415 7.447 7.446
        w/patch: 2.01 2.005 2.01 7.851 7.888 7.886

        ...doesn't seem like a big enough factor to worry about (unless i missed something obvious when i wrote the test ... i was on a plane at the time and slightly distracted by the very chatty woman next to me)

        Show
        Hoss Man added a comment - microbenchmark of the "default" (no ResourceBundle) usage, run against the current trunk, and with this change (to determine the performance costs of the added work in the Javacc generated code) two tests, one of a new instance for each parse call, one of reusing the same instance for 5 parse calls; 3 runs each, 10000 iterations each, time in seconds... 1 1 1 5 5 5 trunk: 1.897 1.904 1.913 7.415 7.447 7.446 w/patch: 2.01 2.005 2.01 7.851 7.888 7.886 ...doesn't seem like a big enough factor to worry about (unless i missed something obvious when i wrote the test ... i was on a plane at the time and slightly distracted by the very chatty woman next to me)
        Hide
        Hoss Man added a comment -

        issues has a modest number of notes, and seems to me like it would be very usefull as more language property files are contributed ... does any one have any objections to it being commited?

        NOTE: there are some negative performance impacts to QueryParser as a result of this patch.

        Show
        Hoss Man added a comment - issues has a modest number of notes, and seems to me like it would be very usefull as more language property files are contributed ... does any one have any objections to it being commited? NOTE: there are some negative performance impacts to QueryParser as a result of this patch.
        Hide
        Yonik Seeley added a comment -

        Frankly, I'm not excited about a 6% performance loss so that someone can customize
        a total of 3 tokens that don't add additional expressive power or features. AND, OR, and NOT, are short and easy to understand even for foreign-language speakers. Consider that to construct raw Lucene queries themselves, they would need to know Lucene, and for that, they will most likely have a passing familiarity with English anyway.

        I think this would be better implemented as a preprocessor, outside of the query parser.
        I don't think that would be too hard, and then there would be no performance impact for the 99% of people who will stick with AND/OR/NOT (or +/-)

        It might even be expressible as a regular expression.

        Maybe it's just me though, so I wouldn't mind hearing some other opinions.

        Show
        Yonik Seeley added a comment - Frankly, I'm not excited about a 6% performance loss so that someone can customize a total of 3 tokens that don't add additional expressive power or features. AND, OR, and NOT, are short and easy to understand even for foreign-language speakers. Consider that to construct raw Lucene queries themselves, they would need to know Lucene, and for that, they will most likely have a passing familiarity with English anyway. I think this would be better implemented as a preprocessor, outside of the query parser. I don't think that would be too hard, and then there would be no performance impact for the 99% of people who will stick with AND/OR/NOT (or +/-) It might even be expressible as a regular expression. Maybe it's just me though, so I wouldn't mind hearing some other opinions.
        Hide
        Yonik Seeley added a comment -

        Something like this perhaps:

        public static String change(String s, String AND, String OR, String NOT) {
        int len = s.length();
        StringBuilder b = new StringBuilder();
        boolean newString=false;
        boolean changed=false;
        boolean inString=false;
        char prev='!';
        for (int i=0; i<s.length(); i++) {
        char ch = s.charAt;
        switch (ch) {
        case '
        ' : b.append(s.charAt(++i)); break;
        case '\'' : inString=!inString; break;
        case 'A' :
        if (!inString
        && !Character.isJavaIdentifierPart(prev)
        && i+2 < s.length()
        && s.charAt(i+1) == 'N'
        && s.charAt(i+2) == 'D'
        && (i+3==s.length() || !Character.isJavaIdentifierPart(s.charAt(i+3))))

        { b.append(AND); changed=true; i+=2; }

        break;
        case 'O' :
        if (!inString
        && !Character.isJavaIdentifierPart(prev)
        && i+1 < s.length()
        && s.charAt(i+1) == 'R'
        && (i+2==s.length() || !Character.isJavaIdentifierPart(s.charAt(i+2))))

        { b.append(OR); changed=true; i+=1; }

        break;
        case 'N' :
        if (!inString
        && !Character.isJavaIdentifierPart(prev)
        && i+2 < s.length()
        && s.charAt(i+1) == 'O'
        && s.charAt(i+2) == 'T'
        && (i+3==s.length() || !Character.isJavaIdentifierPart(s.charAt(i+3))))

        { b.append(NOT); changed=true; i+=2; }

        break;
        default: break;
        }
        if (changed)

        { newString=true; changed=false; }

        else

        { b.append(ch); prev = ch; }

        }
        return newString ? b.toString() : s;
        }

        Show
        Yonik Seeley added a comment - Something like this perhaps: public static String change(String s, String AND, String OR, String NOT) { int len = s.length(); StringBuilder b = new StringBuilder(); boolean newString=false; boolean changed=false; boolean inString=false; char prev='!'; for (int i=0; i<s.length(); i++) { char ch = s.charAt ; switch (ch) { case ' ' : b.append(s.charAt(++i)); break; case '\'' : inString=!inString; break; case 'A' : if (!inString && !Character.isJavaIdentifierPart(prev) && i+2 < s.length() && s.charAt(i+1) == 'N' && s.charAt(i+2) == 'D' && (i+3==s.length() || !Character.isJavaIdentifierPart(s.charAt(i+3)))) { b.append(AND); changed=true; i+=2; } break; case 'O' : if (!inString && !Character.isJavaIdentifierPart(prev) && i+1 < s.length() && s.charAt(i+1) == 'R' && (i+2==s.length() || !Character.isJavaIdentifierPart(s.charAt(i+2)))) { b.append(OR); changed=true; i+=1; } break; case 'N' : if (!inString && !Character.isJavaIdentifierPart(prev) && i+2 < s.length() && s.charAt(i+1) == 'O' && s.charAt(i+2) == 'T' && (i+3==s.length() || !Character.isJavaIdentifierPart(s.charAt(i+3)))) { b.append(NOT); changed=true; i+=2; } break; default: break; } if (changed) { newString=true; changed=false; } else { b.append(ch); prev = ch; } } return newString ? b.toString() : s; }
        Hide
        Yonik Seeley added a comment -

        That's untested code of course... I just noticed that
        case '
        ' : b.append(s.charAt(++i)); break;
        case '\'' : inString=!inString; break;
        should probably be
        case '"': if (++i<len)

        {b.append(ch); ch=s.charAt(i);}

        break;
        case '\'' : inString=!inString; break;

        It can probably be made faster too... but the point is that it's independent of the query parser and more easily customized.

        Show
        Yonik Seeley added a comment - That's untested code of course... I just noticed that case ' ' : b.append(s.charAt(++i)); break; case '\'' : inString=!inString; break; should probably be case '"': if (++i<len) {b.append(ch); ch=s.charAt(i);} break; case '\'' : inString=!inString; break; It can probably be made faster too... but the point is that it's independent of the query parser and more easily customized.
        Hide
        Jan Høydahl added a comment -

        This issue has been inactive for more than 4 years. Please close if it's no longer relevant/needed, or bring it up to date if you intend to work on it. SPRING_CLEANING_2013

        Show
        Jan Høydahl added a comment - This issue has been inactive for more than 4 years. Please close if it's no longer relevant/needed, or bring it up to date if you intend to work on it. SPRING_CLEANING_2013

          People

          • Assignee:
            Hoss Man
            Reporter:
            Patrick Turcotte
          • Votes:
            3 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development