PyLucene / PYLUCENE-9

QueryParser replacing stop words with wildcards

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Labels:
      None
    • Environment:
      Windows XP 32-bit Sp3, Ubuntu 10.04.2 LTS i686 GNU/Linux, jdk1.6.0_23

      Description

      I was using QueryParser to build a query. In Java Lucene and Lucene.Net, the query "Calendar Item as Msg" (quotes included) is parsed as FullText:"calendar item msg". In PyLucene, it is parsed as FullText:"calendar item ? msg". This causes obvious problems when comparing search results from Python, Java and .NET.

      Initially, I thought it was the Analyzer I was using, but I've tried both StandardAnalyzer and StopAnalyzer, which work properly in Java and .NET but not in PyLucene.

      Here is code I've used to reproduce the issue:

      >>> from lucene import StandardAnalyzer, StopAnalyzer, QueryParser, Version
      >>> analyzer = StandardAnalyzer(Version.LUCENE_30)
      >>> query = QueryParser(Version.LUCENE_30, "FullText", analyzer)
      >>> parsedQuery = query.parse("\"Calendar Item as Msg\"")
      >>> parsedQuery
      <Query: FullText:"calendar item ? msg">
      >>> analyzer = StopAnalyzer(Version.LUCENE_30)
      >>> query = QueryParser(Version.LUCENE_30, "FullText", analyzer)
      >>> parsedQuery = query.parse("\"Calendar Item as Msg\"")
      >>> parsedQuery
      <Query: FullText:"calendar item ? msg">

      I've noticed this in pylucene 2.9.4, 2.9.3, and 3.0.3

        Activity

        Christopher Currens added a comment -

        We can close it. Thanks for the help.

        Andi Vajda added a comment -

        Hi Christopher,
        Have you figured this out yet?
        Can this bug be closed, or is there still something to be done for it?

        Christopher Currens added a comment -

        Hmm, the code I have is nearly identical, and when I pull it out of the containing code, it behaves as it should. I can't post the whole code, but I suppose the issue must be a lingering Version.LUCENE_24 somewhere. I'll try figuring it out on my own. I'm glad to see it's something idiotic I've done.
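
One way to hunt for a stray version constant is to scan the source text for every Version.LUCENE_* reference and list where each one appears. The following is a minimal sketch using only the Python standard library. The helper name find_versions and the sample snippet are hypothetical and are not part of PyLucene.

```python
import re

# Matches constants such as Version.LUCENE_24 or Version.LUCENE_CURRENT.
VERSION_PATTERN = re.compile(r"Version\.(LUCENE_\w+)")


def find_versions(source):
    """Map each Version constant to the 1-based line numbers where it appears."""
    found = {}
    for line_number, line in enumerate(source.splitlines(), start=1):
        for name in VERSION_PATTERN.findall(line):
            found.setdefault(name, []).append(line_number)
    return found


# Hypothetical snippet with one stale constant left over.
sample = """analyzer = StandardAnalyzer(Version.LUCENE_30)
parser = QueryParser(Version.LUCENE_24, "FullText", analyzer)
"""

for name, lines in sorted(find_versions(sample).items()):
    print(name, lines)
```

If more than one constant shows up, the analyzer and the parser may be running with different match versions, which is worth ruling out before comparing outputs.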

        Andi Vajda added a comment -

        So I wrote this simple class, foo:

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.queryParser.QueryParser;
        import org.apache.lucene.queryParser.ParseException;
        import org.apache.lucene.util.Version;

        public class foo {
            static void parse(Version version)
                throws org.apache.lucene.queryParser.ParseException
            {
                System.out.println(version + " " +
                    new QueryParser(version, "ft", new StandardAnalyzer(version))
                        .parse("\"Calendar Item as Msg\""));
            }

            static public void main(String[] args)
                throws org.apache.lucene.queryParser.ParseException
            {
                parse(Version.LUCENE_24);
                parse(Version.LUCENE_29);
                parse(Version.LUCENE_30);
                parse(Version.LUCENE_CURRENT);
            }
        }

        I then compiled it against the lucene-3.0.3 jar:
        $ javac -cp lucene-java-3.0.3/build/lucene-core-3.0.3.jar foo.java
        and then ran it against the same jars:
        $ java -cp lucene-java-3.0.3/build/lucene-core-3.0.3.jar:. foo
        LUCENE_24 ft:"calendar item msg"
        LUCENE_29 ft:"calendar item ? msg"
        LUCENE_30 ft:"calendar item ? msg"
        LUCENE_CURRENT ft:"calendar item ? msg"

        As you can see, the same behavior is seen without PyLucene, in plain Java. The parsing behavior you expect seems to happen only with Version.LUCENE_24. Please send Java code that reproduces the problem, as PyLucene seems out of the picture for now.
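
The version-dependent output above is consistent with stop words keeping their position in the phrase from Version.LUCENE_29 onward, so that the removed word shows up as a "?" hole. The following is a toy model of that idea in plain Python. It does not call Lucene, and it assumes that preserved position increments are the cause of the difference. The stop word set and the function names are illustrative only.

```python
# Toy model only: this does not call Lucene. It mimics how a stop filter
# that preserves position increments leaves a gap in a phrase query.
STOP_WORDS = {"a", "an", "and", "as", "the"}  # small illustrative subset


def analyze(text, keep_positions):
    """Return (term, position) pairs for the non-stop-word tokens."""
    tokens = []
    position = -1
    for word in text.lower().split():
        if word in STOP_WORDS:
            if keep_positions:
                position += 1  # stop word removed, but its slot is kept
            continue
        position += 1
        tokens.append((word, position))
    return tokens


def phrase_to_string(field, tokens):
    """Render a phrase like the outputs above, with '?' for each empty slot."""
    last = tokens[-1][1] if tokens else -1
    slots = ["?"] * (last + 1)
    for term, position in tokens:
        slots[position] = term
    return '%s:"%s"' % (field, " ".join(slots))


# Positions kept (modelled on LUCENE_29 and later): the gap is printed as '?'.
print(phrase_to_string("ft", analyze("Calendar Item as Msg", keep_positions=True)))
# Positions collapsed (modelled on LUCENE_24): the stop word simply vanishes.
print(phrase_to_string("ft", analyze("Calendar Item as Msg", keep_positions=False)))
```

In this model the "?" is not a wildcard term. It only marks a position in the phrase that no remaining term occupies.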

        Christopher Currens added a comment -

        I've posted a question to the java-lucene list, but I'm sure it won't help at all. The simple fact is that the lucene 3.0 jar parses the query as <ft:"calendar item msg">. The same lucene 3.0 jar, when invoked from pylucene, produces <ft:"calendar item ? msg"> for me, on both Windows and Ubuntu boxes.

        I suppose this just might be an issue with jcc? I've been able to reproduce this on my boxes at work and on my box at home, all producing the incorrect output. I'm most curious whether this can be reproduced by any pylucene developer, or if it's just some crazy environment issue happening on my boxes and those of everyone else I know.

        Andi Vajda added a comment -

        Could you please ask on the java-user@lucene.apache.org list what the expected behavior actually is, from Java Lucene's point of view, with Version.LUCENE_24, 29 and 30 passed to both the QueryParser and StandardAnalyzer constructors?
        I remember this changing at some point but I'm not sure when. Nor do I see, without further investigation, how PyLucene could be different there, as it "just invokes" the embedded Java Lucene jar. Thanks!

        Christopher Currens added a comment -

        I was very hesitant to report this as a bug, since pylucene isn't a port, rather just recompiled. I am positive I am comparing the correct versions (I'm a committer on Lucene.Net). I'll show you all the configurations I've done:

        Lucene.Net 2.9.2 - Valid
        Lucene.Net 2.9.4 - Valid
        Java Lucene (via Luke 1.0.1 (uses Lucene 2.9.4)) - Valid
        Java Lucene (via Luke 3.1.0 (uses > Lucene 3.0)) - Valid
        pyLucene (Lucene 2.9.2) - Invalid replaced by single Wildcard ('?')
        pyLucene (Lucene 2.9.4) - Invalid replaced by single Wildcard ('?')
        pyLucene (Lucene 3.0.3) - Invalid replaced by single Wildcard ('?')

        Those tests were all on the 32-bit Win-XP box. The Ubuntu box I've used was using pyLucene w/ lucene 2.9.2.

        One thing I hadn't considered, though, was to see if it can be replicated outside of the many machines I've used myself to test, specifically whether there's an issue with our building of it via JCC, or something in our environment. But considering I've tried it at work and at home, there's no real other place I can test it.

        Andi Vajda added a comment -

        Are you sure you're comparing the right versions?

        Lucene.Net is quite behind Java Lucene, and lots of things changed in more recent versions.
        For instance, trying different Version instances gives different results. Notably, LUCENE_24 works as you seem to expect:
        >>> qp = QueryParser(Version.LUCENE_29, "ft", StandardAnalyzer(Version.LUCENE_29))
        >>> qp.parse('"Calendar Item as Msg"')
        <Query: ft:"calendar item ? msg"> <-- the 'as' stop word gets replaced by a hole as expected in that version

        >>> qp = QueryParser(Version.LUCENE_24, "ft", StandardAnalyzer(Version.LUCENE_24))
        >>> qp.parse('"Calendar Item as Msg"')
        <Query: ft:"calendar item msg"> <-- works as Lucene.Net (probably, as I've never run it)

        I'm inclined to resolve this bug as INVALID unless I'm missing something here.
        Please, let me know.


          People

          • Assignee:
            Unassigned
            Reporter:
            Christopher Currens
          • Votes:
            0
            Watchers:
            0
