Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now.

      1. or.patch
        5 kB
        Andrzej Bialecki
      2. or.patch
        5 kB
        Rob Young
      3. nutch_0.9_OR.patch
        80 kB
        Robert Buccigrossi

        Activity

        markus17 Markus Jelsma added a comment - Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira
        chrismattmann Chris A. Mattmann added a comment - pushing this out per http://bit.ly/c7tBv9
        rbuccigrossi Robert Buccigrossi added a comment -

        The problem with the 2007 patch is that it translates:

        brain apples OR oranges
        to
        +brain +apples oranges

        The included patch translates the phrase to

        +brain apples oranges

        (specifically it makes the previous clause non-mandatory, which I believe is the preferred behavior).

        I hope this helps.

        Robert Buccigrossi

        (I also blogged on it at http://blog.tcg.com/tcg/2009/03/adding-the-boolean-operator-or-to-nutch.html )
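
The difference between the two translations described above can be sketched in a few lines. This is a minimal illustration, not the code from either patch: the function name `translate` and the `relax_previous` flag are hypothetical, and the sketch assumes a flat query of whitespace-separated terms with no quoted phrases or field prefixes containing spaces.

```python
def translate(query: str, relax_previous: bool = True) -> str:
    """Translate a flat query into +required / optional clause notation.

    Hypothetical sketch: with relax_previous=True, OR makes the clause
    before it optional as well as the clause after it. With
    relax_previous=False, only the clause after OR becomes optional.
    """
    clauses = []       # each entry is [term, required]
    after_or = False   # True when the previous token was OR

    for token in query.split():
        if token == "OR":
            if relax_previous and clauses:
                clauses[-1][1] = False  # previous clause becomes optional
            after_or = True
            continue
        # A clause is required unless it directly follows OR.
        clauses.append([token, not after_or])
        after_or = False

    return " ".join(("+" if required else "") + term
                    for term, required in clauses)


print(translate("brain apples OR oranges"))
print(translate("brain apples OR oranges", relax_previous=False))
```

Under these assumptions the first call yields `+brain apples oranges` and the second yields `+brain +apples oranges`, which are the two outputs contrasted in this comment.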

        ab Andrzej Bialecki added a comment -

        The current patch is not sufficient to solve the issue - postponing to 1.1.

        steiny Sebastian Steinmetz added a comment -

        I've integrated this patch against 0.9 and it seems to have some problems. I'm not sure whether I should have tested it against trunk.

        The problem is the parsing itself. If I enter "foo OR bar" (without quotes), it gets rewritten to a Lucene query that reads "+foo bar", but I think it should be "foo bar".
        At least, if I change it by hand in debug mode to "foo bar", the result is the pages that I would expect to appear.

        I don't have any idea how to fix this yet. But if I find one, I'll let you know.

        bubblenut Rob Young added a comment -

        I've changed the patch slightly to work around the bug I mentioned earlier.
        Now the queries look like this
        name:"name value" OR name:"other value"
        and are expanded to
        +name:"name value" name:"other value"

        bubblenut Rob Young added a comment -

        Hi, I've found a bug in this patch. If I search for title:red OR"title:blue" I would expect it to be expanded to
        +title:"red" title:"blue", but in fact it expands to +title:"red" "title:blue", so there is no way to do term-specific queries.

        cutting Doug Cutting added a comment -

        Neither. It would end up as the Lucene query:

        +"search phrase" +category:cat1 category:cat2

        where category:cat2 is a non-required clause that just impacts ranking, not the set of documents returned.
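
The effect of a non-required clause can be illustrated with a toy matcher. This is a simplified model, not Lucene's implementation: the names `matches` and `search` are hypothetical, documents are plain sets of terms, and the score is just a count of matching clauses. It assumes the usual boolean-query rule that optional clauses only influence ranking when at least one required clause is present, and that at least one optional clause must match when there are no required clauses.

```python
def matches(doc_terms, clauses):
    """Decide whether a document matches a flat list of (term, required) clauses."""
    required = [term for term, req in clauses if req]
    optional = [term for term, req in clauses if not req]
    if required:
        # Optional clauses play no part in the match decision here.
        return all(term in doc_terms for term in required)
    # With no required clauses, at least one optional clause must match.
    return any(term in doc_terms for term in optional)


def search(docs, clauses):
    """Return matching document ids, best score first (toy score: clause hits)."""
    hits = []
    for doc_id, terms in docs.items():
        if matches(terms, clauses):
            score = sum(term in terms for term, _ in clauses)
            hits.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


docs = {"d1": {"foo"}, "d2": {"bar"}, "d3": {"foo", "bar"}}

print(search(docs, [("foo", True), ("bar", False)]))   # models "+foo bar"
print(search(docs, [("foo", False), ("bar", False)]))  # models "foo bar"
```

In this model `+foo bar` returns the same set of documents as `+foo` alone, only reordered, while `foo bar` also returns the document that contains only `bar`.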

        As for nested queries, parsing is only half the problem. The query filter plugins would need to be extended to handle such things, as they presently expect flat queries.

        The query "foo bar" currently expands to a Lucene query that looks something like:

        +(anchor:foo title:foo content:foo)
        +(anchor:bar title:bar content:bar)
        anchor:"foo bar"~10
        title:"foo bar"~1000
        content:"foo bar"~1000

        (The latter three boost scores when terms are nearer. Anchor proximity is limited, to keep from matching anchors from other documents.)

        So, how should (foo AND (bar OR baz)) expand? Probably something like:

        +(anchor:foo title:foo content:foo)
        +((anchor:bar title:bar content:bar)
        (anchor:baz title:baz content:baz))
        ... proximity boosting clauses?...

        And (foo OR (bar AND baz)) might expand to:

        (anchor:foo title:foo content:foo)
        (+(anchor:bar title:bar content:bar)
        +(anchor:baz title:baz content:baz))
        ... proximity boosting clauses?...

        This expansion is done by the query-basic plugin.
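
The nested expansions proposed above can be sketched as a small recursive function. This is an illustration only and not the query-basic plugin's code: the names `FIELDS`, `expand`, and `expand_query` are hypothetical, the query is assumed to be already parsed into a tree of `("AND" | "OR", [children])` tuples with plain strings as terms, and the proximity boosting clauses are left out.

```python
FIELDS = ("anchor", "title", "content")  # fields used in the examples above


def expand_term(term):
    """Expand one term into an optional clause per field, wrapped in parentheses."""
    return "(" + " ".join(f"{field}:{term}" for field in FIELDS) + ")"


def expand_clauses(op, children):
    """Join expanded children; AND marks each child as required with '+'."""
    prefix = "+" if op == "AND" else ""
    return " ".join(prefix + expand(child) for child in children)


def expand(node):
    """Expand a nested node (term or operator tuple) into a parenthesized group."""
    if isinstance(node, str):
        return expand_term(node)
    op, children = node
    return "(" + expand_clauses(op, children) + ")"


def expand_query(node):
    """Expand the top-level node without an enclosing pair of parentheses."""
    if isinstance(node, str):
        node = ("AND", [node])  # a bare term is treated as a single required clause
    op, children = node
    return expand_clauses(op, children)


print(expand_query(("AND", ["foo", ("OR", ["bar", "baz"])])))
print(expand_query(("OR", ["foo", ("AND", ["bar", "baz"])])))
```

The two printed strings correspond to the two expansions written out above, minus the proximity clauses. A real implementation would also have to decide how those boosts apply across nested groups, which the sketch does not attempt.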

        bubblenut Rob Young added a comment -

        How would this work in the following case?

        "search phrase" category:cat1 OR category:cat2

        would it end up as

        ("search phrase" AND category:cat1) OR category:cat2

        or as

        "search phrase" AND (category:cat1 OR category:cat2)

        ab Andrzej Bialecki added a comment -

        Correct - the only syntax element added in this patch is an OR clause. Nested queries like that are probably not high on the priority list, because they may be expensive to run, and they would also complicate the implementation of QueryFilter plugins. Anyway, improvements are welcome.

        niqueco Nicolás Lichtmaier added a comment -

        This patch doesn't seem to add support for nested clauses like this:

        "greenhouse effect" OR ( climate AND change )

        I don't know if this full boolean logic support is important. But I've been asked to implement it here... =(

        ab Andrzej Bialecki added a comment -

        Patch based on the discussion on the mailing list, and a description provided by Nguien Ngoc Giang. There's a bug in this patch - when OR is used inside a phrase a parse exception is thrown. I'm not a JavaCC wizard, so I didn't know how to fix it.


          People

          • Assignee:
            ab Andrzej Bialecki
            Reporter:
            ab Andrzej Bialecki
          • Votes:
            1
            Watchers:
            3
