Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: web gui
    • Labels:
      None
    • Environment:

      web environment

      Description

      There should be a user-defined limit on the number of results the search engine can return.

      For example, if one modifies the search URL as:
      http://<my>/search.jsp?query=<some query>&hitsPerPage=20000&hitsPerSite=0

      The search will try to return 20,000 pages, which is bad for server-side performance.

      Is it possible to have a setting in the config xml files to control this?

      Thanks,
      Emilijan

      1. NUTCH-44.patch
        3 kB
        Susam Pal
      2. NUTCH-44-2-20080215.patch
        3 kB
        Dennis Kubes

        Activity

        cutting Doug Cutting added a comment -

        I agree. There should be a limit in the config file. By default the limit should be 1000 hits. A patch, anyone?

        byronm byron miller added a comment -

        I am working on some code that I will submit over the weekend to set a max value for hits per page.

        I discovered this to be a serious issue with the opensearch as well since some people were sucking down wayyyyy too many records!

        siren Sami Siren added a comment -

        Byron, have you made any progress with this?

        neufeind Stefan Neufeind added a comment -

        hi,
        any progress on this?

        susam Susam Pal added a comment -

        Attached a patch.

        To apply:-

        patch -p0 < NUTCH-44.patch
        ant war
        cp build/nutch*war $CATALINA_HOME/webapps/ROOT.war

        susam Susam Pal added a comment -

        Updated my previous patch to fix the issue in opensearch too.

        To apply:-

        patch -p0 < NUTCH-44.patch
        ant war
        cp build/nutch*war $CATALINA_HOME/webapps/ROOT.war

        musepwizard Dennis Kubes added a comment -

        +1 on this. If nobody has any objections to this I will commit it tomorrow morning

        ab Andrzej Bialecki added a comment -

        The name of the property is somewhat misleading, because it applies to both the Web GUI and the OpenSearch servlet. Can we come up with a better name (and a shorter one too)?

        Also, this patch doesn't solve the whole issue, though it addresses the specific scenario described by the reporter. In general, even if hitsPerPage is small, it is still very expensive to retrieve a page of results far down the list, e.g. results 10000-10010. Currently Nutch will attempt to retrieve 10 results no matter what the starting point is, which represents a potential way to launch a DoS attack. Still, we can first fix this issue, and address this problem in a new issue.

        musepwizard Dennis Kubes added a comment -

        Do you mean that when you query, say, the second page and the max is 1000, the query actually searches for 2000 results? I noticed this as well, although I don't know how to prevent it, except maybe by not allowing that deep a search.

        musepwizard Dennis Kubes added a comment -

        Updated the patch: changed the property name to searcher.max.hits.per.page (yes, still long, but the best I could come up with given the givens) and brought the patch up to date with the current SVN. This has been tested and run through fetch and search cycles on Linux.
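
        For readers following along, here is a minimal sketch of how a limit like this could be applied to the hitsPerPage request parameter. It is not the code from the attached patches. It assumes the limit is read from a plain java.util.Properties object standing in for the Nutch configuration, and the class name, method names, and the fallback of 1000 (taken from the default suggested earlier in this issue) are all hypothetical.

```java
import java.util.Properties;

final class MaxHitsPerPage {
    // Property name taken from the comment above.
    static final String KEY = "searcher.max.hits.per.page";
    // Assumed fallback; 1000 is the default suggested earlier in this issue.
    static final int FALLBACK_MAX = 1000;

    /** Reads the configured limit; falls back when missing, non-numeric, or below 1. */
    static int configuredMax(Properties conf) {
        String raw = conf.getProperty(KEY);
        if (raw == null) {
            return FALLBACK_MAX;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value >= 1 ? value : FALLBACK_MAX;
        } catch (NumberFormatException e) {
            return FALLBACK_MAX;
        }
    }

    /** Clamps the raw hitsPerPage parameter to [1, max]; bad input yields the default. */
    static int effectiveHitsPerPage(String rawParam, int defaultHits, int max) {
        int fallback = Math.min(defaultHits, max);
        if (rawParam == null) {
            return fallback;
        }
        try {
            int requested = Integer.parseInt(rawParam.trim());
            return requested < 1 ? fallback : Math.min(requested, max);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "1000");
        int max = configuredMax(conf);
        // The reporter's URL asked for hitsPerPage=20000.
        System.out.println(effectiveHitsPerPage("20000", 10, max)); // prints 1000
    }
}
```

        Falling back to the default on malformed or non-positive input, rather than failing the request, is a design choice of this sketch only.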

        ab Andrzej Bialecki added a comment -

        +1 on the patch. Yes, if a user requests page number 1000, and hitsPerPage is 10, then Nutch has to retrieve at least 10010 hits (without even considering the site de-duping!), discard the first 10000, and retrieve HitDetails for the last 10. So I think that in any case Nutch should limit the maximum hit number to a reasonable value (defaulting to a few thousand). You can try to retrieve results above 1000 from any major search engine to see that they all implement such limits.
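
        The arithmetic in this comment can be sketched as follows. This is only an illustration of the deep-paging cost and of a cap on the total number of hits collected. It is not Nutch code: the class and method names are hypothetical, and the cap of 5000 is an assumed stand-in for "a few thousand".

```java
final class DeepPaging {
    /** Hits the searcher must collect to serve one page starting at offset start. */
    static long hitsToCollect(long start, int hitsPerPage) {
        return start + hitsPerPage;
    }

    /** True when serving the page stays within the overall cap on collected hits. */
    static boolean withinLimit(long start, int hitsPerPage, long maxTotalHits) {
        return start >= 0
                && hitsPerPage > 0
                && hitsToCollect(start, hitsPerPage) <= maxTotalHits;
    }

    public static void main(String[] args) {
        int hitsPerPage = 10;
        long start = 1000L * hitsPerPage;  // page number 1000 starts at offset 10000
        long maxTotalHits = 5000;          // assumed cap, "a few thousand"

        System.out.println(hitsToCollect(start, hitsPerPage));             // prints 10010
        System.out.println(withinLimit(start, hitsPerPage, maxTotalHits)); // prints false
        System.out.println(withinLimit(0, hitsPerPage, maxTotalHits));     // prints true
    }
}
```

        Checking start + hitsPerPage against a single cap bounds the work per request regardless of how the two parameters are combined.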

        musepwizard Dennis Kubes added a comment -

        I just committed this. Thanks Emilijan Mirceski and Susam Pal.

        hudson Hudson added a comment -

        Integrated in Nutch-trunk #363 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/363/)

        markus17 Markus Jelsma added a comment -

        Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

          People

          • Assignee:
            musepwizard Dennis Kubes
            Reporter:
            emilijan Emilijan Mirceski
          • Votes:
            4
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:
