Key: NUTCH-44
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Dennis Kubes
Reporter: Emilijan Mirceski
Votes: 4
Watchers: 2
Nutch

too many search results

Created: 18/Apr/05 06:38 AM   Updated: 19/Feb/08 04:44 PM
Component/s: web gui
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
NUTCH-44-2-20080215.patch (text file, 3 kB), 2008-02-16 12:27 AM, Dennis Kubes, licensed for inclusion in ASF works
NUTCH-44.patch (text file, 3 kB), 2007-09-08 11:24 AM, Susam Pal, licensed for inclusion in ASF works
Environment: web environment

Resolution Date: 18/Feb/08 06:39 AM


Description
There should be a limitation (user defined) on the number of results the search engine can return.

For example, if one modifies the search URL as:
http://<my>/search.jsp?query=<some query>&hitsPerPage=20000&hitsPerSite=0

The search will try to return 20,000 hits on a single page, which hurts server-side performance.

Is it possible to have a setting in the config xml files to control this?

Thanks,
Emilijan



Doug Cutting added a comment - 19/Apr/05 01:36 AM
I agree. There should be a limit in the config file. By default the limit should be 1000 hits. A patch, anyone?

byron miller added a comment - 30/Apr/05 02:43 AM
I am working on some code that I will submit over the weekend to set a max value for hits per page.

I discovered this to be a serious issue with OpenSearch as well, since some people were sucking down wayyyyy too many records!


Sami Siren added a comment - 01/Feb/06 04:22 AM
Byron, have you made any progress with this?

Stefan Neufeind added a comment - 25/May/06 12:52 AM
hi,
any progress on this?

Susam Pal added a comment - 08/Sep/07 09:55 AM
Attached a patch.

To apply:-

patch -p0 < NUTCH-44.patch
ant war
cp build/nutch*war $CATALINA_HOME/webapps/ROOT.war


Susam Pal made changes - 08/Sep/07 09:55 AM
Field Original Value New Value
Attachment NUTCH-44.patch [ 12365394 ]
Susam Pal made changes - 08/Sep/07 11:07 AM
Attachment NUTCH-44.patch [ 12365394 ]
Susam Pal added a comment - 08/Sep/07 11:24 AM
Updated my previous patch to fix the issue in opensearch too.

To apply:-

patch -p0 < NUTCH-44.patch
ant war
cp build/nutch*war $CATALINA_HOME/webapps/ROOT.war


Susam Pal made changes - 08/Sep/07 11:24 AM
Attachment NUTCH-44.patch [ 12365397 ]
Dennis Kubes made changes - 15/Feb/08 09:19 PM
Assignee Dennis Kubes [ musepwizard ]
Dennis Kubes added a comment - 15/Feb/08 09:27 PM
+1 on this. If nobody has any objections, I will commit it tomorrow morning.

Andrzej Bialecki added a comment - 15/Feb/08 09:53 PM
The name of the property is somewhat misleading, because it applies to both the Web GUI and the OpenSearch servlet. Can we come up with a better name (and a shorter one too)?

Also, this patch doesn't solve the whole issue, though it addresses the specific scenario described by the reporter. In general, even if hitsPerPage is small, it is still very expensive to retrieve a page of results far down the list, e.g. results 10000-10010. Currently Nutch will attempt to retrieve 10 results no matter what the starting point is, which represents a potential way to launch a DoS attack. Still, we can first fix this issue, and address this problem in a new issue.


Dennis Kubes added a comment - 16/Feb/08 12:24 AM
Do you mean that when you query for, say, the second page and the max is 1000, the query actually searches for 2000 results? I noticed this as well. I don't know what the way to prevent this would be, though, except maybe not allowing that deep of a search.

Dennis Kubes added a comment - 16/Feb/08 12:27 AM
Updated patch. Changed the name to searcher.max.hits.per.page (yes, still long, but the best I could come up with given the givens), and also updated the patch to the current SVN. This has been tested and run through fetch and search cycles on Linux.
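
For readers following the thread, the idea behind the property can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patch: the class `HitsPerPageLimit`, the method `clamp`, the use of `java.util.Properties` in place of the Nutch configuration object, and the treatment of a non-positive value as "no limit" are all hypothetical. The default of 1000 follows the suggestion made earlier in this issue.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not taken from the NUTCH-44 patch.
final class HitsPerPageLimit {
    // Property name as given in the comment above.
    static final String MAX_HITS_KEY = "searcher.max.hits.per.page";
    // Default of 1000 as suggested earlier in this issue.
    static final int DEFAULT_MAX_HITS = 1000;

    /** Returns the requested page size, capped at the configured maximum. */
    static int clamp(int requested, Properties conf) {
        int max = Integer.parseInt(
            conf.getProperty(MAX_HITS_KEY, String.valueOf(DEFAULT_MAX_HITS)));
        // Assumption: a non-positive maximum disables the limit.
        if (max > 0 && requested > max) {
            return max;
        }
        return requested;
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // no override, so the default applies
        // hitsPerPage=20000 as in the reporter's example URL
        System.out.println(clamp(20000, conf));
    }
}
```

With a cap like this applied wherever hitsPerPage is read from the query string, a request such as the reporter's would be served with at most the configured number of hits per page.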

Dennis Kubes made changes - 16/Feb/08 12:27 AM
Attachment NUTCH-44-2-20080215.patch [ 12375733 ]
Andrzej Bialecki added a comment - 16/Feb/08 09:45 AM
+1 on the patch. Yes, if a user requests page number 1000, and hitsPerPage is 10, then Nutch has to retrieve at least 10010 hits (without even considering the site de-duping!), discard the first 10000, and retrieve HitDetails for the last 10. So I think that in any case Nutch should limit the maximum hit number to a reasonable value (defaulting to a few thousand). You can try to retrieve results above 1000 from any major search engine to see that they all implement such limits.
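
The arithmetic in the comment above can be made concrete with a small sketch. It is an illustration only: the class `DeepPagingCost`, its methods, and the `maxTotalHits` cap are hypothetical names, and no such overall limit is part of the patch discussed here.

```java
// Hypothetical helper illustrating the deep-paging cost; not Nutch code.
final class DeepPagingCost {
    /** Hits that must be collected to serve hitsPerPage results beginning at offset start. */
    static long hitsToCollect(long start, int hitsPerPage) {
        return start + hitsPerPage;
    }

    /** True if the request stays within an assumed overall cap on collected hits. */
    static boolean withinLimit(long start, int hitsPerPage, long maxTotalHits) {
        return hitsToCollect(start, hitsPerPage) <= maxTotalHits;
    }

    public static void main(String[] args) {
        // Offset 10000 with 10 hits per page, as in the comment above
        System.out.println(hitsToCollect(10000, 10));
        // The same request checked against an assumed cap of 2000 hits
        System.out.println(withinLimit(10000, 10, 2000));
    }
}
```

A check of this kind would reject or truncate requests far down the result list before any hits are collected, which is the kind of limit the comment argues for.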

Subversion Commits
Repository: ASF
Revision: #628631
Date: Mon Feb 18 06:38:46 UTC 2008
User: kubes
Message: NUTCH-44 - Too many search results. Configurable limit on max number of search results returned. Thanks Emilijan Mirceski and Susam Pal.
Files Changed:
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java
MODIFY /lucene/nutch/trunk/src/web/jsp/search.jsp
MODIFY /lucene/nutch/trunk/conf/nutch-default.xml
MODIFY /lucene/nutch/trunk/CHANGES.txt

Dennis Kubes added a comment - 18/Feb/08 06:39 AM
I just committed this. Thanks Emilijan Mirceski and Susam Pal.

Dennis Kubes made changes - 18/Feb/08 06:39 AM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Hudson added a comment - 19/Feb/08 04:44 PM