Doug Cutting added a comment - 19/Apr/05 01:36 AM
I agree. There should be a limit in the config file. By default the limit should be 1000 hits. A patch, anyone?
I am working on some code that I will submit over the weekend to set a max value for hits per page.
I discovered this to be a serious issue with OpenSearch as well, since some people were pulling down way too many records! Byron, have you made any progress with this?
Hi, any progress on this?
+1 on this. If nobody has any objections to this, I will commit it tomorrow morning.
The name of the property is somewhat misleading, because it applies to both the Web GUI and the OpenSearch servlet. Can we come up with a better name (and a shorter one too)?
Also, this patch doesn't solve the whole issue, though it addresses the specific scenario described by the reporter. In general, even if hitsPerPage is small, it is still very expensive to retrieve a page of results far down the list, e.g. results 10000-10010. Currently Nutch will attempt to retrieve 10 results no matter what the starting point is, which is a potential way to launch a DoS attack. Still, we can fix this issue first and address that problem in a new issue.
Do you mean that when you query for, say, the second page and the max is 1000, the query actually searches for 2000 results? I noticed this as well, although I don't know how to prevent it, except maybe by not allowing a search that deep.
Updated patch: changed the name to searcher.max.hits.per.page (yes, still long, but the best I could come up with given the givens) and updated the patch to the current SVN. This has been tested and run through fetch and search cycles on Linux.
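For readers following along, here is a minimal sketch of what a per-page cap amounts to. It is not the code from the attached patch: the class and method names are hypothetical, and treating a non-positive maximum as "no limit" is an assumption. Only the property name searcher.max.hits.per.page comes from this thread.

```java
// Hypothetical helper for illustration; not taken from the attached patch.
final class HitsPerPageLimit {
    private HitsPerPageLimit() {}

    /**
     * Clamps the requested page size to the configured maximum
     * (the value one would read from searcher.max.hits.per.page).
     * Assumption: a non-positive maxHitsPerPage means "no limit".
     */
    static int clamp(int requested, int maxHitsPerPage) {
        // Negative page sizes make no sense; treat them as zero.
        int safe = Math.max(requested, 0);
        if (maxHitsPerPage > 0 && safe > maxHitsPerPage) {
            return maxHitsPerPage;
        }
        return safe;
    }
}
```

With a maximum of 1000, a request for 5000 hits per page would be served as 1000, while a request for 10 is left untouched.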
+1 on the patch. Yes, if a user requests page number 1000, and hitsPerPage is 10, then Nutch has to retrieve at least 10010 hits (without even considering the site de-duping!), discard the first 10000, and retrieve HitDetails for the last 10. So I think that in any case Nutch should limit the maximum hit number to a reasonable value (defaulting to a few thousand). You can try to retrieve results above 1000 from any major search engine to see that they all implement such limits.
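The arithmetic above can be sketched as follows. This is only an illustration of the cost and of a possible overall cap, not Nutch code: the class, the method names, and the maxTotalHits parameter are all hypothetical.

```java
// Hypothetical illustration of the deep-paging cost; not Nutch code.
final class DeepPagingCost {
    private DeepPagingCost() {}

    /**
     * Number of hits the searcher must collect to serve one page:
     * everything before the page (discarded) plus the page itself.
     */
    static long hitsToCollect(long start, int hitsPerPage) {
        return start + hitsPerPage;
    }

    /**
     * Returns true if the request stays within an overall hit limit.
     * maxTotalHits is an assumed setting, not an existing property.
     */
    static boolean withinLimit(long start, int hitsPerPage, long maxTotalHits) {
        return hitsToCollect(start, hitsPerPage) <= maxTotalHits;
    }
}
```

Here hitsToCollect(10000, 10) gives 10010, matching the example above, and with an overall limit of 1000 that request would be rejected while a request starting at 990 would still be served.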
I just committed this. Thanks Emilijan Mirceski and Susam Pal.
Integrated in Nutch-trunk #363 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/363/)