Description
The current implementation of the HttpClient does not cache any requests to the robots.txt file. When using the CrawlingWorker, this results in two requests to robots.txt (HEAD + GET) per crawled URL, so crawling 3 URLs hits the target server with 6 requests for robots.txt.
Unfortunately, the contentLoader is declared final in HttpProtocol, so there is no way to replace it with a caching Protocol like the one you'll find in the attachment.
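For illustration, here is a minimal sketch of the caching idea (the attached class itself is not reproduced here): fetch each host's robots.txt at most once and reuse the body for every subsequent check, instead of re-fetching it for every HEAD/GET. The class and method names (RobotsTxtCache, getRobotsTxt, fetch) are hypothetical and do not reflect Droids' actual API.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical per-host robots.txt cache: fetches each host's robots.txt
 * once and reuses the body for every later check against that host.
 */
public class RobotsTxtCache {

  // Keyed by "scheme://authority" so http and https entries stay separate.
  private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

  /** Returns the robots.txt body for the URI's host, fetching it only on first use. */
  public String getRobotsTxt(URI uri) {
    String hostKey = uri.getScheme() + "://" + uri.getAuthority();
    // computeIfAbsent triggers the network fetch at most once per host
    return cache.computeIfAbsent(hostKey, key -> fetch(key + "/robots.txt"));
  }

  private String fetch(String robotsUrl) {
    try (InputStream in = new URL(robotsUrl).openStream()) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      // Treat a missing or unreachable robots.txt as empty (everything allowed).
      return "";
    }
  }
}
{code}

A caching Protocol decorator could hold such a cache internally and consult it from its isAllowed check, which is exactly what the final contentLoader field currently prevents from being wired in.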
Attachments
Issue Links
- is blocked by DROIDS-103 "incorrect robots.txt path" (Closed)