Droids / DROIDS-105

Missing caching for robots.txt


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: core
    • Labels: None

    Description

      The current implementation of the HttpClient does not cache requests for the robots.txt file. When the CrawlingWorker is used, this results in 2 requests for robots.txt (HEAD + GET) per crawled URL. Crawling 3 URLs therefore sends 6 robots.txt requests to the target server.
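To make the request arithmetic concrete, here is a minimal sketch of a per-URI cache placed in front of a robots.txt fetch. It is not the attached CachingContentLoader and does not use the Droids API: `RobotsFetcher`, `CachingRobotsFetcher`, and `RobotsCacheDemo` are hypothetical names, and the HEAD and GET requests are both modelled as a plain `fetch` call for simplicity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for whatever retrieves a resource body; not the Droids API. */
interface RobotsFetcher {
    byte[] fetch(URI uri) throws IOException;
}

/** Decorator that remembers each fetched body per URI, so the origin is hit only once. */
final class CachingRobotsFetcher implements RobotsFetcher {
    private final RobotsFetcher delegate;
    private final Map<URI, byte[]> cache = new ConcurrentHashMap<>();

    CachingRobotsFetcher(RobotsFetcher delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] fetch(URI uri) throws IOException {
        byte[] cached = cache.get(uri);
        if (cached != null) {
            return cached; // served from memory, no request to the origin
        }
        byte[] body = delegate.fetch(uri);
        cache.put(uri, body);
        return body;
    }
}

final class RobotsCacheDemo {
    /** Simulates crawling URLs on one host, with two robots.txt lookups per crawled URL. */
    static int countOriginRequests(int crawledUrls, boolean useCache) {
        AtomicInteger originRequests = new AtomicInteger();
        // Fake origin server: counts every request that actually reaches it.
        RobotsFetcher origin = uri -> {
            originRequests.incrementAndGet();
            return "User-agent: *\nDisallow:\n".getBytes(StandardCharsets.UTF_8);
        };
        RobotsFetcher fetcher = useCache ? new CachingRobotsFetcher(origin) : origin;
        URI robots = URI.create("http://example.org/robots.txt");
        try {
            for (int i = 0; i < crawledUrls; i++) {
                fetcher.fetch(robots); // stands in for the HEAD request
                fetcher.fetch(robots); // stands in for the GET request
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return originRequests.get();
    }

    public static void main(String[] args) {
        System.out.println("uncached: " + countOriginRequests(3, false));
        System.out.println("cached:   " + countOriginRequests(3, true));
    }
}
```

With 3 crawled URLs, the uncached path reaches the origin 6 times and the cached path once. A real cache would also need an expiry policy so that changes to robots.txt are eventually picked up; that is left out of this sketch.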

      Unfortunately, the contentLoader is declared final in HttpProtocol, so it cannot be replaced with a caching implementation such as the one in the attachment.
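One way the loader could be made replaceable is constructor injection, sketched below under stated assumptions. `Loader` and `SketchHttpProtocol` are hypothetical names chosen for illustration. They do not reproduce the real HttpProtocol class or the attached patch, whose contents are not shown in this issue.

```java
import java.util.Objects;

/** Hypothetical loader abstraction; the real Droids interface may look different. */
interface Loader {
    String load(String url);
}

/** Hypothetical protocol class that only illustrates constructor injection. */
class SketchHttpProtocol {
    // The field can stay final: the caller chooses the implementation up front.
    private final Loader loader;

    /** Default constructor keeps the existing, non-caching behaviour. */
    SketchHttpProtocol() {
        this(url -> "fetched:" + url);
    }

    /** Alternative constructor lets callers supply, for example, a caching loader. */
    SketchHttpProtocol(Loader loader) {
        this.loader = Objects.requireNonNull(loader, "loader");
    }

    String load(String url) {
        return loader.load(url);
    }
}
```

With this shape, `new SketchHttpProtocol(myCachingLoader)` swaps in a different loader without subclassing, and existing callers of the no-argument constructor are unaffected.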

      Attachments

        1. CachingContentLoader.java
          2 kB
          Paul Rogalinski
        2. Caching-Support-and-Robots_txt-fix.patch
          6 kB
          Paul Rogalinski

            People

              Assignee: Unassigned
              Reporter: Paul Rogalinski (pulsar256)

              Dates

                Created:
                Updated: