NUTCH-481: http.content.limit is broken in the protocol-httpclient plugin


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None

    Description

      When using the protocol-httpclient plugin, the entire content of the requested URL is retrieved, regardless of the http.content.limit configuration setting. (The issue does not affect the protocol-http plugin.)

      For very large documents, this leads the Fetcher to believe that the FetcherThread is hung, and the Fetcher aborts its run, logging a warning about hung threads (Fetcher.java:433).
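
      For reference, the http.content.limit setting mentioned above is normally configured in conf/nutch-site.xml; the value below is illustrative (65536 bytes is the usual default):

          <property>
            <name>http.content.limit</name>
            <value>65536</value>
            <description>The length limit for downloaded content, in bytes.</description>
          </property>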

      org.apache.nutch.protocol.httpclient.HttpResponse counts the content length correctly and breaks out of its read loop at the right point.

      However, when HttpResponse closes the InputStream it is reading from, that InputStream (an org.apache.commons.httpclient.AutoCloseInputStream) goes on to read the rest of the document's content from the web server.

      Though I'm not certain this is the correct solution, a quick test shows that if HttpResponse is changed to abort the GET, the InputStream correctly aborts the read from the web server, and the FetcherThread can continue.
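
      For illustration, here is a minimal sketch of that approach against the commons-httpclient 3.x API (GetMethod, getResponseBodyAsStream(), abort(), and releaseConnection() are real API calls; the class name, buffer size, and hard-coded limit are made up for the example, and this is not the actual Nutch code or the attached patch):

          import java.io.ByteArrayOutputStream;
          import java.io.IOException;
          import java.io.InputStream;

          import org.apache.commons.httpclient.HttpClient;
          import org.apache.commons.httpclient.methods.GetMethod;

          public class ContentLimitFetchSketch {

            // Illustrative stand-in for the http.content.limit setting.
            private static final int CONTENT_LIMIT = 64 * 1024;

            public static byte[] fetch(String url) throws IOException {
              HttpClient client = new HttpClient();
              GetMethod get = new GetMethod(url);
              try {
                client.executeMethod(get);
                InputStream in = get.getResponseBodyAsStream();
                if (in == null) {
                  return new byte[0];
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int total = 0;
                int n;
                while ((n = in.read(buf)) != -1) {
                  out.write(buf, 0, n);
                  total += n;
                  if (total >= CONTENT_LIMIT) {
                    // Aborting the GET closes the connection immediately, instead
                    // of letting AutoCloseInputStream drain the rest of the
                    // response body when the stream is closed.
                    get.abort();
                    break;
                  }
                }
                return out.toByteArray();
              } finally {
                get.releaseConnection();
              }
            }
          }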

      Attachments

        1. abortatcontentlimit.patch (0.8 kB, charlie wanek)



            People

              Assignee: Dogacan Guney (dogacan)
              Reporter: charlie wanek (mrbalky)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: