  1. Nutch
  2. NUTCH-2549 protocol-http does not behave the same as browsers
  3. NUTCH-2557

protocol-http fails to follow redirections when an HTTP response body is invalid


    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15
    • Component/s: None
    • Labels: None

      Description

      If a server sends a redirection (3XX status code, with a Location header), protocol-http tries to parse the HTTP response body anyway. Thus, if an error occurs while decoding the body, the redirection is not followed and the information is lost. Browsers follow the redirection and close the socket as soon as they can.

      • Example: this page redirects to its HTTPS version, with an HTTP body containing invalid gzip-encoded content. Browsers follow the redirection, but Nutch throws an error:

       

      The HttpResponse::getContent method can already return null. I think it should at least return null when parsing the HTTP response body fails.

      Ideally, we would adopt the same behavior as browsers, and not even try parsing the body when the headers indicate a redirection.
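
      The two behaviors proposed above could be sketched roughly as follows. This is only an illustration, not the actual protocol-http code: the class RedirectAwareBody and its methods isRedirect and readContent are hypothetical names, and the sketch assumes the status code, Location header value, and raw body bytes have already been read from the socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Nutch: illustrates the proposed behavior only.
class RedirectAwareBody {

    /** True for a 3XX status code accompanied by a Location header. */
    static boolean isRedirect(int statusCode, String location) {
        return statusCode >= 300 && statusCode < 400 && location != null;
    }

    /**
     * Returns the decoded body, or null when the body is skipped
     * (redirection) or cannot be decoded (e.g. invalid gzip data).
     */
    static byte[] readContent(int statusCode, String location,
                              byte[] rawBody, boolean gzipped) {
        // Ideal case: like browsers, do not even try to parse the body.
        if (isRedirect(statusCode, location)) {
            return null;
        }
        if (!gzipped) {
            return rawBody;
        }
        // Minimal case: a decoding failure yields null instead of an exception.
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(rawBody))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            return null;
        }
    }
}
```

      With this shape, a caller would still see the status code and Location header even when the body is unusable, so the redirection can be followed regardless of what the body contains.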

            People

            • Assignee: Unassigned
            • Reporter: Gerard Bouchar (gbouchar)