ManifoldCF / CONNECTORS-1155

Web connector should not be sending the port number in request header field Host

    Details

      Description

      The web connector sends the port number in the request header field Host (e.g. Host: www.apache.org:443). This causes redirect rules for the host name to fail. The port number should not be part of the Host header.

      On the other hand, RFC 2616 section 14.23 (http://tools.ietf.org/html/rfc2616#section-14.23) says “The Host request-header field specifies the Internet host and port number of the resource being requested [...]”.
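
      The same RFC section also states that a host without trailing port information implies the default port for the requested service, so omitting a default port is allowed. The requested behavior can be sketched as follows. This is only an illustration in Python, not ManifoldCF code: the function name host_header_value and the DEFAULT_PORTS table are hypothetical, and only http and https are assumed.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only these two schemes matter.
# A Host value without a port implies the scheme's default port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header_value(url: str) -> str:
    """Return a Host header value for url, omitting a default port.

    Hypothetical helper for illustration; not part of ManifoldCF.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # urlsplit strips the brackets from IPv6 literals; put them back.
    if ":" in host:
        host = f"[{host}]"
    port = parts.port
    # Keep the port only when it is explicit and not the scheme default.
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return host
    return f"{host}:{port}"


print(host_header_value("https://www.apache.org:443/"))   # www.apache.org
print(host_header_value("http://www.apache.org:8080/"))   # www.apache.org:8080
```

      With this rule, a non-default port is still sent, so servers that listen on unusual ports continue to receive the information they need.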

      I encountered this issue while trying to crawl a customer’s website. The very first request to the seed URL returned a redirect that pointed back to the original URL, and the job ended without fetching anything. The Simple History showed only status 301. Maybe the web connector does not follow the link in the redirect correctly?

      The redirect could not be triggered any other way: I tried a browser and cURL. ManifoldCF's web connector was the only client that sent the port number in the Host header, and this behavior prevented it from crawling the website.

      The issue was worked around in collaboration with the contractor who hosts the customer's website: he added an exception for these requests. In general, though, I think this should be fixed, since such collaboration is not always possible.

            People

            • Assignee:
              Karl Wright (kwright@metacarta.com)
              Reporter:
              Denis Beck
