ManifoldCF / CONNECTORS-1155

Web connector should not be sending the port number in request header field Host

Details

    Description

      The web connector includes the port number in the Host request header (e.g. Host: www.apache.org:443), even when it is the default port for the scheme. This causes server-side redirect rules that match on the host name to fail. The port number should not be part of the Host header.

      On the other hand, RFC 2616 section 14.23 (http://tools.ietf.org/html/rfc2616#section-14.23) says “The Host request-header field specifies the Internet host and port number of the resource being requested [...]”.
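
      The same RFC section also states that a host without trailing port information implies the default port for the requested service, so omitting a default port is permitted. The behaviour requested here could be sketched as follows. This is a minimal illustration, not ManifoldCF's actual code: the class name HostHeaderSketch and the methods defaultPort and hostHeader are hypothetical, and only http and https are handled.

```java
import java.net.URI;

public class HostHeaderSketch {
    // Hypothetical helper: default port for a scheme, or -1 if the scheme is unknown.
    static int defaultPort(String scheme) {
        if ("http".equalsIgnoreCase(scheme)) return 80;
        if ("https".equalsIgnoreCase(scheme)) return 443;
        return -1;
    }

    // Hypothetical helper: builds the Host header value for a URL,
    // leaving out the port when it is absent or is the scheme's default.
    static String hostHeader(URI uri) {
        String host = uri.getHost();
        int port = uri.getPort(); // -1 when the URL carries no explicit port
        if (port == -1 || port == defaultPort(uri.getScheme())) {
            return host;
        }
        return host + ":" + port;
    }

    public static void main(String[] args) {
        // prints "www.apache.org"
        System.out.println(hostHeader(URI.create("https://www.apache.org:443/")));
        // prints "www.apache.org:8443"
        System.out.println(hostHeader(URI.create("https://www.apache.org:8443/")));
    }
}
```

      A non-default port is kept in the sketch because the server needs it to identify the requested resource; only the redundant default port is dropped, which matches what the browser and cURL sent in my tests.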

      I encountered this issue while trying to crawl a customer’s website. The very first request to the seed URL returned a redirect that pointed back to the original URL itself, and the job ended without fetching anything. The Simple History showed only status 301 and nothing else. Maybe the web connector does not follow the link in the redirect correctly?

      The redirect couldn't be triggered otherwise: I tried a browser and cURL. ManifoldCF's web connector was the only one sending the port number with the Host header and wasn't able to crawl the website due to this behavior.

      I was able to work around this issue by collaborating with the contractor who hosts the customer's website; he added an exception for these requests. In general, though, I think this should be fixed, as such collaboration is not always possible.

          People

            Karl Wright (kwright@metacarta.com)
            Denis Beck