Nutch / NUTCH-2760

protocol-okhttp: properly record HTTP version in request message header


    Details

      Description

      With store.http.request=true, the plugin protocol-okhttp records the request message with the HTTP version received in the response, not the version actually sent in the request.

      Note that the HTTP version sent in the request may differ from that sent back in the response. One example (tracked using wget):

      > wget -d https://www.kp.ru/daily/27061/4129507/
      ...
      ---request begin---
      GET /daily/27061/4129507/ HTTP/1.1
      User-Agent: Wget/1.20.3 (linux-gnu)
      Accept: */*
      Accept-Encoding: identity
      Host: www.kp.ru
      Connection: Keep-Alive
      
      ---request end---
      HTTP request sent, awaiting response... 
      ---response begin---
      HTTP/1.0 200 OK
      ...
      

      protocol-okhttp also uses the response version ("HTTP/1.0") for the request:

      > bin/nutch parsechecker -Dstore.http.headers=true -Dstore.http.request=true \
           -Dplugin.includes='protocol-okhttp|parse-html' https://www.kp.ru/daily/27061/4129507/
      ...
      _request_=GET /daily/27061/4129507/ HTTP/1.0
      ...
      _response.headers_=HTTP/1.0 200 OK
      ...
      

      The protocol-http plugin tracks both versions correctly:

      > bin/nutch parsechecker -Dstore.http.headers=true -Dstore.http.request=true \
           -Dplugin.includes='protocol-http|parse-html' https://www.kp.ru/daily/27061/4129507/
      ...
      _request_=GET /daily/27061/4129507/ HTTP/1.1
      ...
      _response.headers_=HTTP/1.0 200 OK
      ...
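
      The expected behaviour can be sketched as follows: the recorded request line is built from the version the client sent, and the status line keeps the version the server returned. This is only a minimal illustration and not the plugin's code. HttpVersions and its methods are hypothetical names, and the sketch uses plain strings so that it does not depend on OkHttp. In OkHttp itself the response version is exposed by Response.protocol(); whether Connection.protocol() is the right source for the request version is an assumption that would need checking against the plugin.

```java
// Hypothetical helper for illustration; not part of Nutch or OkHttp.
final class HttpVersions {

    private HttpVersions() {}

    /** Builds the recorded request line from the version the client actually sent. */
    static String requestLine(String method, String path, String sentVersion) {
        return method + " " + path + " " + sentVersion;
    }

    /** Extracts the HTTP version from a request line (the last space-separated token). */
    static String requestVersion(String requestLine) {
        String trimmed = requestLine.trim();
        return trimmed.substring(trimmed.lastIndexOf(' ') + 1);
    }

    /** Extracts the HTTP version from a status line (the first space-separated token). */
    static String responseVersion(String statusLine) {
        String trimmed = statusLine.trim();
        int space = trimmed.indexOf(' ');
        return space < 0 ? trimmed : trimmed.substring(0, space);
    }
}
```

      Applied to the wget trace above, requestVersion("GET /daily/27061/4129507/ HTTP/1.1") yields "HTTP/1.1" while responseVersion("HTTP/1.0 200 OK") yields "HTTP/1.0", so the two values have to be tracked separately.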
      

              People

              • Assignee: Sebastian Nagel (snagel)
              • Reporter: Sebastian Nagel (snagel)