NUTCH-2707

protocol-okhttp fails to decompress content if Content-Encoding header is wrong

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.15
    • Fix Version/s: 1.19
    • Component/s: plugin, protocol
    • Labels:
      None

      Description

      The plugin protocol-okhttp does not decompress the returned gzipped content for a few pages. This appears to happen because the HTTP response header is not Content-Encoding: gzip but Content-Encoding: zlib,gzip,deflate.

      % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
            -Dstore.http.headers=true -Dstore.http.request=true \
            http://24310.gr/afroditi-42426.html
      fetching: http://24310.gr/afroditi-42426.html 
      ...
      contentType: application/gzip
      ...
      Content Metadata: Transfer-Encoding=chunked ... Content-Encoding=zlib,gzip,deflate ... _request_=GET /afroditi-42426.html HTTP/1.1
      ...
      Accept-Encoding: gzip
      
       _response.headers_=HTTP/1.1 200 OK
      ...
      Content-Encoding: zlib,gzip,deflate
      ...
      Transfer-Encoding: chunked
      Connection: keep-alive
      

      The plugin protocol-http requests Accept-Encoding: x-gzip, gzip, deflate and gets the correct response header:

      % bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
             -Dstore.http.headers=true -Dstore.http.request=true http://24310.gr/afroditi-42426.html
      ...
      contentType: application/xhtml+xml
      ...
      Content Metadata: ... Content-Encoding=gzip ... _request_=GET /afroditi-42426.html HTTP/1.1
      Host: 24310.gr
      Accept-Encoding: x-gzip, gzip, deflate
      ...
      

      The same holds for Firefox, which sends Accept-Encoding: gzip, deflate.

      I will report the issue upstream to okhttp. It would also be possible to handle the content encoding in the protocol implementation: if the Accept-Encoding header is set explicitly, the okhttp library does not decompress the content and expects the calling code to handle it.
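
A minimal sketch of what handling the content encoding in the calling code could look like, using only java.util.zip. This is not the plugin's actual code: the class name ContentEncodingSketch and its methods are hypothetical, and it assumes the response body is already available as a byte array. The header value is treated as a comma-separated list, and the body is gunzipped only if the list mentions gzip and the bytes start with the gzip magic number, so a value such as zlib,gzip,deflate is tolerated.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of Nutch or okhttp.
final class ContentEncodingSketch {

  /** True if the bytes start with the gzip magic number 0x1f 0x8b. */
  public static boolean looksLikeGzip(byte[] body) {
    return body.length >= 2
        && (body[0] & 0xff) == 0x1f && (body[1] & 0xff) == 0x8b;
  }

  /** True if the comma-separated header value contains the given token. */
  public static boolean hasToken(String headerValue, String token) {
    if (headerValue == null) return false;
    for (String t : headerValue.toLowerCase(Locale.ROOT).split(",")) {
      if (t.trim().equals(token)) return true;
    }
    return false;
  }

  /** Decodes the body according to a possibly malformed Content-Encoding. */
  public static byte[] decode(byte[] body, String contentEncoding) {
    try {
      boolean gzipDeclared = hasToken(contentEncoding, "gzip")
          || hasToken(contentEncoding, "x-gzip");
      if (gzipDeclared && looksLikeGzip(body)) {
        return readAll(new GZIPInputStream(new ByteArrayInputStream(body)));
      }
      // Only a plain "deflate" value is inflated (zlib-wrapped data assumed).
      if (contentEncoding != null
          && contentEncoding.trim().equalsIgnoreCase("deflate")) {
        return readAll(new InflaterInputStream(new ByteArrayInputStream(body)));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return body; // unknown or absent encoding: leave the content untouched
  }

  /** Reads a stream to the end and closes it. */
  private static byte[] readAll(InputStream in) throws IOException {
    try (InputStream stream = in) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[8192];
      int n;
      while ((n = stream.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }

  /** Test helper: gzip-compresses the given bytes. */
  public static byte[] gzip(byte[] data) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] page = "<html>ok</html>".getBytes(StandardCharsets.UTF_8);
    byte[] decoded = decode(gzip(page), "zlib,gzip,deflate");
    System.out.println(new String(decoded, StandardCharsets.UTF_8));
  }
}
```

Checking the magic number in addition to the header is a design choice of this sketch: it avoids a failure when a server declares gzip but sends uncompressed content, and it leaves genuine .gz downloads without a Content-Encoding header untouched.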

            People

            • Assignee:
              Unassigned
              Reporter:
              snagel Sebastian Nagel