Description
The plugin protocol-okhttp does not decompress the gzipped content returned for some rare pages. This appears to happen because the response does not specify the HTTP header Content-Encoding: gzip but Content-Encoding: zlib,gzip,deflate.
% bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
    -Dstore.http.headers=true -Dstore.http.request=true \
    http://24310.gr/afroditi-42426.html
fetching: http://24310.gr/afroditi-42426.html
...
contentType: application/gzip
...
Content Metadata:
  Transfer-Encoding=chunked
  ...
  Content-Encoding=zlib,gzip,deflate
  ...
  _request_=GET /afroditi-42426.html HTTP/1.1
  ...
  Accept-Encoding: gzip
  _response.headers_=HTTP/1.1 200 OK
  ...
  Content-Encoding: zlib,gzip,deflate
  ...
  Transfer-Encoding: chunked
  Connection: keep-alive
The plugin protocol-http sends Accept-Encoding: x-gzip, gzip, deflate and gets the correct response header:
% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
    -Dstore.http.headers=true -Dstore.http.request=true \
    http://24310.gr/afroditi-42426.html
...
contentType: application/xhtml+xml
...
Content Metadata:
  ...
  Content-Encoding=gzip
  ...
  _request_=GET /afroditi-42426.html HTTP/1.1
  Host: 24310.gr
  Accept-Encoding: x-gzip, gzip, deflate
...
The same holds for Firefox, which sends Accept-Encoding: gzip, deflate.
I will report the issue upstream to okhttp. But it would also be possible to handle the content encoding in the protocol implementation: if the Accept-Encoding request header is set explicitly, the okhttp library does not transparently decompress the response body and expects the calling code to handle it.
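Such calling-code handling could be sketched as follows, using only the JDK. The class and method names are hypothetical (not Nutch's actual code); the real implementation would read the Content-Encoding header and body bytes from the okhttp Response. Note that scanning the header value for a known token would also cope with the bogus zlib,gzip,deflate value shown above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: decompress a fetched body according to the
// Content-Encoding response header. Because the Accept-Encoding request
// header is set explicitly, okhttp hands us the raw compressed bytes.
public class HttpDecompressor {

  public static byte[] decompress(byte[] body, String contentEncoding)
      throws IOException {
    if (contentEncoding == null) {
      return body; // no encoding declared, pass through unchanged
    }
    String enc = contentEncoding.toLowerCase(Locale.ROOT);
    InputStream in;
    if (enc.contains("gzip")) { // also matches "x-gzip" and "zlib,gzip,deflate"
      in = new GZIPInputStream(new ByteArrayInputStream(body));
    } else if (enc.contains("deflate") || enc.contains("zlib")) {
      in = new InflaterInputStream(new ByteArrayInputStream(body));
    } else {
      return body; // "identity" or an unknown encoding: keep as-is
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    in.close();
    return out.toByteArray();
  }
}
```

For the page above, the body would be unpacked via the gzip branch because the header value zlib,gzip,deflate contains the token gzip.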