Details
Type: Task
Status: Resolved
Priority: Trivial
Resolution: Fixed
Description
In the recent regression tests, we found a small handful of docs now identified as rfc822.
One example comes from PDFBox's JIRA (https://issues.apache.org/jira/browse/PDFBOX-2976):
https://issues.apache.org/jira/secure/attachment/12757260/sc-356376.pdf
As Tilman notes on the issue, the file actually includes HTTP response headers before the PDF content:
HTTP/1.1 200 OK
Cache-Control: private
Pragma: Public
Content-Type: application/pdf; charset=UTF-8
Server: Microsoft-IIS/7.5
Set-Cookie: ASP.NET_SessionId=ibc3nfydvyfh1z55zqis2q3y; path=/; HttpOnly
Content-Disposition: inline; filename=_MTR_AGHS_EN.pdf
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Fri, 18 Sep 2015 17:30:08 GMT
Content-Length: 56779
%PDF-1.4
I'm not sure how or if we want to fix these. I'm going to look at the others.
Update: yes, the others also come with HTTP headers: https://corpora.tika.apache.org/base/docs/commoncrawl3/QL/QLPQA77R36REFEF3ICLL2NPTXWJXKV54
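
For illustration only, here is a minimal Python sketch of one way such files could be spotted: check whether the file starts with an HTTP/1.x status line and, if so, look for the `%PDF-` marker within the first few kilobytes. This is not how the issue was resolved in Tika. The helper name `split_http_preamble` and the `max_scan` limit are hypothetical, chosen just for this example.

```python
import re

# Hypothetical helper, not Tika's actual fix.
# Matches an HTTP/1.x status line such as b"HTTP/1.1 200 OK" at the start of the data.
_STATUS_LINE = re.compile(rb"HTTP/1\.[01] \d{3}")
_PDF_MAGIC = b"%PDF-"


def split_http_preamble(data: bytes, max_scan: int = 8192):
    """Split HTTP response headers from a PDF payload that follows them.

    Returns (preamble, payload) when the data starts with an HTTP status line
    and a PDF marker is found within the first max_scan bytes (an assumed limit).
    Otherwise returns (b"", data) unchanged.
    """
    if not _STATUS_LINE.match(data):
        return b"", data
    offset = data.find(_PDF_MAGIC, 0, max_scan)
    if offset == -1:
        return b"", data
    return data[:offset], data[offset:]


if __name__ == "__main__":
    sample = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: application/pdf; charset=UTF-8\r\n"
        b"Content-Length: 56779\r\n"
        b"\r\n"
        b"%PDF-1.4\n"
    )
    preamble, payload = split_http_preamble(sample)
    print(len(preamble), payload[:8])
```

A detector could then run its normal magic-byte checks on the payload instead of the raw file. Whether that is desirable is exactly the open question above, since the file as stored is not a valid PDF.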