Tika / TIKA-3706

Add a parser for HTTPResponse?

Details

    • Type: Task
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.0
    • Component/s: None
    • Labels: None

    Description

      In the recent regression tests, we found a small handful of docs now identified as rfc822.

       

      One example comes from PDFBox's Jira (https://issues.apache.org/jira/browse/PDFBOX-2976):

      https://issues.apache.org/jira/secure/attachment/12757260/sc-356376.pdf

       

      As Tilman notes on the issue, the file actually includes HTTP headers before the PDF content:

      HTTP/1.1 200 OK
      Cache-Control: private
      Pragma: Public
      Content-Type: application/pdf; charset=UTF-8
      Server: Microsoft-IIS/7.5
      Set-Cookie: ASP.NET_SessionId=ibc3nfydvyfh1z55zqis2q3y; path=/; HttpOnly
      Content-Disposition: inline; filename=_MTR_AGHS_EN.pdf
      X-AspNet-Version: 2.0.50727
      X-Powered-By: ASP.NET
      Date: Fri, 18 Sep 2015 17:30:08 GMT
      Content-Length: 56779
      
      %PDF-1.4 
      

      I'm not sure how, or if, we want to fix these. I'm going to look at the others. Update: yes, the others also come with HTTP headers: https://corpora.tika.apache.org/base/docs/commoncrawl3/QL/QLPQA77R36REFEF3ICLL2NPTXWJXKV54
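
      One possible direction, sketched here only as an illustration and not as the fix that was actually committed: check whether a file starts with an HTTP/1.x status line and, if so, locate the blank line that ends the header block so the embedded payload (the PDF in the example above) can be handed to a normal parser. The class HttpResponsePrefix and its methods are hypothetical names and are not part of Tika's API; the sketch assumes the headers are plain ASCII and end in either CRLF CRLF or LF LF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: finds where the payload starts
// in a file that was saved together with its HTTP response headers.
public class HttpResponsePrefix {

    // True if the data begins with an HTTP/1.x status line such as "HTTP/1.1 200 OK".
    static boolean startsWithStatusLine(byte[] data) {
        byte[] magic = "HTTP/1.".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the offset of the first byte after the blank line that ends
    // the header block, or -1 if the data does not look like an HTTP
    // response or no blank line is found.
    static int bodyOffset(byte[] data) {
        if (!startsWithStatusLine(data)) {
            return -1;
        }
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] != '\n') {
                continue;
            }
            if (data[i + 1] == '\n') {
                return i + 2; // LF LF
            }
            if (data[i + 1] == '\r' && i + 2 < data.length && data[i + 2] == '\n') {
                return i + 3; // (CR)LF CR LF
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Abbreviated version of the headers quoted in this issue.
        byte[] sample = ("HTTP/1.1 200 OK\r\n"
                + "Content-Type: application/pdf; charset=UTF-8\r\n"
                + "Content-Length: 56779\r\n"
                + "\r\n"
                + "%PDF-1.4\n").getBytes(StandardCharsets.US_ASCII);
        int offset = bodyOffset(sample);
        // The bytes at the returned offset should be the PDF header.
        String head = new String(sample, offset, 8, StandardCharsets.US_ASCII);
        System.out.println("payload starts with: " + head);
    }
}
```

      A real parser would also need to decide what to do with the headers themselves (for example, whether to surface Content-Type or Content-Disposition as metadata) and how to handle chunked or compressed bodies, none of which this sketch attempts.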

          People

            Assignee: Unassigned
            Reporter: Tim Allison (tallison)