Tika / TIKA-2168

Incorrect <a> and <p> parsing in PdfParser


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: parser, server
    • Labels: None
    • Environment: Running Tika server 1.13 and testing via the HTTP API

    Description

      PDFParser returns self-closing tags for <a/> and <p/>. HTML does not support self-closing syntax for these elements, so the output does not render correctly in any browser. For example,

      <a href="https://wiki.apache.org/tika/TikaJAXRS"/>

      in the example output below should be

      <a href="https://wiki.apache.org/tika/TikaJAXRS"></a>

      We tested PDFs converted from both Word and Google Docs, with the same results. This is example output from parsing a PDF document that contains a link:

      <html xmlns="http://www.w3.org/1999/xhtml">
          <head>
              <meta name="date" content="2016-11-07T07:51:14Z"/>
              <meta name="pdf:PDFVersion" content="1.5"/>
              <meta name="xmp:CreatorTool" content="Microsoft&reg; Word 2016"/>
              <meta name="access_permission:modify_annotations" content="true"/>
              <meta name="access_permission:can_print_degraded" content="true"/>
              <meta name="dcterms:created" content="2016-11-07T07:51:14Z"/>
              <meta name="Last-Modified" content="2016-11-07T07:51:14Z"/>
              <meta name="dcterms:modified" content="2016-11-07T07:51:14Z"/>
              <meta name="dc:format" content="application/pdf; version=1.5"/>
              <meta name="xmpMM:DocumentID" content="uuid:7C86A62C-A4B2-464A-AAEC-5524E170E2AF"/>
              <meta name="Last-Save-Date" content="2016-11-07T07:51:14Z"/>
              <meta name="access_permission:fill_in_form" content="true"/>
              <meta name="meta:save-date" content="2016-11-07T07:51:14Z"/>
              <meta name="pdf:encrypted" content="false"/>
              <meta name="modified" content="2016-11-07T07:51:14Z"/>
              <meta name="Content-Type" content="application/pdf"/>
              <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
              <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
              <meta name="meta:creation-date" content="2016-11-07T07:51:14Z"/>
              <meta name="created" content="Mon Nov 07 07:51:14 UTC 2016"/>
              <meta name="access_permission:extract_for_accessibility" content="true"/>
              <meta name="access_permission:assemble_document" content="true"/>
              <meta name="xmpTPg:NPages" content="1"/>
              <meta name="Creation-Date" content="2016-11-07T07:51:14Z"/>
              <meta name="access_permission:extract_content" content="true"/>
              <meta name="access_permission:can_print" content="true"/>
              <meta name="producer" content="Microsoft&reg; Word 2016"/>
              <meta name="access_permission:can_modify" content="true"/>
              <title></title>
          </head>
          <body>
              <div class="page">
                  <p/>
                  <p>This is a word document, converted to pdf.  
      </p>
                  <p>Example link: https://wiki.apache.org/tika/TikaJAXRS 
      </p>
                  <p> </p>
                  <p/>
                  <div class="annotation">
                      <a href="https://wiki.apache.org/tika/TikaJAXRS"/>
                  </div>
              </div>
          </body>
      </html>
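
      Until a fixed version is available, a client could post-process the returned XHTML itself. The following is only a minimal sketch of that workaround, not the change made in Tika: the class name SelfClosingTagExpander is hypothetical, and the regular expression assumes simple, well-formed tags like those in the output above (no '<' or '>' inside attribute values). It rewrites self-closing non-void elements such as <a/> and <p/> into explicit start and end tags, and leaves void elements such as <meta/> untouched.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: expands self-closing non-void tags.
final class SelfClosingTagExpander {
    // HTML void elements, which legitimately have no end tag.
    private static final Set<String> VOID_ELEMENTS = Set.of(
        "area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr");

    // Matches <name attrs/>; group 1 is the tag name, group 2 the attributes (possibly empty).
    private static final Pattern SELF_CLOSING =
        Pattern.compile("<([A-Za-z][A-Za-z0-9]*)((?:\\s[^<>]*?)?)\\s*/>");

    static String expand(String xhtml) {
        Matcher m = SELF_CLOSING.matcher(xhtml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            String attrs = m.group(2);
            // Keep void elements as they are; give everything else an explicit end tag.
            String replacement = VOID_ELEMENTS.contains(name.toLowerCase(Locale.ROOT))
                ? m.group()
                : "<" + name + attrs + "></" + name + ">";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<div class=\"annotation\">"
            + "<a href=\"https://wiki.apache.org/tika/TikaJAXRS\"/></div>";
        System.out.println(expand(sample));
    }
}
```

      A regex pass like this is fragile compared with a real XML parser and HTML serializer; it is shown only because it makes the expected output shape explicit.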
      

    People

      Assignee: Unassigned
      Reporter: Sara Miller (semihappycoder)
      Votes: 0
      Watchers: 2