Description
PDFParser returns self-closing tags for <a/> and <p/>. Self-closing syntax is not valid HTML for these elements, so the output does not render correctly in any browser.
In the example below, <a href="https://wiki.apache.org/tika/TikaJAXRS"/> should be <a href="https://wiki.apache.org/tika/TikaJAXRS"></a>.
We have tested PDFs converted from both Word and Google Docs documents, with the same results. This is an example of the output we get when parsing a PDF document with a link:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2016-11-07T07:51:14Z"/>
<meta name="pdf:PDFVersion" content="1.5"/>
<meta name="xmp:CreatorTool" content="Microsoft® Word 2016"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2016-11-07T07:51:14Z"/>
<meta name="Last-Modified" content="2016-11-07T07:51:14Z"/>
<meta name="dcterms:modified" content="2016-11-07T07:51:14Z"/>
<meta name="dc:format" content="application/pdf; version=1.5"/>
<meta name="xmpMM:DocumentID" content="uuid:7C86A62C-A4B2-464A-AAEC-5524E170E2AF"/>
<meta name="Last-Save-Date" content="2016-11-07T07:51:14Z"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="meta:save-date" content="2016-11-07T07:51:14Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="modified" content="2016-11-07T07:51:14Z"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="meta:creation-date" content="2016-11-07T07:51:14Z"/>
<meta name="created" content="Mon Nov 07 07:51:14 UTC 2016"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2016-11-07T07:51:14Z"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="producer" content="Microsoft® Word 2016"/>
<meta name="access_permission:can_modify" content="true"/>
<title></title>
</head>
<body>
<div class="page">
<p/>
<p>This is a word document, converted to pdf. 
</p>
<p>Example link: https://wiki.apache.org/tika/TikaJAXRS </p>
<p> </p>
<p/>
<div class="annotation">
<a href="https://wiki.apache.org/tika/TikaJAXRS"/>
</div>
</div>
</body>
</html>
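Until the parser output changes, one possible client-side workaround is to post-process the returned XHTML and expand the self-closed <a/> and <p/> elements into explicit open/close pairs. The following is only a minimal sketch and is not part of Tika: the class name SelfClosingTagFixer and the method expand are hypothetical, and the regex approach assumes machine-generated output like the example above (no "/>" inside attribute values). Void elements such as <meta/> are left untouched.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Tika class): rewrites self-closed <a/> and <p/>
// elements as explicit open/close pairs so browsers parse them as intended.
class SelfClosingTagFixer {

    // Group 1: tag name (a or p). Group 2: the attribute text, possibly empty.
    // [^<>] keeps the match inside a single tag.
    private static final Pattern SELF_CLOSED =
            Pattern.compile("<(a|p)((?:\\s[^<>]*?)?)\\s*/>");

    static String expand(String xhtml) {
        // "<a href='x'/>" becomes "<a href='x'></a>"; "<p/>" becomes "<p></p>".
        return SELF_CLOSED.matcher(xhtml).replaceAll("<$1$2></$1>");
    }

    public static void main(String[] args) {
        String fragment = "<div class=\"annotation\"> "
                + "<a href=\"https://wiki.apache.org/tika/TikaJAXRS\"/> </div>";
        System.out.println(expand(fragment));
    }
}
```

Note that this only makes the anchor well-formed for HTML parsers. The expanded <a></a> still has no link text, so it is not visible on the page.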
Issue Links
- is related to TIKA-2029 Add link string to hrefs in PDF (Open)