TIKA-3364: PDF Content is extracted twice

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.26
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      Hi

      While investigating an issue in the FSCrawler project, I noticed that Tika extracts the text of a PDF document more than once, although PDFBox on its own seems to extract it only once.

      The PDF is attached as issue-1097.pdf.

      When I run:

      wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
      java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
      

      I'm getting:

      Dummy PDF file
      

      But with Tika:

      wget https://downloads.apache.org/tika/tika-app-1.26.jar
      java -jar tika-app-1.26.jar
      

      I'm getting the text twice, once inside the page <div> and once more in a trailing <ul> list:

      <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <meta name="pdf:PDFVersion" content="1.4"/>
      <meta name="xmp:CreatorTool" content="Writer"/>
      <meta name="pdf:hasXFA" content="false"/>
      <meta name="access_permission:modify_annotations" content="true"/>
      <meta name="access_permission:can_print_degraded" content="true"/>
      <meta name="dc:creator" content="Evangelos Vlachogiannis"/>
      <meta name="dcterms:created" content="2007-02-23T15:56:37Z"/>
      <meta name="dc:format" content="application/pdf; version=1.4"/>
      <meta name="pdf:docinfo:creator_tool" content="Writer"/>
      <meta name="access_permission:fill_in_form" content="true"/>
      <meta name="pdf:encrypted" content="false"/>
      <meta name="Content-Length" content="13264"/>
      <meta name="X-TIKA:digest:MD5" content="2942bfabb3d05332b66eb128e0842cff"/>
      <meta name="pdf:hasMarkedContent" content="false"/>
      <meta name="Content-Type" content="application/pdf"/>
      <meta name="pdf:docinfo:creator" content="Evangelos Vlachogiannis"/>
      <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
      <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
      <meta name="creator" content="Evangelos Vlachogiannis"/>
      <meta name="meta:author" content="Evangelos Vlachogiannis"/>
      <meta name="meta:creation-date" content="2007-02-23T15:56:37Z"/>
      <meta name="created" content="2007-02-23T15:56:37Z"/>
      <meta name="X-TIKA:digest:SHA256" content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
      <meta name="access_permission:extract_for_accessibility" content="true"/>
      <meta name="access_permission:assemble_document" content="true"/>
      <meta name="xmpTPg:NPages" content="1"/>
      <meta name="Creation-Date" content="2007-02-23T15:56:37Z"/>
      <meta name="resourceName" content="issue-1097.pdf"/>
      <meta name="pdf:hasXMP" content="false"/>
      <meta name="access_permission:extract_content" content="true"/>
      <meta name="access_permission:can_print" content="true"/>
      <meta name="Author" content="Evangelos Vlachogiannis"/>
      <meta name="producer" content="OpenOffice.org 2.1"/>
      <meta name="access_permission:can_modify" content="true"/>
      <meta name="pdf:docinfo:producer" content="OpenOffice.org 2.1"/>
      <meta name="pdf:docinfo:created" content="2007-02-23T15:56:37Z"/>
      <title/>
      </head>
      <body><div class="page"><p/>
      <p>Dummy PDF file</p>
      <p/>
      </div>
      <ul>	<li>Dummy PDF file</li>
      </ul>
      </body></html>
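
The duplication in the XHTML above can be pinned down mechanically. The following is only a minimal sketch, not part of Tika or PDFBox: it parses a trimmed copy of the reported body (metadata omitted) with the JDK's DOM parser and lists every leaf element whose text is "Dummy PDF file". The class name `DuplicateTextCheck` and the helper `locate` are hypothetical names introduced for illustration. The name of the attached tika-bookmarks-config.xml suggests the `<ul>` list may come from the PDF's bookmarks, but that is an assumption and the sketch does not depend on it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DuplicateTextCheck {

    // Trimmed copy of the XHTML reported above; the <meta> elements are omitted.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
        + "<body><div class=\"page\"><p/>\n<p>Dummy PDF file</p>\n<p/>\n</div>\n"
        + "<ul>\t<li>Dummy PDF file</li>\n</ul>\n</body></html>";

    // Hypothetical helper: returns the element path (e.g. "html/body/div/p")
    // of every leaf element whose trimmed text equals the needle.
    static List<String> locate(String xhtml, String needle) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

        List<String> hits = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*"); // document order
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            boolean isLeaf = e.getElementsByTagName("*").getLength() == 0;
            if (isLeaf && needle.equals(e.getTextContent().trim())) {
                hits.add(pathOf(e));
            }
        }
        return hits;
    }

    // Builds a slash-separated path from the root element down to e.
    static String pathOf(Element e) {
        StringBuilder path = new StringBuilder(e.getTagName());
        for (Node p = e.getParentNode(); p instanceof Element; p = p.getParentNode()) {
            path.insert(0, ((Element) p).getTagName() + "/");
        }
        return path.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints one line per location where the text was emitted.
        for (String hit : locate(XHTML, "Dummy PDF file")) {
            System.out.println(hit);
        }
    }
}
```

Run against the trimmed output, this reports two locations, `html/body/div/p` and `html/body/ul/li`, which matches the observation that the same string is emitted once as page text and once as a list item.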
      

      Attachments

        1. issue-1097.pdf
          13 kB
          David Pilato
        2. Screenshot from 2021-04-23 10-15-22.png
          24 kB
          Tim Allison
        3. tika-bookmarks-config.xml
          1 kB
          Tim Allison

      People

        Assignee: Unassigned
        Reporter: David Pilato (dadoonet)