Tika / TIKA-2669

Tika JAX-RS PDF parser option / custom config issue

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.18
    • Fix Version/s: 1.19, 2.0.0
    • Component/s: config
    • Labels: None

    Description

      PDF parsing with a custom config file behaves differently in Tika app than in Tika server. Tika server reads the custom config file, but the PDF parser options in it are not applied.

      Here is an excerpt of output from the app:

      <p>WINS No: B29017 APACHE 27-38 UNIT 1H Date: 5/4/2017
      </p>
      <p>AFE No: 1704555 Daily Completion and Workover Report DOL:
      </p>

      However, with the same configuration file, the output from Tika server is:

      <p>Daily Completion and Workover Report
      </p>
      <p>WINS No:
      </p>
      <p>AFE No:
      </p>
      <p>Date:
      </p>
      <p>DOL:
      </p>
      <p>APACHE 27-38 UNIT B29017
      </p>
      <p>1704555
      </p>
      <p>5/4/2017
      </p>

      The Tika config is:

      <?xml version="1.0" encoding="UTF-8"?>
      <properties>
        <parsers>
          <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
              <param name="sortByPosition" type="bool">true</param>
            </params>
          </parser>
        </parsers>
      </properties>
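
To rule out a problem with the config file itself, it may help to confirm what the file declares before comparing app and server behavior. The following is a minimal sketch using only Python's standard library. It does not call Tika. The helper name `parser_params` is hypothetical, and the config from this report is embedded as a string so the sketch is self-contained.

```python
import xml.etree.ElementTree as ET

# The config from this report, embedded so the sketch runs on its own.
TIKA_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def parser_params(config_xml):
    """Return {parser class: {param name: (type, value)}} for a Tika config string.

    Hypothetical helper for illustration; it only reads the XML and says
    nothing about whether Tika app or Tika server applies the values.
    """
    root = ET.fromstring(config_xml.encode("utf-8"))
    result = {}
    for parser in root.findall("./parsers/parser"):
        params = {}
        for param in parser.findall("./params/param"):
            value = (param.text or "").strip()
            params[param.get("name")] = (param.get("type"), value)
        result[parser.get("class")] = params
    return result


if __name__ == "__main__":
    for parser_class, params in parser_params(TIKA_CONFIG).items():
        print(parser_class, params)
```

If the option appears here with the expected type and value, the file is declaring it correctly, and the difference described above lies in how the server applies the parsed config rather than in the XML.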

People

    • Assignee: Tim Allison (tallison)
    • Reporter: Annie Didier (adidier)