Tika / TIKA-3427

Duplicate characters in some words

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 1.26
    • Fix Version/s: None
    • Component/s: tika-server
    • Labels: None
    • Environment: Windows 10 x64

    • Important

    Description

      When processing a PDF document to extract text using Tika Server, the output contains words with duplicated characters and partial words.

      I am sending the PDF in a POST request to a Tika Server running locally at http://localhost:9998/tika, with the PDF as the request body and the following headers:

      Content-Type: application/pdf

      X-Tika-PDFextractInlineImages: true

      X-Tika-PDFOcrStrategy: ocr_and_text_extraction
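
      For anyone trying to reproduce this, the request described above can be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not the reporter's actual client. The function names `build_tika_request` and `extract_text` are hypothetical; the URL and headers are the ones listed in this report. The HTTP method is left as a parameter because the report uses POST, while the Tika Server documentation's curl examples for this endpoint use PUT.

```python
import urllib.request

# URL and headers exactly as given in the report.
TIKA_URL = "http://localhost:9998/tika"
TIKA_HEADERS = {
    "Content-Type": "application/pdf",
    "X-Tika-PDFextractInlineImages": "true",
    "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
}


def build_tika_request(pdf_bytes, url=TIKA_URL, method="POST"):
    """Build (but do not send) the request with the PDF as the body.

    Hypothetical helper; method defaults to POST as in the report.
    """
    return urllib.request.Request(
        url, data=pdf_bytes, headers=dict(TIKA_HEADERS), method=method
    )


def extract_text(pdf_path, url=TIKA_URL, method="POST"):
    """Send the PDF at pdf_path to Tika Server and return the response text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), url, method)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

      With a server running locally, `extract_text("example.pdf")` (the file name is a placeholder) should return the extracted text for comparison with the output shown in this report.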

      A PDF document is attached as an example.

      The output looks like this (the incorrect text was highlighted in red):

       

      PPAATIENTTIENT

      DISEASE Lung cancer (NOS)
      NAME
      DATE OF BIRTH
      SEX Male
      MEDICAL RECORD # Not given

      PHYPHYSICIANSICIAN

      ORDERING PHYSICIAN
      MEDICAL FACILITY
      ADDITIONAL RECIPIENT None
      MEDICAL FACILITY ID
      PATHOLOGIST Not Provided

      SPESPECIMENCIMEN

      SPECIMEN ID
      SPECIMEN TYPE Blood
      DATE OF COLLECTION
      SPECIMEN RECEIVED
      MEDIAN EXON COVERAGE

      Biomarker Findings
      MSI SMSI Statatus Undettus Undetermined.ermined.

          People

            Assignee: Unassigned
            Reporter: sallas Sal