[TIKA-3427] Duplicate characters in some words - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 1.26
Fix Version/s: None
Component/s: tika-server
Labels: None
Environment: Windows 10 x64
Flags: Important

Description

When Tika Server extracts text from a PDF document, some words in the output contain duplicated characters and repeated word fragments.

I send the PDF in a POST request to a Tika Server running locally at http://localhost:9998/tika, with the PDF as the request body and the following headers:

Content-Type: application/pdf

X-Tika-PDFextractInlineImages: true

X-Tika-PDFOcrStrategy: ocr_and_text_extraction
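The request described above could be reproduced with a short script along these lines. This is a minimal sketch using only the Python standard library. The function names `build_tika_request` and `extract_text` are hypothetical, and the HTTP method is left as a parameter in case a given server version expects a different verb on this endpoint.

```python
import urllib.request

# Endpoint taken from the report; adjust if the server runs elsewhere.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL, method="POST"):
    """Build (but do not send) the request described in the report."""
    headers = {
        "Content-Type": "application/pdf",
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method=method)


def extract_text(pdf_path):
    """Send the PDF to a running Tika Server and return the response body as text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("1_PDFsam_FoundationOne_Liquid_Sample_Report.pdf")` requires a Tika Server listening on port 9998 and the attached file in the working directory.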

A PDF document is attached as an example.

The output looks like this (incorrect text was marked in red):

PPAATIENTTIENT

DISEASE Lung cancer (NOS)
NAME
DATE OF BIRTH
SEX Male
MEDICAL RECORD # Not given

PHYPHYSICIANSICIAN

ORDERING PHYSICIAN
MEDICAL FACILITY
ADDITIONAL RECIPIENT None
MEDICAL FACILITY ID
PATHOLOGIST Not Provided

SPESPECIMENCIMEN

SPECIMEN ID
SPECIMEN TYPE Blood
DATE OF COLLECTION
SPECIMEN RECEIVED
MEDIAN EXON COVERAGE

Biomarker Findings
MSI SMSI Statatus Undettus Undetermined.ermined.
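In the garbled lines above, each fragment appears twice in a row (for example `P`, `A`, `TIENT` in `PPAATIENTTIENT`). As a diagnostic aid only, the following sketch collapses such adjacent repeats so the intended text can be compared with the extracted text. The helper name `collapse_adjacent_repeats` is hypothetical, and the heuristic is lossy: it also collapses legitimate double letters, so it is not a fix for the extraction itself.

```python
import re

# Shortest run of characters that is immediately followed by a copy of itself.
_ADJACENT_REPEAT = re.compile(r"(.+?)\1")


def collapse_adjacent_repeats(text):
    """Replace every immediately repeated fragment with a single copy.

    Caveat: legitimate doubles are collapsed too ("Blood" becomes "Blod"),
    so use this only to inspect lines already known to be garbled.
    """
    return _ADJACENT_REPEAT.sub(r"\1", text)


for line in ["PPAATIENTTIENT", "PHYPHYSICIANSICIAN", "SPESPECIMENCIMEN"]:
    print(line, "->", collapse_adjacent_repeats(line))
```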

Attachments

1_PDFsam_FoundationOne_Liquid_Sample_Report.pdf
31/May/21 19:21
93 kB
Sal

People

Assignee: Unassigned

Reporter: Sal

Votes: 0

Watchers: 2

Dates

Created: 31/May/21 19:28

Updated: 01/Jun/21 13:57

Resolved: 01/Jun/21 13:57