Details
Bug
Status: Closed
Minor
Resolution: Not A Problem
1.26
None
None
Windows 10 x64
Important
Description
When Tika Server extracts text from a PDF document, the output contains words with duplicated characters and partial words.
I send the PDF in a POST request to the Tika Server running locally at http://localhost:9998/tika, with the PDF as the body of the request and the following headers:
Content-Type: application/pdf
X-Tika-PDFextractInlineImages: true
X-Tika-PDFOcrStrategy: ocr_and_text_extraction
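For anyone trying to reproduce this, the request described above could be sketched roughly as follows in Python using only the standard library. The URL, HTTP method, and the three headers are taken from this report; the function names `build_tika_request` and `extract_text` are hypothetical helpers, and the response is assumed to be UTF-8 text.

```python
import urllib.request

# Endpoint quoted in the report (Tika Server running locally).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) the request described in the report."""
    headers = {
        "Content-Type": "application/pdf",
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
    }
    # The report states that a POST request is used, so that is mirrored here.
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="POST")


def extract_text(pdf_path: str) -> str:
    """Send the PDF at pdf_path to the server and return the response body as text."""
    with open(pdf_path, "rb") as fh:
        request = build_tika_request(fh.read())
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers with UTF-8 encoded text.
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text` with the path to the attached PDF should return the text shown below, provided a Tika Server is listening on port 9998.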
A PDF document is attached as an example.
The output looks like this; the incorrect text is marked in red:
PPAATIENTTIENT
DISEASE Lung cancer (NOS)
NAME
DATE OF BIRTH
SEX Male
MEDICAL RECORD # Not given
PHYPHYSICIANSICIAN
ORDERING PHYSICIAN
MEDICAL FACILITY
ADDITIONAL RECIPIENT None
MEDICAL FACILITY ID
PATHOLOGIST Not Provided
SPESPECIMENCIMEN
SPECIMEN ID
SPECIMEN TYPE Blood
DATE OF COLLECTION
SPECIMEN RECEIVED
MEDIAN EXON COVERAGE
Biomarker Findings
MSI SMSI Statatus Undettus Undetermined.ermined.
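In the sample above, each affected token looks like the intended word with consecutive fragments emitted twice (for example, PPAATIENTTIENT reads as P, P, A, A, TIENT, TIENT). Assuming that pattern holds, a rough heuristic for flagging such tokens could look like the sketch below. The function name `undouble` is hypothetical, and the check can produce false positives on ordinary words that happen to fit the pattern (such as "mama"), so it is only a diagnostic aid.

```python
from functools import lru_cache
from typing import Optional


def undouble(text: str) -> Optional[str]:
    """Return w = c1 c2 ... ck if text == c1 c1 c2 c2 ... ck ck, otherwise None."""
    n = len(text)
    if n == 0 or n % 2:
        return None  # a fully doubled string always has even, non-zero length

    @lru_cache(maxsize=None)
    def solve(i: int) -> Optional[str]:
        if i == n:
            return ""
        # Try the longest candidate fragment first, then shorter ones.
        for size in range((n - i) // 2, 0, -1):
            chunk = text[i:i + size]
            if text[i + size:i + 2 * size] == chunk:
                rest = solve(i + 2 * size)
                if rest is not None:
                    return chunk + rest
        return None

    return solve(0)


print(undouble("PPAATIENTTIENT"))      # PATIENT
print(undouble("PHYPHYSICIANSICIAN"))  # PHYSICIAN
print(undouble("SPESPECIMENCIMEN"))    # SPECIMEN
print(undouble("NAME"))                # None
```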