[TIKA-3307] extracted text strings have repeated characters - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Bug
Affects Version/s: None
Fix Version/s: None
Component/s: parser
Labels: None

Description

Extracted text from some PDF files includes some strings with repeated (doubled) characters.

To reproduce the problem, download the attached PDF file and run the following command:

java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'

The bad strings all seem to be headings, so perhaps something in the font or another style feature is causing the problem.

First detected in version 1.19 and reproduced with 1.25. Earlier versions were not tested.
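The egrep filter in the reproduction command flags any line that contains two doubled characters in a row. The following is a minimal Python sketch of the same check, useful for triaging extracted text without a shell. The function names `suspicious_lines` and `is_fully_doubled` are hypothetical helpers, and the sample strings are made up for illustration; they are not taken from the attached PDF. Note that the pattern also matches ordinary words such as "bookkeeper", so the second helper narrows the result to tokens in which every character is doubled.

```python
import re

# Same pattern as the egrep command: two consecutive doubled characters.
DOUBLED_PAIR = re.compile(r"(.)\1(.)\2")


def suspicious_lines(text):
    """Return the lines that contain two consecutive doubled characters."""
    return [line for line in text.splitlines() if DOUBLED_PAIR.search(line)]


def is_fully_doubled(token):
    """True if the token consists only of adjacent identical pairs, e.g. 'TTeexxtt'."""
    return (
        len(token) >= 4
        and len(token) % 2 == 0
        and all(token[i] == token[i + 1] for i in range(0, len(token), 2))
    )


# Illustrative input only (assumption: not real output from Tika).
sample = "IInnttrroodduuccttiioonn\nNormal body text\nbookkeeper"

for line in suspicious_lines(sample):
    label = "doubled" if is_fully_doubled(line) else "probably fine"
    print(f"{line}: {label}")
```

In practice the input would be the text produced by `java -jar ./tika-app-1.25.jar -T <file>`, read from a file or from standard input.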

Attachments

WSHP-PRC025F-EN_07132019.pdf, 12.86 MB, 01/Mar/21 17:42, Paul Tyson

People

Assignee: Unassigned
Reporter: Paul Tyson
Votes: 0
Watchers: 2

Dates

Created: 01/Mar/21 17:47
Updated: 11/Mar/21 18:26
Resolved: 11/Mar/21 18:26