[TIKA-4277] PDF parse issue for text rotated - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.0.0-BETA, 2.9.2
Fix Version/s: None
Component/s: tika-app, tika-server
Labels:
- config.xml

Flags:

Important

Description

Tika 2.9.2 and 3.0.0-BETA, in both the server and the standalone app, parse the attached PDF incorrectly.

If the text is rotated 90 degrees, the parsed result has a line break after each character of a word. This happens with symbols, English letters, and CJK characters.

In the server version: curl -g -T "sample2.pdf" http://localhost:889/tika --header "Accept: text/plain"
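For anyone who wants to script the server reproduction, the request can be sketched in Python with only the standard library. This is a rough equivalent of the curl call, not a tested client: the URL and port are copied verbatim from the command in this report, and the function names `build_tika_request` and `extract_text` are made up for illustration.

```python
import urllib.request

# URL copied from the curl command in this report; adjust it to your own server.
TIKA_URL = "http://localhost:889/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request (what `curl -T` sends over HTTP) asking for plain text."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Upload the PDF to the server and return the extracted text (assumes UTF-8)."""
    with open(pdf_path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("sample2.pdf")` against a running server should return the same text that the curl command prints.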

In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" --text

Both commands produce the same incorrect result for the attached PDF.

The output is shown below:

i
n
s
e
r
t

t
e
x
t

p
r
o
b
l
e
m

insert text problem
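To check other documents for the same symptom, a small heuristic can flag extracted text in which most non-blank lines hold a single character. This is only a sketch for triage and is not part of Tika: the function names and the 0.8 threshold are assumptions chosen for illustration, and genuinely vertical text would also trigger it.

```python
def single_char_line_ratio(text: str) -> float:
    """Return the fraction of non-blank lines that hold exactly one character."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(len(line) == 1 for line in lines) / len(lines)


def looks_split_per_character(text: str, threshold: float = 0.8) -> bool:
    """Flag output that resembles the one-character-per-line result shown above.

    The threshold is a hypothetical cutoff, not a value taken from Tika.
    """
    return single_char_line_ratio(text) >= threshold
```

Feeding it the output above (17 single-character lines plus one normal line) gives a ratio of about 0.94, so the output is flagged. Correctly extracted text such as "insert text problem" is not.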

Attachments

- sample2.pdf, 11/Jul/24 11:46, 11 kB, ragebear
- OtherPDFReader.png, 11/Jul/24 11:51, 326 kB, ragebear

Issue Links

duplicates: TIKA-2779 Integrate/parameterize new rotated text handling in PDFBox (Resolved)

People

Assignee: Unassigned

Reporter: ragebear

Votes: 0

Watchers: 2

Dates

Created: 11/Jul/24 11:48

Updated: 12/Jul/24 07:11

Resolved: 12/Jul/24 03:45