Details
- Bug
- Status: Closed
- Major
- Resolution: Duplicate
- 3.0.0-BETA, 2.9.2
- None
- Important
Description
Incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0.0-BETA
The attached PDF cannot be parsed correctly by Tika 2.9.2 or 3.0.0-BETA, in either the server version or the standalone version.
If the text is rotated 90 degrees, the parsed result has a line break after each letter of a word. This happens with symbols, English letters, and CJK characters.
In the server version:
curl -g -T "sample2.pdf" http://localhost:889/tika --header "Accept: text/plain"
In the standalone version:
java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" --text
Both of the above produce the incorrect result for the attached PDF.
The output is below:
i
n
s
e
r
t
t
e
x
t
p
r
o
b
l
e
m
insert text problem
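The symptom in the output above (most lines holding a single character) can be detected mechanically, which may help when checking many extracted files. The following is only a minimal sketch: the function names and the 0.5 threshold are hypothetical, not part of Tika, and it assumes the extracted text is already available as a string.

```python
def single_char_line_ratio(text: str) -> float:
    """Return the fraction of non-empty lines that hold exactly one character."""
    non_empty = [line.strip() for line in text.splitlines() if line.strip()]
    if not non_empty:
        return 0.0
    single = sum(1 for line in non_empty if len(line) == 1)
    return single / len(non_empty)


def looks_like_per_letter_breaks(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (assumed threshold): flag text where most lines are one character."""
    return single_char_line_ratio(text) >= threshold


if __name__ == "__main__":
    # Rebuild the output reported in this issue: one letter per line,
    # followed by the one line that was extracted normally.
    reported = "\n".join("inserttextproblem") + "\ninsert text problem\n"
    print(looks_like_per_letter_breaks(reported))
```

A ratio rather than an absolute count is used so that short and long documents are treated alike; the threshold would need tuning on real output.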
Attachments
Issue Links
- duplicates
- TIKA-2779 Integrate/parameterize new rotated text handling in PDFBox
- Resolved