Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 1.7.0
- None
- None
- Windows 7, VS 2010 C#, Tika Library
Description
Vertical textboxes in PDF files are not extracted correctly (using the Tika library in C#).
For example, if a PDF file contains a vertical textbox "hello" (WITHOUT line breaks):
H
E
L
L
O
the parser returns 5 strings, each containing a single letter, even though there is NO line break after each letter.
Is there an option to avoid this problem?
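As a possible stopgap until the extraction itself handles vertical text, the single-letter strings could be re-joined after parsing. The following is only a minimal sketch under stated assumptions: `VerticalTextMerger` and `mergeSingleCharLines` are hypothetical names, not part of Tika or PDFBox, and the heuristic assumes that a run of consecutive one-character lines belongs to one vertical textbox. It would also merge lines that are legitimately a single character long, so it may not suit every document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VerticalTextMerger {

    // Hypothetical post-processing helper (not a Tika or PDFBox API).
    // Joins each run of consecutive one-character lines into one string.
    static List<String> mergeSingleCharLines(List<String> lines) {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            // Count code points so a surrogate pair counts as one character.
            if (trimmed.codePointCount(0, trimmed.length()) == 1) {
                run.append(trimmed);
            } else {
                flush(run, result);
                result.add(line);
            }
        }
        flush(run, result);
        return result;
    }

    // Emits the pending run, if any, and resets the buffer.
    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text: one string per returned line.
        List<String> extracted =
                Arrays.asList("Title", "H", "E", "L", "L", "O", "Next paragraph");
        System.out.println(mergeSingleCharLines(extracted));
        // prints [Title, HELLO, Next paragraph]
    }
}
```

This only works on the already extracted strings; it does not use glyph positions or text direction from the PDF, which a proper fix in the parser would presumably rely on.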
Attachments
Issue Links
- is duplicated by: PDFBOX-2879 Wrong vertical text extraction for apache PDFBox 2.0.0 (Closed)
- relates to: PDFBOX-2272 Can't extract vertical text correctly (Open)