[PDFBOX-800] Wrong text extract from vertical textboxes in pdf files - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.7.0
Fix Version/s: None
Component/s: Text extraction
Labels: None
Environment: Windows 7, VS 2010 C#, Tika Library

Description

Vertical text boxes in PDF files are not extracted correctly (using the Tika library from C#).
For example, if a PDF file contains a vertical text box "hello" (WITHOUT line breaks):
H
E
L
L
O
the parser returns 5 strings, each containing a single letter, even though there is NO line break after each letter.
Is there an option to avoid this problem?
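
As a possible stopgap until the extractor handles vertical text itself, the returned strings could be post-processed. The following is only a minimal sketch, assuming the extracted text is available as a list of lines; the class `VerticalTextJoiner` and the method `joinSingleCharRuns` are hypothetical names for illustration and are not part of the PDFBox or Tika API. It merges runs of consecutive single-character lines into one string, which also means it would wrongly merge lines that are genuinely one character long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tika.
public class VerticalTextJoiner {

    /** Merges each run of consecutive single-character lines into one line. */
    public static List<String> joinSingleCharRuns(List<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.length() == 1) {
                // Likely one glyph of a vertical text box: collect it.
                run.append(trimmed);
            } else {
                // A normal (or empty) line ends the current run.
                if (run.length() > 0) {
                    out.add(run.toString());
                    run.setLength(0);
                }
                out.add(line);
            }
        }
        // Flush a run that reaches the end of the input.
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> extracted =
                Arrays.asList("Title", "H", "E", "L", "L", "O", "Some body text");
        // prints [Title, HELLO, Some body text]
        System.out.println(joinSingleCharRuns(extracted));
    }
}
```

This heuristic works on the extracted strings only; it does not use glyph positions, so it cannot tell a vertical text box apart from a real column of single letters.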

Attachments

problemdoc.doc, 24/Aug/10 08:17, 26 kB, Sandor Dj
problemdoc.pdf, 24/Aug/10 08:17, 128 kB, Sandor Dj

Issue Links

is duplicated by: PDFBOX-2879 Wrong vertical text extraction for apache PDFBox 2.0.0 (Closed)

relates to: PDFBOX-2272 Can't extract vertical text correctly (Open)

People

Assignee: Unassigned

Reporter: Sandor Dj

Votes: 0

Watchers: 1

Dates

Created: 24/Aug/10 06:39

Updated: 13/Jul/15 16:38