[PDFBOX-4313] PDFTextStripper groups unrelated chunks into words - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.11
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

I have the texts "10" and "11", and they get merged into the single word "1110".

Coordinates are:

1 575.36 x 227.4 w 4.447998 h 5.736
1 579.752 x 227.4 w 4.447998 h 5.736
1 526.2 x 227.4 w 4.447998 h 5.736
0 530.59204 x 227.4 w 4.447998 h 5.736
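
To make the numbers above easier to read, here is a minimal sketch that computes the horizontal gap between consecutive chunks. It assumes each row means "character, x, y, width, height"; that reading, and the class and method names, are illustrative assumptions and not part of PDFBox.

```java
public class ChunkGaps {
    // Values copied from the report; reading each row as
    // "character, x, y, width, height" is an assumption.
    static final String[] CHARS = {"1", "1", "1", "0"};
    static final double[] X = {575.36, 579.752, 526.2, 530.59204};
    static final double[] WIDTH = {4.447998, 4.447998, 4.447998, 4.447998};

    // Gap between the right edge of chunk i and the left edge of chunk i + 1.
    static double gapAfter(int i) {
        return X[i + 1] - (X[i] + WIDTH[i]);
    }

    public static void main(String[] args) {
        for (int i = 0; i < X.length - 1; i++) {
            System.out.printf("\"%s\" -> \"%s\": gap %.3f%n",
                    CHARS[i], CHARS[i + 1], gapAfter(i));
        }
    }
}
```

Under that reading, the first and last gaps are close to zero (the chunks touch), while the middle gap is about -58 units: the third chunk starts far to the left of the second one.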

The bug is in this PDFTextStripper chunk:

{{
// test if our TextPosition starts after a new word would be expected to start
if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
        && expectedStartOfNextWordX < positionX
        // only bother adding a space if the last character was not a space
        && lastPosition.getTextPosition().getUnicode() != null
        && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
{
    line.add(LineItem.getWordSeparator());
}

}}

This seems to add a word separator only if the next character starts "after" the current word. It never expects that the next character might start "before" the current word.
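
The difference between a forward-only check and one that also looks for backward jumps can be sketched as follows. This is a simplified stand-in, not the PDFTextStripper implementation: the `Glyph` class, the `join` method, and the half-width tolerance are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class WordSeparatorSketch {
    // Hypothetical stand-in for a text position: one character and its horizontal extent.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float endX() {
            return x + width;
        }
    }

    // The four chunks from the report, in the order they appear in the file.
    static List<Glyph> issueGlyphs() {
        return Arrays.asList(
                new Glyph("1", 575.36f, 4.447998f),
                new Glyph("1", 579.752f, 4.447998f),
                new Glyph("1", 526.2f, 4.447998f),
                new Glyph("0", 530.59204f, 4.447998f));
    }

    // Joins glyphs in input order. A space is inserted when the next glyph starts
    // clearly after the previous one ends, or (only if detectBackwardJump is true)
    // when it ends clearly before the previous one starts.
    static String join(List<Glyph> glyphs, boolean detectBackwardJump) {
        StringBuilder out = new StringBuilder();
        Glyph last = null;
        for (Glyph g : glyphs) {
            if (last != null) {
                float tolerance = last.width * 0.5f; // assumed threshold, not PDFBox's
                boolean after = g.x > last.endX() + tolerance;
                boolean before = detectBackwardJump && g.endX() < last.x - tolerance;
                if (after || before) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            last = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(issueGlyphs(), false)); // forward-only check
        System.out.println(join(issueGlyphs(), true));  // also detects backward jumps
    }
}
```

With the forward-only check the four chunks come out as "1110", matching the reported behaviour; with the backward check enabled they come out as "11 10". The sketch only separates the words and does not reorder them by position.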

I guess this could also be framed as an RTL problem, but the PDF is a plain PDF; it just seems that Oracle Reports generates these chunks in reverse order.

Attachments

- 1536938716546.pdf (14/Sep/18 15:25, 1 kB, Emilian Bold)
- crop-fisa-sintetica.png (08/Sep/18 09:35, 17 kB, Emilian Bold)
- details.pdf (22/Aug/19 14:27, 18 kB, Paul Slootweg)
- PDFBOX-4313.pdf (11/Sep/18 16:04, 0.8 kB, Tilman Hausherr)
- PDFBOX-4313-Test_sorted.txt (23/Sep/18 09:59, 0.3 kB, Andreas Lehmkühler)
- PDFBOX-4313-Test_unsorted.txt (23/Sep/18 09:59, 0.3 kB, Andreas Lehmkühler)
- PDFBOX4313Test.java (21/Sep/18 17:21, 7 kB, Emilian Bold)
- PDFBOX4313Test.java (14/Sep/18 15:27, 7 kB, Emilian Bold)
- PDFBOX-4313-Test.pdf (23/Sep/18 09:58, 1 kB, Andreas Lehmkühler)
- pdfbox-words.png (21/Sep/18 17:19, 54 kB, Emilian Bold)

People

Assignee: Andreas Lehmkühler

Reporter: Emilian Bold

Votes: 0

Watchers: 4

Dates

Created: 06/Sep/18 09:15

Updated: 22/Aug/19 14:28