[PDFBOX-3804] Detect end of paragraphs - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.0.6, 2.0.7, 3.0.0 PDFBox
Fix Version/s: None
Component/s: Text extraction
Labels:
- extraction
- paragraph

Flags:

Important

Description

Hi,

Extracting text paragraph by paragraph is probably the improvement PDFBox users look forward to most.

The current text extraction approach correctly detects the end of each line, but it does not correctly detect the end of each paragraph.

What is a paragraph? A paragraph is a piece of text that contains one or more sentences. It may start with a tabulation, but this is not mandatory. A paragraph spans one or more lines but contains no carriage return except the one at its very end. A paragraph can end before the end of its last line, but some paragraphs end exactly at the end of the line. If a paragraph ends exactly at the end of the line, no further lines containing words follow it.

So the last line of a paragraph ends before reaching the end of the line, unless no further lines containing words follow it. Do you follow me? An algorithm could use that pattern to detect paragraphs properly.

In my opinion, the algorithm should use the following information:
- the width of the block containing the paragraph;
- the precomputed width of the first word on the next line.

The width of a block is the width of the area containing the line that holds the character the algorithm is evaluating at a given step.

The algorithm runs over every character. When it reaches the last character of a line, it precomputes the first word of the next line to obtain its width.
If this word would fit on the previous line after the last character, the algorithm concludes that the paragraph has ended (case 1).
If there is no next word, this is also the end of the paragraph (case 2).
If there is a tabulation before the next word, this is again the end of the paragraph (case 3).
If the last character is far from the end of the block, we automatically conclude that the paragraph has ended (case 4, optional).
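
The four cases above could be sketched roughly as follows. This is only an illustration under stated assumptions, not PDFBox code: the `Line` model, the `spaceWidth` parameter (room for one space before the next word) and the `farGap` threshold for case 4 are hypothetical names introduced here, and all measurements are assumed to share one unit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types come from PDFBox. */
final class ParagraphEndSketch {

    /** Minimal line model (an assumption); all values use the same unit. */
    static final class Line {
        final double endX;            // x coordinate just after the last character
        final double firstWordWidth;  // width of the first word on this line
        final boolean startsWithTab;  // true if the line begins with a tabulation

        Line(double endX, double firstWordWidth, boolean startsWithTab) {
            this.endX = endX;
            this.firstWordWidth = firstWordWidth;
            this.startsWithTab = startsWithTab;
        }
    }

    /**
     * Decides whether current is the last line of a paragraph.
     * blockRight is the right edge of the block; a negative farGap
     * disables the optional case 4.
     */
    static boolean endsParagraph(Line current, Line next, double blockRight,
                                 double spaceWidth, double farGap) {
        if (next == null) {
            return true;                               // case 2: no next word
        }
        double remaining = blockRight - current.endX;  // free room on this line
        if (spaceWidth + next.firstWordWidth <= remaining) {
            return true;                               // case 1: next word would have fit
        }
        if (next.startsWithTab) {
            return true;                               // case 3: tabulation before next word
        }
        return farGap >= 0 && remaining > farGap;      // case 4 (optional)
    }

    /** Returns the indices of the lines that end a paragraph. */
    static List<Integer> paragraphEnds(List<Line> lines, double blockRight,
                                       double spaceWidth, double farGap) {
        List<Integer> ends = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Line next = (i + 1 < lines.size()) ? lines.get(i + 1) : null;
            if (endsParagraph(lines.get(i), next, blockRight, spaceWidth, farGap)) {
                ends.add(i);
            }
        }
        return ends;
    }
}
```

In this sketch, case 1 already covers most situations that case 4 would catch, which matches the description of case 4 as optional. Case 4 only adds something when the first word of the next line is wider than the remaining room.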

Cheers,
A.

Attachments

- example.pdf, 22/May/17 14:48, 465 kB, Alexandre
- PDFBOX-3804-115spaced.pdf, 01/Jun/17 18:03, 94 kB, Tilman Hausherr
- PDFBOX-3804-singlespaced.pdf, 01/Jun/17 18:03, 94 kB, Tilman Hausherr
- PDFBOX-3804-noimage.pdf, 01/Jun/17 18:03, 55 kB, Tilman Hausherr

People

Assignee: Unassigned

Reporter: Alexandre

Votes: 0

Watchers: 2

Dates

Created: 22/May/17 14:48

Updated: 15/Jun/17 15:27

Resolved: 15/Jun/17 15:27