Extracting text by paragraph is probably the improvement PDFBox users are most looking forward to.
The current text extraction approach correctly detects the end of lines, but it does not correctly detect the end of paragraphs.
What is a paragraph? A paragraph is a piece of text that contains one or several sentences. It can start with a tabulation, but this is not mandatory. A paragraph spans one or more lines but contains no carriage return (except the one at its very end). A paragraph can end before the very end of a line, but some paragraphs end exactly at the end of a line. If a paragraph ends exactly at the end of a line, no further lines containing words follow it.
So the last line of a paragraph ends before reaching the very end of the line, unless no further lines containing words follow it. Do you follow me?
In my opinion, the algorithm should use the following information:
the width of the block containing the paragraph;
the precomputed width of the first word of the next line.
The block width refers to the width of the area that contains the line holding the character the algorithm is evaluating at any given step.
The algorithm runs over every character, and when it reaches the end of a line, it precomputes the first word of the next line to obtain its width.
If that first word fits in the previous line after its last word, then the algorithm concludes that the previous line ends a paragraph (case 1).
If there is no next line, then this is also the end of the paragraph (case 2).
If there is a tabulation before the first word of the next line, the previous line also ends a paragraph (case 3).
If the end of the line is far from the end of the block, we can directly conclude that the paragraph has ended (case 4, optional).
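The four cases above can be sketched in Java as follows. This is only an illustration under stated assumptions, not PDFBox code: the `Line` and `Block` classes, their fields, and the thresholds `indentThreshold` and `farRatio` are hypothetical names introduced here. It also assumes that the width compared in case 1 is that of the first word of the next line, and that a tabulation can be approximated by an indentation of the next line relative to the left edge of the block.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphSketch {

    /** Hypothetical description of one extracted line (not a PDFBox type). */
    static final class Line {
        final String text;
        final float startX;          // left edge of the first word
        final float endX;            // right edge of the last word
        final float firstWordWidth;  // precomputed width of the first word

        Line(String text, float startX, float endX, float firstWordWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /** Hypothetical block geometry plus the thresholds assumed by this sketch. */
    static final class Block {
        final float left;
        final float right;
        final float spaceWidth;       // width of one space between two words
        final float indentThreshold;  // minimum indent treated as a tabulation
        final float farRatio;         // share of the block width considered "far"

        Block(float left, float right, float spaceWidth,
              float indentThreshold, float farRatio) {
            this.left = left;
            this.right = right;
            this.spaceWidth = spaceWidth;
            this.indentThreshold = indentThreshold;
            this.farRatio = farRatio;
        }

        float width() {
            return right - left;
        }
    }

    /** Returns true when {@code current} is the last line of a paragraph. */
    static boolean endsParagraph(Line current, Line next, Block block) {
        if (next == null) {
            return true;                                   // case 2: no next line
        }
        float remaining = block.right - current.endX;      // free space after the last word
        if (block.spaceWidth + next.firstWordWidth <= remaining) {
            return true;                                   // case 1: the next word would have fit
        }
        if (next.startX - block.left >= block.indentThreshold) {
            return true;                                   // case 3: next line is indented
        }
        return remaining > block.farRatio * block.width(); // case 4 (optional)
    }

    /** Joins consecutive lines into paragraphs using the four cases. */
    static List<String> groupIntoParagraphs(List<Line> lines, Block block) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line line = lines.get(i);
            Line next = (i + 1 < lines.size()) ? lines.get(i + 1) : null;
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            if (endsParagraph(line, next, block)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        return paragraphs;
    }
}
```

For a block 100 units wide, a line that ends at x = 98 followed by a line whose first word is 20 units wide stays in the same paragraph, because the word could not have fit in the 2 remaining units. A line that ends at x = 60 followed by a first word 30 units wide closes the paragraph, because that word would have fit. Case 4 is largely covered by case 1 in this sketch, which matches its optional status in the description.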