• Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.6, 2.0.7, 3.0.0 PDFBox
    • Fix Version/s: None
    • Component/s: Text extraction
    • Flags:



      Extracting text by paragraphs is probably the improvement most eagerly awaited by PDFBox users.

      The current text extraction approach correctly detects the ends of lines, but it does not correctly detect the ends of paragraphs.

      What is a paragraph? A paragraph is a piece of text that contains one or several sentences. It can start with a tab (an indentation), but this is not mandatory. A paragraph spans one or more lines but contains no carriage return, except the one at its very end. A paragraph can end before the very end of a line, but some paragraphs end exactly at the end of a line. If a paragraph ends exactly at the end of a line, no further lines containing words follow it.

      So the last line of a paragraph ends before reaching the very end of the line, unless no further lines containing words follow it. Do you follow me? An algorithm could use that pattern to detect paragraphs properly.

      In my opinion, the algorithm should use the following information:
      the width of the block containing the paragraph;
      the precomputed width of the first word on the next line.

      The width of a block refers to the width of the area containing the line that holds the character the algorithm is evaluating at any given step.

      The algorithm runs on every character and, when it reaches the last character of a line, precomputes the first word of the next line to obtain its width.
      If this word would fit on the previous line after the last character, then the algorithm concludes that the paragraph has ended (case 1).
      If there is no next word, then this is also the end of the paragraph (case 2).
      If there is a tab before the next word, then this is also the end of the paragraph (case 3).
      If the last character is far from the end of the block, we automatically conclude that the paragraph has ended (case 4, optional).
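
      The four cases above can be sketched in plain Java as follows. This is only an illustration of the idea, not PDFBox code: the names ParagraphSketch, Line, endsParagraph and groupParagraphs are hypothetical, and the sketch assumes that the line boundaries, the x coordinates and the width of each line's first word have already been obtained from the text extraction. The parameters spaceWidth, indentTolerance and farFraction are also assumptions (tuning values that the description above does not specify). Passing farFraction = 0 disables the optional case 4.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the paragraph-end heuristic; not part of PDFBox. */
final class ParagraphSketch {

    /** One extracted line; all values use the same (assumed) page units. */
    static final class Line {
        final String text;
        final float startX;         // x of the first character
        final float endX;           // x just after the last character
        final float firstWordWidth; // precomputed width of the line's first word

        Line(String text, float startX, float endX, float firstWordWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /** Returns true if 'current' is the last line of a paragraph. */
    static boolean endsParagraph(Line current, Line next,
                                 float blockLeft, float blockRight,
                                 float spaceWidth, float indentTolerance,
                                 float farFraction) {
        // Case 2: there is no next word at all.
        if (next == null) {
            return true;
        }
        float remaining = blockRight - current.endX;
        // Case 1: the first word of the next line (plus a space) would have
        // fit after the last character, so the break was deliberate.
        if (spaceWidth + next.firstWordWidth <= remaining) {
            return true;
        }
        // Case 3: the next line starts with an indentation (a "tab").
        if (next.startX - blockLeft > indentTolerance) {
            return true;
        }
        // Case 4 (optional): the last character is far from the block's end.
        float blockWidth = blockRight - blockLeft;
        return farFraction > 0 && remaining > farFraction * blockWidth;
    }

    /** Joins consecutive lines into paragraphs using endsParagraph. */
    static List<String> groupParagraphs(List<Line> lines,
                                        float blockLeft, float blockRight,
                                        float spaceWidth, float indentTolerance,
                                        float farFraction) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line current = lines.get(i);
            Line next = i + 1 < lines.size() ? lines.get(i + 1) : null;
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(current.text);
            if (endsParagraph(current, next, blockLeft, blockRight,
                              spaceWidth, indentTolerance, farFraction)) {
                paragraphs.add(sb.toString());
                sb.setLength(0);
            }
        }
        return paragraphs;
    }
}
```

      For example, with a block spanning x = 0 to 100, a line ending at x = 50 followed by a line whose first word is 20 units wide would be treated as a paragraph end (case 1), while a line ending at x = 98 would not. Note that a paragraph whose last line happens to fill the block completely and is followed by an unindented line is not detected by any of the four cases.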



        1. PDFBOX-3804-singlespaced.pdf
          94 kB
          Tilman Hausherr
        2. PDFBOX-3804-noimage.pdf
          55 kB
          Tilman Hausherr
        3. PDFBOX-3804-115spaced.pdf
          94 kB
          Tilman Hausherr
        4. example.pdf
          465 kB



            • Assignee:
              arelaxend Alexandre
            • Votes:
              0
            • Watchers:
              2


              • Created: