PDFBox / PDFBOX-5002

PDFTextStripper sometimes fuses two words on different lines

    Details

      Description

      This happens when text in a large font is followed by at least two lines of text in a smaller font: the last word of the first line is merged with the first word of the second line.

      With the attached PDF, the extracted text is:

      (...) some text awith smaller font (...)

      instead of:

      (...) some text with a smaller font (...)

      I often encounter this kind of problem on invoices, where the company address (small text at the top right) is next to the company name & logo (big centered text at the top).
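
      A minimal sketch of how the symptom above could be detected in extracted text. It uses only the JDK: the extraction step itself (for example, the string returned by PDFTextStripper) is assumed to have happened already. The class name FusedWordCheck and the method missingWords are hypothetical names chosen for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: reports expected words that
// no longer appear as standalone tokens in the extracted text.
public class FusedWordCheck {

    /**
     * Returns the expected words that do not occur as whitespace-delimited
     * tokens in the extracted text, in the order they were given.
     */
    public static List<String> missingWords(String extracted, List<String> expectedWords) {
        // Split on any run of whitespace, so line breaks count as separators too.
        Set<String> tokens = new HashSet<>(Arrays.asList(extracted.trim().split("\\s+")));
        List<String> missing = new ArrayList<>();
        for (String word : expectedWords) {
            if (!tokens.contains(word)) {
                missing.add(word);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("some", "text", "with", "a", "smaller", "font");

        // Text as reported in this issue (fused words).
        System.out.println(missingWords("some text awith smaller font", expected));   // prints [with, a]

        // Text the reporter expected.
        System.out.println(missingWords("some text with a smaller font", expected));  // prints []
    }
}
```

      A check like this only flags that words went missing as separate tokens; it does not say why they were fused.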

        Attachments

        1. artikel1_20_arab.pdf (1.55 MB, Tilman Hausherr)
        2. PDFBOX-3248-spaces.pdf (200 kB, Tilman Hausherr)
        3. PDFBOX-3062-005021.pdf (65 kB, Tilman Hausherr)
        4. PDFBOX-756-p1.pdf (50 kB, Tilman Hausherr)
        5. PDFBOX-4550-pdnekz1gvl7.pdf (74 kB, Tilman Hausherr)
        6. 001991.pdf (17 kB, Tilman Hausherr)
        7. small&Big.pdf (0.9 kB, Thierry Guérin)

        People

        • Assignee: Tilman Hausherr (tilman)
        • Reporter: Thierry Guérin (tguerin)
        • Votes: 0
        • Watchers: 3