PDFBox / PDFBOX-5002

PDFTextStripper sometimes fuses two words on different lines


Details

    Description

      This happens when text in a large font is followed by at least two lines of text in a smaller font: the last word of the first line is merged with the first word of the second line.

      On the attached PDF, the extracted text is:

      (...) some text awith smaller font (...)

      instead of:

      (...) some text with a smaller font (...)
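
      The difference between the two strings above can be checked mechanically, which may help when comparing extraction output against known text on many documents. The following is only a minimal sketch: `FusedWordCheck` and `findFusedWords` are hypothetical names, not part of PDFBox, and the helper assumes the expected text is already known. It reports any extracted token that is not an expected word but is exactly two expected words joined without a space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FusedWordCheck {

    // Hypothetical helper (not a PDFBox API): returns the tokens of `extracted`
    // that are absent from `expected` but can be split into two expected words.
    public static List<String> findFusedWords(String expected, String extracted) {
        Set<String> known = new HashSet<>(Arrays.asList(expected.trim().split("\\s+")));
        List<String> fused = new ArrayList<>();
        for (String token : extracted.trim().split("\\s+")) {
            if (known.contains(token)) {
                continue; // token is a regular expected word
            }
            // Try every split point; a match on both halves means two words were merged.
            for (int i = 1; i < token.length(); i++) {
                if (known.contains(token.substring(0, i)) && known.contains(token.substring(i))) {
                    fused.add(token);
                    break;
                }
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        String expected = "some text with a smaller font";
        String extracted = "some text awith smaller font";
        System.out.println(findFusedWords(expected, extracted)); // prints [awith]
    }
}
```

      The check only looks at the extracted string, so it says nothing about why the words were merged; it merely flags the symptom described here.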
      

      I often encounter this kind of problem on invoices, where the company address (small text at the top right) is next to the company name & logo (big centered text at the top).

       

      Attachments

        1. small&Big.pdf
          0.9 kB
          Thierry Guérin
        2. PDFBOX-756-p1.pdf
          50 kB
          Tilman Hausherr
        3. PDFBOX-4550-pdnekz1gvl7.pdf
          74 kB
          Tilman Hausherr
        4. PDFBOX-3248-spaces.pdf
          200 kB
          Tilman Hausherr
        5. PDFBOX-3062-005021.pdf
          65 kB
          Tilman Hausherr
        6. artikel1_20_arab.pdf
          1.55 MB
          Tilman Hausherr
        7. 001991.pdf
          17 kB
          Tilman Hausherr

            People

              tilman Tilman Hausherr
              tguerin Thierry Guérin