PDFBox / PDFBOX-588

Problem extracting text in newline characters

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator, 1.3.1, 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: Text extraction
    • Labels: None
    • Environment: Win XP

    Description

      Hello,

      I have a single-page PDF file. When I try to extract its text using:
      String pageData = stripper.getText( pdfFile );

      it ignores some of the line breaks, so the last word of one line and the first word of the next line run together as a single word with no space between them.

      However, if I copy the text manually from the PDF and paste it into a text editor, line breaks appear after the very lines that cause the problem in PDFBox.

      Please check the attached file as a sample.

      Is there a way to fix this?

      Best regards,
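
      To illustrate the symptom described above, here is a minimal, self-contained sketch of how a text extractor might decide between a space and a line break when joining words. It is not PDFBox code and not the fix that was applied for this issue: the class LineBreakSketch, the Fragment type, the join method and the tolerance parameter are all hypothetical names chosen for this example. The assumption is that each word carries the y coordinate of its baseline, and that a jump in that coordinate larger than the tolerance means a new line has started.

```java
import java.util.Arrays;
import java.util.List;

public class LineBreakSketch {

    /** A word plus the y coordinate of its baseline (hypothetical type, not a PDFBox class). */
    public static final class Fragment {
        final String text;
        final double baselineY;

        public Fragment(String text, double baselineY) {
            this.text = text;
            this.baselineY = baselineY;
        }
    }

    /**
     * Joins words given in reading order. Words whose baselines differ by at most
     * the tolerance are treated as being on the same line and separated by a space;
     * a larger difference is treated as a line change and produces a newline.
     */
    public static String join(List<Fragment> fragments, double tolerance) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : fragments) {
            if (previous != null) {
                boolean sameLine =
                        Math.abs(current.baselineY - previous.baselineY) <= tolerance;
                // Always emit a separator, so words never fuse across a line change.
                out.append(sameLine ? ' ' : '\n');
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: three words on one line, two on the line below.
        List<Fragment> words = Arrays.asList(
                new Fragment("end", 700.0),
                new Fragment("of", 700.0),
                new Fragment("line", 700.0),
                new Fragment("next", 686.0),
                new Fragment("line", 686.0));
        System.out.println(join(words, 2.0));
    }
}
```

      If the line-change check were skipped or its tolerance set too large, the words on either side of a line change would be joined differently from what a manual copy and paste shows, which is the kind of mismatch this report describes.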

      Attachments

        1. PDFBOX588-Enters-sample.txt
          3 kB
          Andreas Lehmkühler
        2. PDFBOX588-Enters-sample1.png
          785 kB
          Andreas Lehmkühler
        3. PDFTextStripper.patch
          2 kB
          Villu Ruusmann
        4. Enters-sample.pdf
          141 kB
          Hesham

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Hesham (hesham)
            Votes: 0
            Watchers: 2
