  1. PDFBox
  2. PDFBOX-3800

Part of the text is missing when extracting text from a PDF with PDFTextStripper

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.6, 2.0.7
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Mac OS X under Eclipse

    Description

      Hi,

      I am fairly new to PDFBox, but I have spent some time trying to figure out how to solve the following issue.

      Text extraction fails for part of the attached PDF. As you can see, the PDF contains the text from "Mapping Twitter topic networks: ... " up to "... hub and spokes", but the result of PDFTextStripper getText() does not contain any of these characters.
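
      A minimal sketch of how the missing passage can be checked for, assuming the extracted text is already available as a String (for example from PDFTextStripper getText()). The class and method names here are hypothetical helpers, not part of PDFBox. The comparison ignores case and line breaks, since extracted text may be wrapped differently from what the viewer shows.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: checks whether a phrase that is
// visible in the PDF viewer also appears in the extracted text.
final class ExtractionCheck {

    // Collapse every run of whitespace (including line breaks) into a single
    // space, trim the ends, and lowercase the result.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // True if the normalized phrase occurs in the normalized extracted text.
    static boolean containsPhrase(String extracted, String phrase) {
        return normalize(extracted).contains(normalize(phrase));
    }
}
```

      With the attached file, ExtractionCheck.containsPhrase(text, "Mapping Twitter topic networks") and the same call with "hub and spokes" would both be expected to return false, matching the behaviour described above.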

      I checked and the community has already fixed similar bugs in the past.

      Any help would be appreciated.

      Cheers,
      A.

      Attachments

        1. Smith.pdf
          5.57 MB
          Alexandre
        2. PDFDebugger-screenshot.png
          139 kB
          Tilman Hausherr

          People

            Assignee: Unassigned
            Reporter: Alexandre (arelaxend)
            Votes: 0
            Watchers: 2
