  1. PDFBox
  2. PDFBOX-3800

Part of the text is missing when extracting text from a PDF with PDFTextStripper.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.6, 2.0.7
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Mac OS X, running under Eclipse

      Description

      Hi,

      I am quite unfamiliar with PDFBox. Still, I spent some time trying to figure out how to solve the following issue.

      Text extraction from the attached PDF loses part of the content. As you can see, the PDF contains the text from "Mapping Twitter topic networks: ... " to "... hub and spokes", but the result of PDFTextStripper getText() does not contain any of these characters.
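      The check described above can be sketched in plain Java. This is only an illustration, not code from the report: MissingTextCheck, normalize and missingPhrases are hypothetical names, and the PDFBox 2.0.x calls that would supply the extracted string appear only in a comment, assuming a local copy of the attachment named Smith.pdf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox. With PDFBox 2.0.x the input string
// would typically come from something like:
//
//   try (PDDocument doc = PDDocument.load(new File("Smith.pdf"))) {
//       String text = new PDFTextStripper().getText(doc);
//   }
//
// (the file name is an assumption based on the attachment list).
final class MissingTextCheck {

    // Collapse runs of whitespace so that line breaks in the extracted text
    // do not hide an otherwise matching phrase.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Return the expected phrases that do not occur in the extracted text.
    static List<String> missingPhrases(String extracted, List<String> expected) {
        String haystack = normalize(extracted);
        List<String> missing = new ArrayList<>();
        for (String phrase : expected) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }
}
```

      Passing the phrases "Mapping Twitter topic networks" and "hub and spokes" as the expected list would make the symptom reproducible: both would be reported as missing if getText() really returns none of that passage.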

      I checked and the community has already fixed similar bugs in the past.

      Any help would be appreciated.

      Cheers,
      A.

        Attachments

        1. PDFDebugger-screenshot.png
          139 kB
          Tilman Hausherr
        2. Smith.pdf
          5.57 MB
          Alexandre

            People

            • Assignee:
              Unassigned
              Reporter:
              arelaxend Alexandre
            • Votes:
              0
              Watchers:
              2