PDFBox / PDFBOX-2138

Corrupted words when using PDFTextStripper

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.5, 1.8.6, 2.0.0
    • Fix Version/s: 4.0.0
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows 7 / 64 bit

    Description

      I am using PDFTextStripper (embedded in another application) to extract
      the raw text of PDFs. So far the results have been good, but recently a
      PDF file appeared for which the PDFTextStripper output was corrupted. I
      got sentences like:

      "There is al o con ern that b nkers may be pushed to misprice risk
      (No. 6) by the pres ures of c mpetition and an abunda ce of central b
      nk-provided liquidity."

      Additionally, some portions of text appear twice in the output: first
      correctly and then corrupted. I have attached an output created with
      PDFBox's command line options. If you compare lines 357-365 with lines
      421-429, you can see that it is the same paragraph, first intact and
      then with characters missing. In the original source this paragraph is
      unique. The same seems to happen in the other instances where text is
      corrupted.

      I also tried it directly on the command line with the same results: input and output files are attached.
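
      The "first correct, then corrupted" pattern described above can be
      searched for mechanically in an extracted text file. The following is
      only a minimal sketch of such a check, not part of PDFBox: the class
      DroppedCharCheck and its methods are hypothetical names, and the
      "intact" sentence in main is an assumed reconstruction of the corrupted
      sample quoted in this report. It treats a line as a damaged copy when,
      ignoring whitespace, it is a strictly shorter subsequence of another
      line.

```java
// Hypothetical diagnostic helper (not a PDFBox API): flags a line that
// looks like a copy of another line with some characters dropped.
class DroppedCharCheck {

    // Removes all whitespace so that gaps left by missing glyphs are ignored.
    static String squeeze(String s) {
        return s.replaceAll("\\s+", "");
    }

    // Returns true if 'candidate' (whitespace ignored) is a strictly shorter
    // subsequence of 'original' (whitespace ignored).
    static boolean isDroppedCharCopy(String original, String candidate) {
        String o = squeeze(original);
        String c = squeeze(candidate);
        if (c.isEmpty() || c.length() >= o.length()) {
            return false;
        }
        int j = 0; // index of the next candidate character still to be matched
        for (int i = 0; i < o.length() && j < c.length(); i++) {
            if (o.charAt(i) == c.charAt(j)) {
                j++;
            }
        }
        return j == c.length();
    }

    public static void main(String[] args) {
        // Assumed intact wording; only the corrupted form appears in the report.
        String intact = "There is also concern that bankers may be pushed to misprice risk";
        String damaged = "There is al o con ern that b nkers may be pushed to misprice risk";
        System.out.println(isDroppedCharCopy(intact, damaged));
    }
}
```

      A subsequence test is deliberately loose: very short lines can match
      longer ones by coincidence, so in practice it would probably need a
      minimum length or a cap on the share of dropped characters before
      being run over every pair of lines in the attached output.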

      Attachments

        1. banking-banana-skins-2014.pdf
          1.53 MB
          Walter Kehl
        2. banking-banana-skins-2014.txt
          219 kB
          Walter Kehl
        3. PDFBOX-2138.pdf
          384 kB
          Tilman Hausherr
        4. PDFBOX-2138.txt
          10 kB
          Tilman Hausherr
        5. PDFBOX-2138-noClip.pdf
          329 kB
          Michael Klink
        6. PDFBOX-2138-noClip.png
          64 kB
          Michael Klink

            People

              Assignee: Unassigned
              Reporter: Walter Kehl
              Votes: 0
              Watchers: 5