  1. PDFBox
  2. PDFBOX-2048

TextExtraction only working after uncompressing with pdftk

Details

    Description

      From Jonas Karlsson on the user list:
      ===
      We have a user with PDFs generated by a commercial transcription service.
      When we try to extract text from these PDFs, PDFBox returns only a few empty
      lines. We get this result both from our own code and when using the
      ExtractText command line tool.

      If I specify the non-sequential parser, with the -nonSeq flag, the
      following error is produced:

      Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
      SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream

      If I uncompress the file with pdftk, PDFBox successfully extracts the
      text.
      ===

      I have been given permission to attach the file as "committers only", so please don't pass it around and avoid quoting details from it. The file also fails to render. The lengths of the streams are 0.
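
A stream whose length is given as 0 while data still sits between `stream` and `endstream` would be consistent with the validateStreamLength warning quoted above. The following is a minimal sketch for checking that on a file, using only the Java standard library; it is not PDFBox code. The names `StreamLengthCheck`, `Result` and `check` are hypothetical, and the scan makes simplifying assumptions: only a direct integer `/Length` is read (indirect references are reported as -1), the `stream` keyword must be followed by LF or CRLF, one end-of-line before `endstream` is treated as a separator, and binary data that happens to contain the text `endstream` is not handled.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: compares the declared /Length of
// each stream with the number of bytes found between "stream" and "endstream".
final class StreamLengthCheck {

    /** Declared and measured length of one stream; declared is -1 if it is not a direct integer. */
    static final class Result {
        final long declared;
        final int actual;

        Result(long declared, int actual) {
            this.declared = declared;
            this.actual = actual;
        }

        boolean isMismatch() {
            return declared >= 0 && declared != actual;
        }
    }

    // Direct integer only; the possessive \d++ plus the lookahead rejects "/Length 12 0 R".
    private static final Pattern LENGTH = Pattern.compile("/Length\\s+(\\d++)(?!\\s+\\d+\\s+R)");
    // Assumption: the stream keyword is followed by LF or CRLF.
    private static final Pattern STREAM_KEYWORD = Pattern.compile("stream\\r?\\n");

    static List<Result> check(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices are byte offsets.
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Result> results = new ArrayList<>();
        Matcher keyword = STREAM_KEYWORD.matcher(s);
        Matcher length = LENGTH.matcher(s);
        int pos = 0;
        while (keyword.find(pos)) {
            int dataStart = keyword.end();
            int dataEnd = s.indexOf("endstream", dataStart);
            if (dataEnd < 0) {
                break; // unterminated stream, stop scanning
            }
            int next = dataEnd + "endstream".length();
            // Treat one end-of-line before "endstream" as a separator, not as data.
            if (dataEnd > dataStart && s.charAt(dataEnd - 1) == '\n') {
                dataEnd--;
            }
            if (dataEnd > dataStart && s.charAt(dataEnd - 1) == '\r') {
                dataEnd--;
            }
            // Use the last direct /Length seen between the previous stream and this keyword.
            long declared = -1;
            length.region(pos, keyword.start());
            while (length.find()) {
                declared = Long.parseLong(length.group(1));
            }
            results.add(new Result(declared, dataEnd - dataStart));
            pos = next;
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        List<Result> results = check(Files.readAllBytes(Paths.get(args[0])));
        for (int i = 0; i < results.size(); i++) {
            Result r = results.get(i);
            System.out.println("stream " + i + ": declared=" + r.declared
                    + " actual=" + r.actual + (r.isMismatch() ? "  <-- mismatch" : ""));
        }
    }
}
```

Run against a local copy of a suspect file (path passed as the first argument), it prints one line per stream and marks those whose declared length differs from the measured one. Because the measurement is approximate, a difference of one or two bytes may only reflect end-of-line handling; a declared length of 0 next to a clearly non-empty stream is the case of interest here.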

          People

            tilman Tilman Hausherr