PDFBOX-269: ExtractText errors

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels: None

    Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1706491
      Originally submitted by wrwessel on 2007-04-24 04:31.

      I wrote a batch file to convert over 500 PowerPoint files to PDF (using DocumentConverter.py and OpenOffice); the batch file then uses ExtractText.exe to extract the text. Most of the files converted fine, but for 4 files ExtractText could not extract any text and threw various error messages. I have attached one of them as a sample. I am using version 0.7.4 from 19/5/07 and see the same problem with the 0.7.3 release. It is easy enough for me to convert the last 4 by hand, but this might be a bug you can fix.

      Many thanks for the ExtractText program; it saved me a long time converting these by hand.

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1706491&file_id=226382
      Sample.zip (application/x-zip-compressed), 216343 bytes

      [comment on SourceForge]
      Originally sent by benlitchfield.

      I've looked at the attached PDF. Technically, I believe the root issue is that OpenOffice is not writing the PDF correctly. I have reported the issue to the OpenOffice project; it can be tracked at http://www.openoffice.org/issues/show_bug.cgi?id=76879

      The issue is that the PDF is sometimes missing the 'endstream' keyword, which PDFBox looks for to determine where a stream ends.

      My rule of thumb is that if Acrobat can open it, then so should PDFBox, so this is still a 'bug' in PDFBox. Fixing this is possible but not straightforward, so it may be a while before a fix is complete.
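
      The following is a minimal sketch of one possible fallback for a missing 'endstream' keyword, not the approach taken in the attached diff or in PDFBox itself. It assumes the parser can treat 'endobj' as a stand-in stream terminator when 'endstream' is absent. The class StreamEndFinder and the method findStreamEnd are hypothetical names introduced only for illustration. A plain keyword scan like this can also be fooled by binary stream data that happens to contain these bytes, so a real parser would normally consult the stream's /Length entry first.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the PDFBox API.
final class StreamEndFinder {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    private StreamEndFinder() {}

    /**
     * Returns the offset of the first "endstream" or "endobj" keyword found
     * at or after start, or -1 if neither keyword occurs. Assumption: when
     * "endstream" is missing, "endobj" marks the end of the stream data.
     */
    static int findStreamEnd(byte[] data, int start) {
        for (int i = Math.max(start, 0); i < data.length; i++) {
            if (matchesAt(data, i, ENDSTREAM) || matchesAt(data, i, ENDOBJ)) {
                return i;
            }
        }
        return -1;
    }

    // True if keyword appears in data starting exactly at offset.
    private static boolean matchesAt(byte[] data, int offset, byte[] keyword) {
        if (offset + keyword.length > data.length) {
            return false;
        }
        for (int k = 0; k < keyword.length; k++) {
            if (data[offset + k] != keyword[k]) {
                return false;
            }
        }
        return true;
    }
}
```

      For a well-formed object the scan stops at 'endstream'; for an object written without it, the scan stops at 'endobj' instead, so the stream contents are still bounded and text extraction can proceed.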

      Attachments

        1. endstream_missing_fix.diff (6 kB, Justin LeFebvre)

          People

            Assignee: Unassigned
            Reporter: Jukka Zitting (jukkaz)