PDFBox / PDFBOX-456

PDFTextStripperByArea never finds any text (pageNo check in PDFTextStripper always returns false)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels: None
    • Environment: 0.8.0-incubator as well as a checkout from SVN (rev#767932).
      Not affected: latest sf.net release (0.7.3)

    Description

      PDFTextStripperByArea does not return any text from pages.

      The cause is a check on the first line of PDFTextStripper#processPage() that compares currentPageNo (initially 0) against startPage (initially 1). PDFTextStripperByArea sets neither startPage nor currentPageNo, so the comparison always evaluates to false and no text is extracted.

      A possible fix is to add the following calls in PDFTextStripperByArea#extractRegions, immediately before the call to processPage():
      setStartPage(0);
      setEndPage(0);

      Since I'm not very familiar with PDFBox internals, this might be more of a hack than a solid fix.

      The issue was introduced in revision 1.70 of PDFTextStripper (old SF.net CVS), which removed the currentPage++ that used to run just before the check in processPage().
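
      To make the failure mode concrete, here is a minimal, self-contained sketch of the page-range check. It is a simplified model, not PDFBox source: the class PageCheckModel and its methods are hypothetical, and the exact form of the condition and the default endPage value are assumptions. Only the initial values of currentPageNo (0) and startPage (1) come from the report.

```java
// Simplified, hypothetical model of the page-range check; not PDFBox source.
class PageCheckModel {
    // Initial values as described in the report.
    private int currentPageNo = 0;
    private int startPage = 1;
    // Assumed default: no upper bound on the page range.
    private int endPage = Integer.MAX_VALUE;

    public void setStartPage(int startPage) { this.startPage = startPage; }
    public void setEndPage(int endPage) { this.endPage = endPage; }

    // Assumed form of the check at the top of processPage().
    public boolean wouldProcessPage() {
        return currentPageNo >= startPage && currentPageNo <= endPage;
    }

    // Models the pre-1.70 behaviour: the counter is incremented
    // just before the check is evaluated.
    public boolean wouldProcessPageWithIncrement() {
        currentPageNo++;
        return wouldProcessPage();
    }

    public static void main(String[] args) {
        // Defaults only: 0 >= 1 fails, so the page is skipped.
        PageCheckModel defaults = new PageCheckModel();
        System.out.println(defaults.wouldProcessPage()); // false

        // Proposed workaround: 0 >= 0 && 0 <= 0 passes.
        PageCheckModel patched = new PageCheckModel();
        patched.setStartPage(0);
        patched.setEndPage(0);
        System.out.println(patched.wouldProcessPage()); // true

        // Pre-1.70 behaviour: the increment makes 1 >= 1 pass.
        PageCheckModel legacy = new PageCheckModel();
        System.out.println(legacy.wouldProcessPageWithIncrement()); // true
    }
}
```

      Under these assumptions, the workaround passes only because the range is pinned to the never-incremented counter value, which may be why it feels more like a hack than a fix.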


          People

            Assignee: Unassigned
            Reporter: Hannes Erven (hannibal218bc)