  1. PDFBox
  2. PDFBOX-456

PDFTextStripperByArea never finds any text (pageNo check in PDFTextStripper always returns false)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels: None
    • Environment: 0.8.0-incubator as well as a checkout from SVN (rev#767932).
      Not affected: latest sf.net release (0.7.3)

    Description

      PDFTextStripperByArea does not return any text from pages.

      This is caused by a check on the first line of PDFTextStripper#processPage() that compares currentPageNo (initially 0) against startPage (initially 1). Since PDFTextStripperByArea sets neither startPage nor the current page number, the comparison always evaluates to false and no text is extracted.
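
The failing comparison can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: `PageCheckModel` is a hypothetical class, not the PDFBox source, the exact form of the condition is assumed from the description above, and the default for `endPage` is a guess.

```java
// Hypothetical, simplified model of the page-range check described in this report.
// It is NOT the actual PDFTextStripper code; only the initial values of
// currentPageNo (0) and startPage (1) are taken from the report.
class PageCheckModel {
    private int currentPageNo = 0;            // initial value per the report
    private int startPage = 1;                // initial value per the report
    private int endPage = Integer.MAX_VALUE;  // assumed default, not confirmed by the report

    /** Returns the page text if the page passes the range check, otherwise an empty string. */
    String processPage(String pageText) {
        // Assumed shape of the check: the page is processed only when it lies in [startPage, endPage].
        if (currentPageNo >= startPage && currentPageNo <= endPage) {
            return pageText;
        }
        return "";
    }

    public static void main(String[] args) {
        PageCheckModel stripper = new PageCheckModel();
        // Nothing adjusts currentPageNo or startPage, so 0 >= 1 is false and the text is dropped.
        System.out.println("extracted: '" + stripper.processPage("Hello") + "'");
    }
}
```

With the initial values unchanged, the model returns an empty string for every page, which matches the behaviour reported for PDFTextStripperByArea.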

      A possible fix is to include the following code in PDFTextStripperByArea#extractRegions right before the call to processPage():
      setStartPage(0);
      setEndPage(0);

      Since I'm not very familiar with the inner workings of PDFBox, this might be more of a hack than a solid fix.
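
The effect of the proposed workaround can be sketched with the same kind of stand-alone model. Again, this is an illustration under assumptions: `AreaStripperModel` and the `applyWorkaround` flag are hypothetical, and only the two setter calls mirror the fix suggested above.

```java
// Hypothetical model of the workaround; it does not use or reproduce PDFBox itself.
class AreaStripperModel {
    private int currentPageNo = 0;            // initial value per the report
    private int startPage = 1;                // initial value per the report
    private int endPage = Integer.MAX_VALUE;  // assumed default

    void setStartPage(int page) { startPage = page; }
    void setEndPage(int page) { endPage = page; }

    /** Stand-in for extractRegions(): optionally applies the workaround, then processes the page. */
    String extractRegions(String pageText, boolean applyWorkaround) {
        if (applyWorkaround) {
            // The two calls proposed in the report, made right before processPage().
            setStartPage(0);
            setEndPage(0);
        }
        return processPage(pageText);
    }

    private String processPage(String pageText) {
        // Assumed shape of the range check.
        if (currentPageNo >= startPage && currentPageNo <= endPage) {
            return pageText;
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println("without workaround: '" + new AreaStripperModel().extractRegions("Hello", false) + "'");
        System.out.println("with workaround:    '" + new AreaStripperModel().extractRegions("Hello", true) + "'");
    }
}
```

In this model, setting both bounds to 0 makes the range contain the untouched currentPageNo, so the check passes. Whether that is the right fix for the real class, rather than advancing the page counter, is left open by the report.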

      The issue was introduced in revision 1.70 of PDFTextStripper (old SF.net CVS), where the currentPage++ that used to sit just before the check in processPage() was removed.

      Attachments

        1. PDFBOX-456-patch-he.diff
          0.6 kB
          Hannes Erven

          People

            Assignee: Unassigned
            Reporter: Hannes Erven (hannibal218bc)