PDFBox / PDFBOX-1093

Copy Page from one Document to another: Page Content Stream Linked to Original Document

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.8.2
    • Component/s: Text extraction
    • Labels: None

    Description

      When a page is taken from one document and added to another (via addPage or importPage), the content stream of the page retains the scratchFile and unFiltered/FilteredStreams from its original document. As a result, the page always remains connected to its original document and never becomes wholly a part of its new document.

      Why this is a problem:
      - When searching for text within a large (800,000-page) PDF file, performance can potentially be improved by splitting the file into single pages for incremental text extraction, so that each page is searched individually instead of searching the entire document at once. To achieve this, a new document is created for each page and that single page from the original PDF is added to it.

      - When these single-page documents are searched, the scratchFile of the original PDF is still used, and it grows as the text from each page is extracted. This eventually leads to an out-of-memory condition, which appears as a "SEVERE Stop reading corrupt stream" exception from doDecode() when the write buffer attempts to expand beyond the maximum heap size.

      A workaround is to create a new document, add the page to it, save the document, close it, and then load it again. Unfortunately, the performance cost of this workaround is prohibitive.
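
      The sharing described above can be illustrated with a small toy model. This is only a sketch and does not use PDFBox: ToyDocument, ToyPage, extractText and detachInto are hypothetical names invented for the illustration, and an in-memory buffer stands in for the scratchFile. It assumes that decoding a page appends bytes to the scratch storage of the document the page was originally bound to, which is the behaviour reported here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ScratchSharingSketch {

    /** Hypothetical stand-in for a document that owns one scratch buffer. */
    static class ToyDocument {
        final ByteArrayOutputStream scratch = new ByteArrayOutputStream();
        final List<ToyPage> pages = new ArrayList<>();

        /** Adds the page object as-is, so it keeps pointing at its original owner. */
        void addPage(ToyPage page) {
            pages.add(page);
        }
    }

    /** Hypothetical stand-in for a page whose content stream is bound to a document. */
    static class ToyPage {
        final byte[] encoded;
        final ToyDocument owner; // the document whose scratch buffer decoding writes to

        ToyPage(String text, ToyDocument owner) {
            this.encoded = text.getBytes(StandardCharsets.UTF_8);
            this.owner = owner;
        }

        /** "Decodes" the page: the decoded bytes are appended to the owner's scratch buffer. */
        String extractText() {
            owner.scratch.write(encoded, 0, encoded.length);
            return new String(encoded, StandardCharsets.UTF_8);
        }

        /** Models the save/close/reload workaround: a fresh copy bound to the target document. */
        ToyPage detachInto(ToyDocument target) {
            return new ToyPage(new String(encoded, StandardCharsets.UTF_8), target);
        }
    }

    public static void main(String[] args) {
        ToyDocument original = new ToyDocument();
        for (int i = 0; i < 3; i++) {
            original.addPage(new ToyPage("page " + i, original));
        }

        // Reported behaviour: each single-page document still writes to the original's scratch.
        for (ToyPage page : original.pages) {
            ToyDocument single = new ToyDocument();
            single.addPage(page);
            single.pages.get(0).extractText();
        }
        System.out.println("original scratch after shared extraction: " + original.scratch.size());

        // Detached copies: each extraction grows only the short-lived single-page document.
        for (ToyPage page : original.pages) {
            ToyDocument single = new ToyDocument();
            single.addPage(page.detachInto(single));
            single.pages.get(0).extractText();
        }
        System.out.println("original scratch after detached extraction: " + original.scratch.size());
    }
}
```

      In this model the original document's buffer grows only during the first loop. The second loop corresponds to the workaround: the copy costs extra work per page, but the growth stays in a single-page document that can be discarded after each extraction.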

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: eddie.greene@gmail.com (elioenai)
              Votes: 0
              Watchers: 2