Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 1.2.0
    • Component/s: Text extraction, Utilities
    • Labels:
      None

      Description

      Scientific publishers often publish older articles (from the year 2000 and earlier) in scanned form. Some of them, however, appear to have run OCR on the scans and added the recovered text as an overlay, giving the end user a "native PDF" feel in the sense that text can be copied and pasted.

      PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part and the textual overlay part, which may produce confusing results.

      There are actually two separate cases, each with its own desired behavior:
      *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the image part and ignore the text part.
      *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the image part and work upon the text part.
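
      The two cases above can be illustrated with a minimal, self-contained Java sketch. It does not use the PDFBox API: OverlayPageSketch, Kind, Item, render, extractText and samplePage are hypothetical names introduced only for illustration, and a page is modeled as a flat list of image and text items. It assumes the desired behavior is the one listed for each case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a scanned page that carries an OCR text overlay. */
public class OverlayPageSketch {

    /** The two kinds of content that appear on such a page. */
    public enum Kind { IMAGE, TEXT }

    /** Simplified stand-in for one piece of page content (not a PDFBox type). */
    public static final class Item {
        final Kind kind;
        final String payload; // image name or recovered text run

        public Item(Kind kind, String payload) {
            this.kind = kind;
            this.payload = payload;
        }
    }

    /** Case 1, page rendering: keep the image part and ignore the text part. */
    public static List<String> render(List<Item> page) {
        List<String> drawn = new ArrayList<String>();
        for (Item item : page) {
            if (item.kind == Kind.IMAGE) {
                drawn.add(item.payload);
            }
        }
        return drawn;
    }

    /** Case 2, text extraction: ignore the image part and join the text runs. */
    public static String extractText(List<Item> page) {
        StringBuilder text = new StringBuilder();
        for (Item item : page) {
            if (item.kind == Kind.TEXT) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(item.payload);
            }
        }
        return text.toString();
    }

    /** Made-up sample page: one scanned image plus two overlay text runs. */
    public static List<Item> samplePage() {
        List<Item> page = new ArrayList<Item>();
        page.add(new Item(Kind.IMAGE, "scanned-page-image"));
        page.add(new Item(Kind.TEXT, "Recovered"));
        page.add(new Item(Kind.TEXT, "text"));
        return page;
    }

    public static void main(String[] args) {
        List<Item> page = samplePage();
        System.out.println(render(page));
        System.out.println(extractText(page));
    }
}
```

      In this sketch both consumers walk the same page content and differ only in which kind of item they skip, which mirrors the split between PageDrawer and PDFTextStripper described above.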

        Attachments

        1. pg_0005.png
          417 kB
          Villu Ruusmann
        2. pg_0005.pdf
          56 kB
          Villu Ruusmann
        3. PDFBOX582-pg_00051.png
          138 kB
          Andreas Lehmkühler
        4. PageDrawer.patch
          3 kB
          Maruan Sahyoun

              People

              • Assignee:
                Unassigned
                Reporter:
                vfed Villu Ruusmann