  1. PDFBox
  2. PDFBOX-620

Text extract fails on some PDF files but not others...

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 0.7.3, 0.8.0-incubator
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Tried in Java 5 and 6

      Description

      I have the same problem with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 the extracted text contains the literal string "null" in place of some characters, e.g. "Dermoapo made 'interactive updates' a key part onullits stratenull nullr launnull chinnulla new skincare rannull in a competitive market. nulle resultnullIncreased sales nullr pharmacies that used the updates." In 0.8.0 the same passage appears with "?" in those positions: "Dermoapo made 'interactive updates' a key part o?its strate? ?r laun? chin?a new skincare ran? in a competitive market. ?e result?Increased sales ?r pharmacies that used the updates."

      Maybe this is a font problem, or an encoding problem? I debugged the code in PDFTextStripper, and these characters already appear in the charactersByArticle field, even before normalization.
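
      One way the 0.7.3 symptom could arise, offered only as a guess and not as a reading of the PDFBox source: if a character code has no entry in the font's code-to-Unicode table, the lookup yields a null reference, and appending a null String to a StringBuilder produces the four letters "null". The sketch below uses a hypothetical lookup table (NullGlyphDemo, decodeNaive and decodeWithPlaceholder are illustrative names, not PDFBox APIs) to reproduce both the "null" output and the "?" output.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a font's code-to-Unicode table; this is not PDFBox code.
class NullGlyphDemo {

    // Joins decoded characters without checking for missing mappings.
    // An unmapped code decodes to null, and StringBuilder.append((String) null)
    // appends the literal text "null".
    static String decodeNaive(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String s = toUnicode.get(code); // null when the code has no mapping
            out.append(s);
        }
        return out.toString();
    }

    // Same loop, but an unmapped code becomes a visible placeholder character.
    static String decodeWithPlaceholder(int[] codes, Map<Integer, String> toUnicode,
                                        char placeholder) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String s = toUnicode.get(code);
            out.append(s != null ? s : String.valueOf(placeholder));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> toUnicode = new HashMap<Integer, String>();
        toUnicode.put(1, "o");
        toUnicode.put(3, "i");
        toUnicode.put(4, "t");
        toUnicode.put(5, "s");
        int[] codes = {1, 2, 3, 4, 5}; // code 2 is deliberately left unmapped

        System.out.println(decodeNaive(codes, toUnicode));
        System.out.println(decodeWithPlaceholder(codes, toUnicode, '?'));
    }
}
```

      If this is what happens, the gaps would mark glyphs that the failing file's font cannot map to Unicode, which would be consistent with the values already being wrong in charactersByArticle before normalization.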

      In 0.8.0 I get some info logs from the engine:

      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: re
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: W
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: n
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: cs
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: scn
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: f
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: CS
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: SCN
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: M
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: m
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: l
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: S
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: c
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: v
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: y
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: h
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: g
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: G
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC
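
      The operators named in these messages are, per the PDF specification, path construction and painting (re, m, l, c, v, y, h, W, n, f, S), line style (M), colour (cs, scn, CS, SCN, g, G) and marked content (BDC, EMC) operators. None of them is a text-showing operator (Tj, TJ, ', "), so the messages may well be unrelated to the missing characters. The sketch below is a small helper, with hypothetical names, that pulls the operator names out of log lines in the format shown above and checks whether any text-showing operator was skipped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative helper for reading the log output; not part of PDFBox.
class SkippedOperators {

    // Text-showing operators defined by the PDF specification.
    static final Set<String> TEXT_SHOWING =
            new HashSet<String>(Arrays.asList("Tj", "TJ", "'", "\""));

    // Marker text taken from the log lines quoted in this report.
    static final String MARKER = "unsupported/disabled operation: ";

    // Collects the distinct operator names, in first-seen order.
    static Set<String> skippedOperators(String[] logLines) {
        Set<String> ops = new LinkedHashSet<String>();
        for (String line : logLines) {
            int at = line.indexOf(MARKER);
            if (at >= 0) {
                ops.add(line.substring(at + MARKER.length()).trim());
            }
        }
        return ops;
    }

    // True if any skipped operator is one that actually draws text.
    static boolean anyTextShowing(Set<String> ops) {
        for (String op : ops) {
            if (TEXT_SHOWING.contains(op)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] lines = {
            "SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: re",
            "SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC",
            "SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC"
        };
        Set<String> ops = skippedOperators(lines);
        System.out.println(ops + " text-showing skipped: " + anyTextShowing(ops));
    }
}
```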

      I got the same result with both icu4j 3.6.1 and 4.2.1.

        Attachments

        1. pdf620-works.pdf
          604 kB
          Nicholas Cottrell
        2. pdf620-fails.pdf
          569 kB
          Nicholas Cottrell

              People

              • Assignee: Unassigned
              • Reporter: Nicholas Cottrell (niccottrell)
              • Votes: 0
              • Watchers: 4
