PDFBox / PDFBOX-620

Text extract fails on some PDF files but not others...

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 0.7.3, 0.8.0-incubator
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Tried in Java 5 and 6

      Description

      I see the same problem with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 the extracted text contains the literal string "null" in place of some characters, e.g. "Dermoapo made 'interactive updates' a key part onullits stratenull nullr launnull chinnulla new skincare rannull in a competitive market. nulle resultnullIncreased sales nullr pharmacies that used the updates."

      In 0.8.0 the same positions come out as "?": "Dermoapo made 'interactive updates' a key part o?its strate? ?r laun? chin?a new skincare ran? in a competitive market. ?e result?Increased sales ?r pharmacies that used the updates."

      Maybe this is a font or encoding problem? I debugged the code in PDFTextStripper, and these values already appear in the charactersByArticle field, even before normalization.
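
      To illustrate how output of this shape could arise, here is a minimal, self-contained sketch. It is not PDFBox code: the class GlyphDecodeSketch, the character codes, and the code-to-Unicode table are all invented for illustration, and it only assumes that some character codes in the failing file have no Unicode mapping. In Java, appending a null String produces the text "null", which would match the 0.7.3 output, while substituting a placeholder would match the "?" seen in 0.8.0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox code: the codes and the mapping table are made up.
class GlyphDecodeSketch {

    // Looks up each code and appends the result directly. Map.get returns
    // null for an unmapped code, and StringBuilder.append(String) renders
    // a null reference as the four characters "null".
    static String decodeNaive(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.get(code));
        }
        return sb.toString();
    }

    // Same lookup, but an unmapped code is replaced with "?".
    static String decodeWithFallback(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String text = toUnicode.get(code);
            sb.append(text != null ? text : "?");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> toUnicode = new HashMap<Integer, String>();
        toUnicode.put(1, "o");
        toUnicode.put(3, "i");
        toUnicode.put(4, "t");
        toUnicode.put(5, "s");

        int[] codes = {1, 2, 3, 4, 5}; // code 2 deliberately has no mapping

        System.out.println(decodeNaive(codes, toUnicode));        // onullits
        System.out.println(decodeWithFallback(codes, toUnicode)); // o?its
    }
}
```

      If this is what is happening, the fix would belong in the font/encoding lookup rather than in PDFTextStripper, which would be consistent with the bad values already being present in charactersByArticle.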

      In 0.8.0 I get some info logs from the engine:

      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: re
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: W
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: n
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: cs
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: scn
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: f
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: CS
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: SCN
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: M
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: m
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: l
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: S
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: c
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: v
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: y
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: h
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: g
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: G
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC

      I got the same error with icu4j 3.6.1 and 4.2.1.

      1. pdf620-fails.pdf
        569 kB
        Nicholas Cottrell
      2. pdf620-works.pdf
        604 kB
        Nicholas Cottrell

    People

    • Assignee: Unassigned
    • Reporter: Nicholas Cottrell
    • Votes: 0
    • Watchers: 4