PDFBox / PDFBOX-620

Text extract fails on some PDF files but not others...

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 0.7.3, 0.8.0-incubator
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Tried in Java 5 and 6

    Description

      I have the same problem with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 I get text with nulls, e.g. "Dermoapo made 'interactive updates' a key part onullits stratenull nullr launnull chinnulla new skincare rannull in a competitive market. nulle resultnullIncreased sales nullr pharmacies that used the updates." In 0.8.0 the same passage appears as "Dermoapo made 'interactive updates' a key part o?its strate? ?r laun? chin?a new skincare ran? in a competitive market. ?e result?Increased sales ?r pharmacies that used the updates."

      Maybe this is a font problem, or an encoding problem? I debugged the code in PDFTextStripper, and these characters already appear in the charactersByArticle field before normalization.
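
One possible mechanism for the two symptoms, sketched in plain Java without PDFBox: if a font's code-to-text lookup returns null for an unmapped character code and the result is appended to a buffer unchecked, Java writes the literal text "null"; substituting a placeholder gives "?" instead. The class name, the lookup table and the character codes below are hypothetical and are not taken from the PDFBox source. Whether this is what actually happens inside PDFBox is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class NullGlyphDemo {
    // Hypothetical stand-in for a font's code-to-text table (not PDFBox's real structure).
    static final Map<Integer, String> CODE_TO_TEXT = new HashMap<>();
    static {
        CODE_TO_TEXT.put(1, "o");
        CODE_TO_TEXT.put(2, "i");
        CODE_TO_TEXT.put(3, "t");
        CODE_TO_TEXT.put(4, "s");
        // Code 9 is deliberately left unmapped to model a missing encoding entry.
    }

    // Appends the lookup result unchecked: a null String is appended as "null".
    static String decodeNaive(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(CODE_TO_TEXT.get(code));
        }
        return sb.toString();
    }

    // Replaces an unmapped code with a "?" placeholder instead.
    static String decodeWithPlaceholder(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String text = CODE_TO_TEXT.get(code);
            sb.append(text != null ? text : "?");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 9, 2, 3, 4};
        System.out.println(decodeNaive(codes));           // prints "onullits"
        System.out.println(decodeWithPlaceholder(codes)); // prints "o?its"
    }
}
```

If this is the cause, the characters are lost at the font or encoding lookup, which would be consistent with them already being wrong in charactersByArticle before normalization.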

      In 0.8.0 I get some info logs from the engine:

      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: re
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: W
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: n
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: cs
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: scn
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: f
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: CS
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: SCN
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: M
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: m
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: l
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: S
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: c
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: v
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: y
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: h
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: g
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: G
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC
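
The operators in these log lines are path, clipping, colour, graphics-state and marked-content operators in the PDF specification. None of them is a text-showing operator (Tj, TJ, ' and "), so the messages are probably unrelated to the missing characters. A minimal sketch of that check, assuming the log format shown above; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LoggedOperatorCheck {
    // Text-showing operators defined by the PDF specification.
    static final Set<String> TEXT_SHOWING =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));
    static final String MARKER = "unsupported/disabled operation: ";

    // Extracts the operator name that follows the marker on each matching log line.
    static List<String> loggedOperators(List<String> logLines) {
        List<String> ops = new ArrayList<>();
        for (String line : logLines) {
            int at = line.indexOf(MARKER);
            if (at >= 0) {
                ops.add(line.substring(at + MARKER.length()).trim());
            }
        }
        return ops;
    }

    // True if any skipped operator is one that would have drawn text.
    static boolean skipsTextShowing(List<String> ops) {
        for (String op : ops) {
            if (TEXT_SHOWING.contains(op)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: re",
                "SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: W",
                "SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC",
                "SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC");
        List<String> ops = loggedOperators(log);
        // prints "[re, W, BDC, EMC] -> text-showing skipped: false"
        System.out.println(ops + " -> text-showing skipped: " + skipsTextShowing(ops));
    }
}
```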

      I got the same result with icu4j 3.6.1 and 4.2.1.

      Attachments

        1. pdf620-fails.pdf
          569 kB
          Nicholas Cottrell
        2. pdf620-works.pdf
          604 kB
          Nicholas Cottrell

            People

              Assignee: Unassigned
              Reporter: Nicholas Cottrell (niccottrell)
              Votes: 0
              Watchers: 4
