PDFBox / PDFBOX-620

Text extract fails on some PDF files but not others...

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 0.7.3, 0.8.0-incubator
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Tried in Java 5 and 6

      Description

      I am seeing the same problem with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 the extracted text contains literal nulls, e.g. "Dermoapo made 'interactive updates' a key part onullits stratenull nullr launnull chinnulla new skincare rannull in a competitive market. nulle resultnullIncreased sales nullr pharmacies that used the updates." In 0.8.0 the same passage comes out as "Dermoapo made 'interactive updates' a key part o?its strate? ?r laun?
      chin?a new skincare ran? in a competitive market. ?e result?Increased
      sales ?r pharmacies that used the updates."

      Maybe this is a font problem? Or encoding? I debugged the code in PDFTextStripper, and the bad characters already appear in the charactersByArticle field, even before normalization.

      In 0.8.0 I get some info logs from the engine:

      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: re
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: W
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: n
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: cs
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: scn
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: f
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: CS
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: SCN
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: M
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: m
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: l
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: S
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: c
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: v
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: y
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: h
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: g
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: G
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC
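
      For context, the operators named in the log lines above are, per the PDF specification, path, clipping, color, graphics-state and marked-content operators; none of them shows text (that is the job of operators such as Tj and TJ). So these messages are probably unrelated to the garbled characters. The sketch below only groups the logged operators by category; the class name, method name and category labels are made up for illustration and are not part of PDFBox.

```java
import java.util.HashMap;
import java.util.Map;

public class OperatorCategories {
    // Operator -> category, following the operator tables in the PDF specification.
    // Only the operators that appear in the log excerpt are listed.
    static final Map<String, String> CATEGORY = new HashMap<>();

    static void put(String category, String... operators) {
        for (String op : operators) {
            CATEGORY.put(op, category);
        }
    }

    static {
        put("path construction", "m", "l", "c", "v", "y", "h", "re");
        put("path painting", "S", "f", "n");
        put("clipping", "W");
        put("color", "cs", "CS", "scn", "SCN", "g", "G");
        put("graphics state", "M");
        put("marked content", "BDC", "EMC");
    }

    // Extracts the operator from a log line of the form shown in the report
    // and returns its category, or "unknown" if it is not in the table.
    static String categorize(String logLine) {
        String marker = "operation: ";
        int i = logLine.indexOf(marker);
        if (i < 0) {
            return "not an operator log line";
        }
        String op = logLine.substring(i + marker.length()).trim();
        return CATEGORY.getOrDefault(op, "unknown");
    }

    public static void main(String[] args) {
        String line = "SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: scn";
        System.out.println(categorize(line));
    }
}
```

      Operator names are case-sensitive (for example, m starts a path while M sets the miter limit), which is why the lookup does not normalize case.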

      I got the same error with icu4j 3.6.1 and 4.2.1

      1. pdf620-fails.pdf
        569 kB
        Nicholas Cottrell
      2. pdf620-works.pdf
        604 kB
        Nicholas Cottrell

          Activity

          Nicholas Cottrell added a comment -

          Wonderful, thanks everyone!

          Tilman Hausherr added a comment -

          Closing, it works for 1.8.5 and 2.0.

          Villu Ruusmann added a comment -

          You are correct that this is a font encoding issue. All the fonts in file "pdf620-works.pdf" have explicit encodings set (open the file in Acrobat Reader and check "File" -> "Document Properties..." -> "Fonts"), whereas the ones in file "pdf620-fails.pdf" do not.

          The good news is that PDFBox's Type1C font support has been improved recently. If you try out the latest PDFBox 1.0.1-SNAPSHOT (you might need to apply PDFBOX-619 to SVN trunk if it is not there yet), this issue should be gone.

          Below are my text extraction results:
          Dermoapo made 'interactive updates' a key part of its strategy for laun-
          ching a new skincare range in a competitive market. The result? Increased
          sales for pharmacies that used the updates.
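
          To illustrate how a missing encoding could produce both symptoms from the report, here is a minimal sketch. It is not PDFBox code: the class, the tiny code-to-character table and the fallback parameter are all hypothetical. It relies only on a standard Java behavior, namely that appending a null String to a StringBuilder inserts the four characters "null". A decoder that substitutes a placeholder for unmapped codes would emit "?" instead.

```java
import java.util.HashMap;
import java.util.Map;

public class MissingEncodingSketch {
    // Hypothetical, deliberately incomplete code-to-character table.
    // Code 2 (imagine a glyph for "f ") has no entry, as if the font
    // carried no usable encoding for it.
    static final Map<Integer, String> PARTIAL_ENCODING = new HashMap<>();

    static {
        PARTIAL_ENCODING.put(1, "o");
        PARTIAL_ENCODING.put(3, "i");
        PARTIAL_ENCODING.put(4, "t");
        PARTIAL_ENCODING.put(5, "s");
    }

    // Decodes character codes to text. If fallback is null, unmapped codes
    // are appended as a null String, which StringBuilder renders as "null".
    static String decode(int[] codes, String fallback) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String ch = PARTIAL_ENCODING.get(code);
            if (ch == null && fallback != null) {
                ch = fallback;
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3, 4, 5};
        System.out.println(decode(codes, null)); // prints "onullits"
        System.out.println(decode(codes, "?"));  // prints "o?its"
    }
}
```

          Both outputs match fragments quoted in the description ("onullits" in 0.7.3, "o?its" in 0.8.0), which is consistent with the lookup failing before any normalization takes place.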

          Nicholas Cottrell added a comment (edited) -

          The sequences "ff", "xp" and "ze" and the Windows "fancy" (curly) apostrophe also seem to get messed up.

          Nicholas Cottrell added a comment -

          This file generates strange errors in extraction

          Nicholas Cottrell added a comment (edited) -

          This PDF file works


            People

            • Assignee: Unassigned
            • Reporter: Nicholas Cottrell
            • Votes: 0
            • Watchers: 4
