PDFBOX-620

Text extract fails on some PDF files but not others...

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 0.7.3, 0.8.0-incubator
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Tried in Java 5 and 6

      Description

      The same problem occurs with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 the extracted text contains the literal string "null", e.g. "Dermoapo made 'interactive updates' a key part onullits stratenull nullr launnull chinnulla new skincare rannull in a competitive market. nulle resultnullIncreased sales nullr pharmacies that used the updates." In 0.8.0 the same positions come out as "?": "Dermoapo made 'interactive updates' a key part o?its strate? ?r laun? chin?a new skincare ran? in a competitive market. ?e result?Increased sales ?r pharmacies that used the updates."

      Maybe this is a font problem? Or an encoding problem? I debugged the code in PDFTextStripper, and the bad characters are already present in the charactersByArticle field before normalization.
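
      The literal "null" in the 0.7.3 output matches standard Java behavior: appending a null String reference to a StringBuilder produces the four characters "null". The sketch below assumes, purely for illustration, that a glyph-code-to-text lookup returns null for codes missing from a font's mapping. The class, the methods, and the map contents are hypothetical and are not PDFBox's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class NullGlyphDemo {

    /** Appends each looked-up string as-is; an unmapped code yields a null reference. */
    static String decodeNaive(Map<Integer, String> glyphToText, int[] codes) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            // StringBuilder.append((String) null) appends the literal text "null".
            out.append(glyphToText.get(code));
        }
        return out.toString();
    }

    /** Substitutes a visible placeholder for unmapped codes instead. */
    static String decodeWithPlaceholder(Map<Integer, String> glyphToText, int[] codes) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String text = glyphToText.get(code);
            out.append(text != null ? text : "?");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical mapping: code 99 is deliberately left unmapped.
        Map<Integer, String> glyphToText = new HashMap<>();
        glyphToText.put(1, "o");
        glyphToText.put(2, "its");
        int[] codes = {1, 99, 2};

        System.out.println(decodeNaive(glyphToText, codes));           // onullits
        System.out.println(decodeWithPlaceholder(glyphToText, codes)); // o?its
    }
}
```

      If this hypothesis holds, the difference between the 0.7.3 and 0.8.0 output would be only in how an unmapped code is rendered, not in whether the code can be mapped.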

      In 0.8.0 I get some info logs from the engine:

      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: re
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: W
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: n
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: cs
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: scn
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: f
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: CS
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: SCN
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: M
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: m
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: l
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: S
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: c
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: v
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: y
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: h
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: g
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: G
      SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC
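
      For triage, it may help to sort the logged operators by what they do in the PDF content-stream operator set. The sketch below groups them according to the PDF specification's operator summary; the class and method names are hypothetical. None of the logged operators is a text-showing operator (Tj, TJ, ' or "), which suggests, without proving, that these log lines are not the cause of the garbled text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OperatorKinds {

    // Groupings follow the operator summary in the PDF specification.
    static final Set<String> PATH = set("m", "l", "c", "v", "y", "h", "re", "W", "n", "f", "S");
    static final Set<String> COLOR = set("cs", "scn", "CS", "SCN", "g", "G");
    static final Set<String> LINE_STYLE = set("M");
    static final Set<String> MARKED_CONTENT = set("BDC", "EMC");
    static final Set<String> TEXT_SHOWING = set("Tj", "TJ", "'", "\"");

    private static Set<String> set(String... ops) {
        return new HashSet<>(Arrays.asList(ops));
    }

    /** Returns a coarse category name for a content-stream operator. */
    static String kind(String op) {
        if (TEXT_SHOWING.contains(op)) return "text-showing";
        if (PATH.contains(op)) return "path";
        if (COLOR.contains(op)) return "color";
        if (LINE_STYLE.contains(op)) return "line-style";
        if (MARKED_CONTENT.contains(op)) return "marked-content";
        return "other";
    }

    public static void main(String[] args) {
        // The operators reported in the info logs above, in order.
        String[] logged = {"re", "W", "n", "cs", "scn", "f", "CS", "SCN", "M", "m",
                           "l", "S", "BDC", "c", "v", "y", "h", "g", "G", "EMC"};
        for (String op : logged) {
            System.out.println(op + " -> " + kind(op));
        }
    }
}
```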

      The same error occurs with both icu4j 3.6.1 and icu4j 4.2.1.

      1. pdf620-works.pdf
        604 kB
        Nicholas Cottrell
      2. pdf620-fails.pdf
        569 kB
        Nicholas Cottrell

          Activity

          Nicholas Cottrell created issue -
          Nicholas Cottrell made changes -
          Field Original Value New Value
          Attachment pdf620-works.pdf [ 12435833 ]
          Nicholas Cottrell made changes -
          Attachment pdf620-fails.pdf [ 12435834 ]
          Nicholas Cottrell made changes -
          Description [ extended with the PDFStreamEngine info logs and the icu4j note; full text under Description above ]

          Nicholas Cottrell made changes -
          Comment [ The garbled character always represents at least 2 characters (sometimes one is a space) ]
          Villu Ruusmann made changes -
          Link This issue requires PDFBOX-619 [ PDFBOX-619 ]
          Villu Ruusmann made changes -
          Link This issue requires PDFBOX-542 [ PDFBOX-542 ]
          Tilman Hausherr made changes -
          Status Open [ 1 ] Closed [ 6 ]
          Resolution Implemented [ 10 ]

            People

            • Assignee: Unassigned
            • Reporter: Nicholas Cottrell
            • Votes: 0
            • Watchers: 4
