PDFBox / PDFBOX-981

PDColorSpaceFactory does not recognize colorspace DeviceGray (patch included herein)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: Text extraction
    • Labels:

      Description

      I was trying to use PDFTextStripper to extract text from a large corpus of PDF files. In some of them, the method:

      org.apache.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace( COSBase colorSpace, Map colorSpaces )

      fails to recognize the case when the colorSpace argument is of type COSArray and the array's (first) element corresponds to COSName.DEVICEGRAY. Adding that case successfully parses the files that failed with the stock pdfbox-1.5.0. Below is a diff of my patched PDColorSpaceFactory that handles the case where the colorspace name is DeviceGray. Incidentally, it occurs to me that another (possibly better) approach is to call through to createColorSpace(String) when no other case matches.

      % diff PDColorSpaceFactory.java.orig PDColorSpaceFactory.java
      94a95,97
      > else if ( type.getName().equals( PDDeviceGray.NAME) ) {
      >     retval = new PDDeviceGray();
      > }
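
The alternative mentioned in the description, falling through to the name-based lookup when no array-specific case matches, might look roughly like the sketch below. It is a simplified model and not PDFBox code: color spaces are represented as plain strings, and `ColorSpaceDispatchSketch`, `createFromName`, and `createFromArray` are hypothetical names standing in for the two `createColorSpace` overloads.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the dispatch discussed in this issue.
// All names here are hypothetical stand-ins, not the PDFBox API.
class ColorSpaceDispatchSketch {

    // Stands in for the name-based lookup, createColorSpace(String).
    static String createFromName(String name) {
        switch (name) {
            case "DeviceGray":
            case "DeviceRGB":
            case "DeviceCMYK":
                return name;
            default:
                throw new IllegalArgumentException("Unknown color space: " + name);
        }
    }

    // Stands in for the array branch of createColorSpace(COSBase, Map).
    static String createFromArray(List<String> array) {
        String type = array.get(0);
        if (type.equals("ICCBased") || type.equals("Indexed")) {
            // Families that need the operands following the name.
            return type + array.subList(1, array.size());
        }
        // No array-specific case matched: treat the first element as a
        // plain name, so an array such as [DeviceGray] resolves like the
        // bare name would.
        return createFromName(type);
    }

    public static void main(String[] args) {
        System.out.println(createFromArray(Arrays.asList("DeviceGray")));
        System.out.println(createFromArray(Arrays.asList("Indexed", "DeviceRGB", "255")));
    }
}
```

With a fallback of this shape, a device color space wrapped in an array would not need its own `else if` branch, and unknown names would still fail loudly. Whether that suits the real factory depends on details of PDFBox that this sketch does not model.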
      1. example.pdf
        72 kB
        Matt England
      2. PDColorSpaceFactory.java.diff
        0.1 kB
        Matt England

        Activity

        Matt England created issue -
        Matt England added a comment -

        Example PDF file that fails with stock 1.5.0 but passes with the included patch. I used PDFTextStripper like so:

        (new PDFTextStripper()).getText(PDDocument.load(new FileInputStream("example.pdf")))

        Matt England made changes -
        Field Original Value New Value
        Attachment example.pdf [ 12473724 ]
        Matt England added a comment -

        Patch for PDColorSpaceFactory

        Matt England made changes -
        Attachment PDColorSpaceFactory.java.diff [ 12473725 ]
        Andreas Lehmkühler added a comment -

        I added the proposed patch in revision 1083488.

        Thanks for the contribution!

        Andreas Lehmkühler made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Andreas Lehmkühler [ lehmi ]
        Fix Version/s 1.6.0 [ 12316242 ]
        Resolution Fixed [ 1 ]
        Andreas Lehmkühler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Andreas Lehmkühler
            Reporter:
            Matt England
          • Votes:
            0
            Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:
