  1. PDFBox
  2. PDFBOX-981

PDColorSpaceFactory does not recognize colorspace DeviceGray (patch included herein)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: Text extraction

      Description

      I was using PDFTextStripper to extract text from a large corpus of PDF files. For some of them, the method

      org.apache.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace( COSBase colorSpace, Map colorSpaces )

      does not handle the case where the colorSpace argument is a COSArray whose first element is COSName.DEVICEGRAY. With that case added, the files that failed with stock pdfbox-1.5.0 parse successfully. Below is a diff of my patched PDColorSpaceFactory, which handles the DeviceGray colorspace name. Another, possibly better, approach would be to fall through to createColorSpace(String) when no other case matches.

      % diff PDColorSpaceFactory.java.orig PDColorSpaceFactory.java
      94a95,97
      > else if ( type.getName().equals( PDDeviceGray.NAME) )
      > {
      >     retval = new PDDeviceGray();
      > }
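
      To illustrate the fall-through alternative mentioned in the description, here is a minimal stand-alone sketch. It is not PDFBox code: the class ColorSpaceDispatchSketch and its two methods are hypothetical, and plain strings stand in for COSName values and for the PDColorSpace objects the real factory returns. It also assumes that the name-based overload already resolves DeviceGray.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the factory's dispatch logic; not part of PDFBox.
class ColorSpaceDispatchSketch {

    // Models createColorSpace(String): resolves the simple device color space names.
    static String createByName(String name) {
        switch (name) {
            case "DeviceGray": return "PDDeviceGray";
            case "DeviceRGB":  return "PDDeviceRGB";
            case "DeviceCMYK": return "PDDeviceCMYK";
            default:
                throw new IllegalArgumentException("Unknown color space: " + name);
        }
    }

    // Models the COSArray branch of createColorSpace(COSBase, Map):
    // the first array element names the color space type.
    static String createFromArray(List<String> array) {
        String type = array.get(0);
        if (type.equals("ICCBased")) {
            return "PDICCBased";
        } else if (type.equals("Indexed")) {
            return "PDIndexed";
        } else {
            // Fall through to the name-based lookup instead of failing,
            // so names such as DeviceGray need no dedicated branch here.
            return createByName(type);
        }
    }

    public static void main(String[] args) {
        System.out.println(createFromArray(Arrays.asList("DeviceGray")));
    }
}
```

      With this shape, any name the string-based lookup understands is also accepted in array form, whereas the three-line patch above covers DeviceGray only.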

        Attachments

        1. example.pdf
          72 kB
          Matt England
        2. PDColorSpaceFactory.java.diff
          0.1 kB
          Matt England


            People

            • Assignee: Andreas Lehmkühler (lehmi)
            • Reporter: Matt England (mattengland)
