PDFBox / PDFBOX-1169

Images extracted from PDF are losing color (they are shown in black)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.8.0
    • Component/s: Utilities
    • Environment:
      Windows

      Description

      Using PDFBox, I tried to read the attached file eBook-Mini.pdf.
      When images are extracted using the code given below, they do not match the images in the PDF: they have lost their color.
      Extracting the same images with other tools works correctly.
      The images extracted using PDFBox are attached as well.

      1. eBook-Mini.pdf
        1.66 MB
        susheel
      2. image-1.jpg
        18 kB
        susheel
      3. image-2.jpg
        854 kB
        susheel

          Activity

          susheel added a comment -

          eBook-Mini.pdf is the sample PDF from which we extracted the images.

          susheel added a comment -

          Images which were extracted after reading the PDF using PDFBox.

          susheel added a comment -

          Code used to extract the images:

          private void processImages(PDResources resources, String destinationFolder) throws IOException {
              Map images = resources.getImages();

              if (images != null) {
                  Iterator imageIter = images.keySet().iterator();
                  while (imageIter.hasNext()) {
                      String key = (String) imageIter.next();
                      PDXObjectImage image = (PDXObjectImage) images.get(key);
                      String name = destinationFolder + "image-" + imageCounter++ + "." + image.getSuffix();
                      // image.write2file(name); - Tried image.write2file as well, but the retrieved images were similar
                      BufferedImage bufferedImage = image.getRGBImage();
                      File outputfile = new File(name);
                      ImageIO.write(bufferedImage, image.getSuffix(), outputfile);
                      System.out.println("szaveri - using imageio to write files " + name + " suffix =" + image.getSuffix());
                  }
              }
          }

          Please note that of the 200-odd images in the PDF, only two were extracted correctly; all the rest come out with a black background.

          I am sure I am missing some configuration or some other parameter, but I am unable to find it.

          Update: I have also added the following JAI jars to my project:
          jai_codec
          jai_core
          mlibwrapper_jai

          Andreas Lehmkühler added a comment -

          I found three different issues:

          • The given PDF contains two images that are embedded in an XObjectForm which is itself embedded in another XObjectForm; these could not be extracted using ExtractImages. I fixed that in revision 1209017.
          • PDJpeg.write2OutputStream assumed that every PDJpeg contains JPEG image data because the DCTFilter is used, but a PDJpeg may also contain CMYK-encoded image data, as in the given PDF. I fixed that in revision 1209015.
          • The colors of the image are wrong, but I don't know why yet. I'm still investigating.
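
For readers who want to check which kind of DCT-encoded data an image stream holds, here is a minimal sketch. It is not PDFBox code. It scans the JPEG header segments for the component count (from the start-of-frame segment) and for the color transform flag in the Adobe APP14 segment, where 0 means no transform, 1 means YCbCr and 2 means YCCK. The class name `JpegColorInfo` and its methods are hypothetical names introduced only for this illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads color-related fields from the JPEG header segments. */
public final class JpegColorInfo {
    /** Number of color components from the start-of-frame segment, 0 if none was found. */
    public final int components;
    /** Adobe APP14 transform flag (0 = none, 1 = YCbCr, 2 = YCCK), -1 if there is no APP14 segment. */
    public final int adobeTransform;

    private JpegColorInfo(int components, int adobeTransform) {
        this.components = components;
        this.adobeTransform = adobeTransform;
    }

    /** True when the header describes four components with the Adobe YCCK transform. */
    public boolean isYcck() {
        return components == 4 && adobeTransform == 2;
    }

    public static JpegColorInfo parse(byte[] jpeg) {
        if (jpeg.length < 2 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("Missing JPEG SOI marker");
        }
        int components = 0;
        int transform = -1;
        int pos = 2;
        while (pos + 4 <= jpeg.length) {
            if ((jpeg[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("Expected a marker at offset " + pos);
            }
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xFF) {          // fill byte before the real marker
                pos++;
                continue;
            }
            if (marker == 0xDA || marker == 0xD9) {
                break;                     // start of scan or end of image: header is over
            }
            // The segment length counts its own two bytes but not the marker.
            int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            int payload = pos + 4;
            int end = pos + 2 + length;
            if (length < 2 || end > jpeg.length) {
                throw new IllegalArgumentException("Truncated segment at offset " + pos);
            }
            if (isStartOfFrame(marker) && length >= 8) {
                // Payload: precision (1), height (2), width (2), component count (1).
                components = jpeg[payload + 5] & 0xFF;
            } else if (marker == 0xEE && length >= 14
                    && "Adobe".equals(new String(jpeg, payload, 5, StandardCharsets.US_ASCII))) {
                // Payload: "Adobe" (5), version (2), flags0 (2), flags1 (2), transform (1).
                transform = jpeg[payload + 11] & 0xFF;
            }
            pos = end;
        }
        return new JpegColorInfo(components, transform);
    }

    private static boolean isStartOfFrame(int marker) {
        // 0xC0..0xCF are frame headers except DHT (0xC4), JPG (0xC8) and DAC (0xCC).
        return marker >= 0xC0 && marker <= 0xCF
                && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
    }
}
```

In the PDFBox 1.x setting of this issue, the bytes would presumably come from the image's raw DCT-encoded stream; how to obtain them is left out here because it depends on the PDFBox version in use.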
          susheel added a comment -

          Dear Andreas

          I hope you can crack the third issue quickly.

          We have taken your two fixes and run the test on the PDF that we have. Image quality has improved considerably. I am sure that once we have the fix for the final issue from your end, we will be able to extract the images from the PDF quite easily.

          If you need any data / inputs from our end, kindly let us know.

          Thanks
          Susheel Zaveri

          Andreas Lehmkühler added a comment -

          I guess the remaining issue is caused by a missing feature called overprint control, which is part of the extended graphics state. PDFBOX-1223 describes a similar issue.

          Andreas Lehmkühler added a comment -

          My earlier guess was wrong. The JPEG uses a CMYK color space, but the image data are encoded in a YCCK color space.
          I added a YCCK2RGB decoder in revision 1395294.

          Thanks for the report!
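
To illustrate what a YCCK-to-RGB conversion involves, here is a minimal per-pixel sketch. It is not the decoder added in revision 1395294, and the class name `YcckToRgb` and its methods are hypothetical. It follows the convention used by libjpeg: the Y, Cb and Cr samples are converted to RGB with the usual JFIF equations, that result is inverted to give C, M and Y, and K passes through unchanged. The CMYK-to-RGB step is a naive, profile-free approximation and assumes ink values where 0 means no ink and 255 means full ink; files that store inverted ink values would need to be flipped first, and a color-accurate converter would use an ICC profile instead.

```java
/** Hypothetical per-pixel YCCK to RGB conversion, all samples in the range 0..255. */
public final class YcckToRgb {
    private YcckToRgb() {
    }

    /** YCCK to CMYK: JFIF YCbCr to RGB, inverted to CMY, with K passed through. */
    public static int[] toCmyk(int y, int cb, int cr, int k) {
        int r = clamp(Math.round(y + 1.402f * (cr - 128)));
        int g = clamp(Math.round(y - 0.344136f * (cb - 128) - 0.714136f * (cr - 128)));
        int b = clamp(Math.round(y + 1.772f * (cb - 128)));
        return new int[] {255 - r, 255 - g, 255 - b, k};
    }

    /** Naive CMYK to RGB without a color profile (assumes 0 = no ink, 255 = full ink). */
    public static int[] cmykToRgb(int c, int m, int y, int k) {
        int r = (255 - c) * (255 - k) / 255;
        int g = (255 - m) * (255 - k) / 255;
        int b = (255 - y) * (255 - k) / 255;
        return new int[] {r, g, b};
    }

    /** Convenience method chaining both steps. */
    public static int[] toRgb(int y, int cb, int cr, int k) {
        int[] cmyk = toCmyk(y, cb, cr, k);
        return cmykToRgb(cmyk[0], cmyk[1], cmyk[2], cmyk[3]);
    }

    private static int clamp(int value) {
        return Math.max(0, Math.min(255, value));
    }
}
```

Under these assumptions, `YcckToRgb.toRgb(255, 128, 128, 0)` gives white and `YcckToRgb.toRgb(255, 128, 128, 255)` gives black. Treating YCCK samples as if they were already CMYK skips the first step entirely, which is one way extracted images can end up with wrong colors.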


            People

            • Assignee: Andreas Lehmkühler
            • Reporter: susheel
            • Votes: 0
            • Watchers: 2
