  1. PDFBox
  2. PDFBOX-4934

Could not find referenced cmap stream Adobe-Japan1-XXXX

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.20
    • Fix Version/s: 2.0.22, 3.0.0 PDFBox
    • Component/s: FontBox
    • Labels: None
    • Environment: Windows10, 64bit

    Description

      An IOException occurs when the attached PDF is fed into PDFBox.

      The attached PDF file (JP.pdf) references the Adobe-Japan1-65534 CMap.
      The source code is as follows.

      import java.awt.image.BufferedImage;
      import java.io.File;

      import javax.imageio.ImageIO;

      import org.apache.commons.io.FileUtils;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.rendering.ImageType;
      import org.apache.pdfbox.rendering.PDFRenderer;
      import org.apache.pdfbox.text.PDFTextStripper;

      public class pdfBoxTest {
          public static void main(String[] args) throws Exception {
              pdfBoxTest sample = new pdfBoxTest();

              String pdfname = "D:/tmp/jp.pdf";
              File pdf = FileUtils.getFile(pdfname);

              sample.extractTextFromPDF(pdf);
              sample.load(pdf);
          }

          // This method is not shown in the report; the body is an assumed minimal version.
          public void extractTextFromPDF(File pdf) throws Exception {
              try (PDDocument document = PDDocument.load(pdf)) {
                  System.out.println(new PDFTextStripper().getText(document));
              }
          }

          // Renders the first page at 300 DPI and writes it out as a JPEG.
          public void load(File pdf) throws Exception {
              try (PDDocument document = PDDocument.load(pdf)) {
                  PDFRenderer renderer = new PDFRenderer(document);
                  BufferedImage bufImage = renderer.renderImageWithDPI(0, 300, ImageType.RGB);

                  ImageIO.write(bufImage, "jpg", new File("D:/tmp/jp.jpg"));
              }
          }
      }
      

      The getExternalCMap method in CMapParser tries to find the external CMap, but
      it cannot find Adobe-Japan1-65534 and throws an exception.

      I know that no such CMap exists, but the PDF file itself opens without problems,
      so I think it would be better not to throw an exception and to use another CMap instead.
      As a temporary measure I modified the source code as shown below, and it works well.

      protected InputStream getExternalCMap(String name) throws IOException {
          InputStream is = this.getClass().getResourceAsStream(name);
          if (is == null) {
              // Requested CMap is missing: substitute a fixed supplement of the same collection.
              if (name.startsWith("Adobe-Japan1")) {
                  name = "Adobe-Japan1-1";
              } else if (name.startsWith("Adobe-Korea1")) {
                  name = "Adobe-Korea1-1";
              }
              is = this.getClass().getResourceAsStream(name);
              if (is == null) {
                  throw new IOException("Error: Could not find referenced cmap stream " + name);
              }
          }

          return is;
      }
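
      The prefix check in the workaround above can be generalized to any name of the
      form Registry-Ordering-Supplement. The following is only a minimal sketch under
      stated assumptions: CMapFallback, resolve and the isAvailable predicate are
      hypothetical names that do not exist in PDFBox, and falling back to supplement 1
      simply mirrors the workaround rather than any confirmed fix.

```java
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox: picks a substitute CMap name. */
final class CMapFallback {
    // Matches Registry-Ordering-Supplement names such as "Adobe-Japan1-65534".
    private static final Pattern ROS = Pattern.compile("(.+-.+)-\\d+");

    private CMapFallback() {
    }

    /**
     * Returns name if it is available. Otherwise returns the same registry and
     * ordering with supplement 1 (assumption taken from the workaround), or
     * null if that is not available either.
     */
    static String resolve(String name, Predicate<String> isAvailable) {
        if (isAvailable.test(name)) {
            return name;
        }
        Matcher m = ROS.matcher(name);
        if (m.matches()) {
            String fallback = m.group(1) + "-1";
            if (isAvailable.test(fallback)) {
                return fallback;
            }
        }
        return null;
    }
}
```

      A caller would pass a predicate that checks whether the resource exists, for
      example n -> getClass().getResource(n) != null, and would decide for itself what
      to do when resolve returns null.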
      

      But this is not a fundamental fix.
      If possible, I would like to ask you to modify the source code so that it does not
      throw an exception when it cannot find a CMap.
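
      One way to picture the request is a lookup that reports a missing CMap to the
      caller instead of throwing. This is only a sketch: CMapLookup and find are
      hypothetical names, not PDFBox API, and whether the calling code can continue
      without the CMap is an assumption that would have to be checked.

```java
import java.io.InputStream;
import java.util.Optional;

/** Hypothetical helper, not part of PDFBox: reports absence instead of throwing. */
final class CMapLookup {
    private CMapLookup() {
    }

    /**
     * Looks up the resource relative to the anchor class and returns an empty
     * Optional when it does not exist, leaving the decision to the caller.
     */
    static Optional<InputStream> find(Class<?> anchor, String name) {
        return Optional.ofNullable(anchor.getResourceAsStream(name));
    }
}
```

      The caller could then log a warning and continue, or try a substitute name,
      when the result is empty.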

      I also found a Korean PDF file that references the Adobe-Korea1-3 CMap.

      Please refer to attached file.

      Thanks!

      //Okada

      Attachments

        1. PDFBOX-4934-Korea.pdf-1.png
          116 kB
          Tilman Hausherr
        2. PDFBOX-4934-JP.pdf-1.png
          183 kB
          Tilman Hausherr
        3. PDFBOX-4934-Korea.pdf.txt
          4 kB
          Tilman Hausherr
        4. PDFBOX-4934-JP.pdf.txt
          2 kB
          Tilman Hausherr
        5. Korea.pdf
          561 kB
          Shigeru Okada
        6. JP.pdf
          504 kB
          Shigeru Okada

          People

            Assignee: Tilman Hausherr
            Reporter: Shigeru Okada