PDFBox / PDFBOX-1304

Text extraction fails with "Could not parse predefined CMAP" and returns only a small part of the content, mixed with garbage characters.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Win7 32bits

      Description

      I'm using pdfbox-1.6.0 to extract text from a Chinese PDF file (see the attachment "fj.pdf").

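      The extraction code looks like this:
      [code]
      // encoding is the output encoding name; _pdfDoc is the already loaded PDF document
      PDFTextStripper stripper = new PDFTextStripper(encoding);
      String txt = stripper.getText(_pdfDoc);
      [/code]
      When getText() runs, the console prints the following (the JVM uses a Chinese locale, so the timestamps are localized and the log level "严重" corresponds to SEVERE):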
      [console]
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKUO1-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
      五月 06, 2012 4:13:51 下午 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
      严重: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
      [/console]
      After getText() returns, txt contains only a small part of the PDF content (much of it is missing) plus some garbage characters such as "犖犑狌犣犎犗犝犔犻犺犅" (see the attachment "fj.txt").

      I've heard that "org.apache.pdfbox.cos.COSString.java" had some errors in pdfbox-0.7.3. Has COSString.java been corrected in 1.6.0?
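
      For triage, it may help to know which CMap names are involved. The console output above repeats the same message 13 times but names only four distinct CMaps (Founder-PKU2-UCS2, Founder-PKUE1-UCS2, Founder-PKUO1-UCS2, Founder-PKUF1-UCS2). The following is a minimal sketch, not part of PDFBox, that tallies the names from captured log lines. The class name CmapLogSummary and the method countFailures are hypothetical, and the sample lines in main are shortened copies of the messages above without the timestamp and level prefix.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for summarizing the console output; it does not use PDFBox.
class CmapLogSummary {
    // Matches the error text reported in this issue and captures the CMap name.
    private static final Pattern CMAP_ERROR =
            Pattern.compile("Could not parse predefined CMAP file for '([^']+)'");

    /** Counts how often each CMap name appears in the log lines, in first-seen order. */
    static Map<String, Integer> countFailures(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = CMAP_ERROR.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample lines (assumption: timestamp and level prefix removed).
        List<String> sample = Arrays.asList(
                "Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'",
                "Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'",
                "Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'");
        countFailures(sample).forEach(
                (name, n) -> System.out.println(name + ": " + n));
    }
}
```

      Feeding the full console output to countFailures would give one entry per distinct CMap name, which narrows the question to whether PDFBox 1.6.0 ships or can resolve those Founder CMaps.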

      1. fj.pdf
        804 kB
        Huan LI
      2. fj.txt
        5 kB
        Huan LI

        Activity

        There are no comments yet on this issue.

          People

          • Assignee:
            Unassigned
            Reporter:
            Huan LI
          • Votes:
            0
            Watchers:
            1
