PDFBox / PDFBOX-654

Extracting CJK text

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: Text extraction
    • Labels: None

    Description

      This is an update for PDFBOX-420 filed by Takashi Komatsubara.

      With this patch, if "Identity-H" is used as the encoding of a font and the font does not supply a TO_UNICODE table, the encoding name is generated from the font's CID information (Registry and Ordering). The idea is borrowed from pdfminer[1], another PDF library written in Python. I don't see any test failures with this patch.
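
      The fallback described above can be sketched roughly as follows. This is only an illustration, not the code from identity-h.patch: the class and method names are hypothetical, nothing here is part of the PDFBox API, and the "-UCS2" suffix is an assumption based on the names of Adobe's CID-to-Unicode CMap resources (for example Adobe-Japan1-UCS2).

```java
// Hypothetical helper for illustration only; not part of PDFBox or the patch.
final class CidEncodingNameSketch {

    private CidEncodingNameSketch() {
    }

    /**
     * Returns the encoding name to use for text extraction.
     *
     * If the font uses "Identity-H" and has no TO_UNICODE table, a name is
     * built from the CID Registry and Ordering. The "-UCS2" suffix is an
     * assumption (it matches Adobe CMap names such as Adobe-Japan1-UCS2).
     * In every other case the original encoding name is returned unchanged.
     */
    static String resolveEncodingName(String encodingName,
                                      boolean hasToUnicode,
                                      String registry,
                                      String ordering) {
        boolean isIdentityH = "Identity-H".equals(encodingName);
        if (!isIdentityH || hasToUnicode) {
            // Either a named encoding or a usable TO_UNICODE table exists.
            return encodingName;
        }
        if (registry == null || ordering == null) {
            // No CID information available, so nothing can be derived.
            return encodingName;
        }
        return registry + "-" + ordering + "-UCS2";
    }
}
```

      For a font with Registry "Adobe" and Ordering "Japan1", this sketch yields "Adobe-Japan1-UCS2"; a font that already carries a TO_UNICODE table keeps "Identity-H".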

      I published this patch last year[2] and received good feedback from Japanese users[3].

      [1] http://www.unixuser.org/~euske/python/pdfminer/index.html
      [2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja,
      https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
      [3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539

      Attachments

        1. China.pdf
          243 kB
          Takashi Komatsubara
        2. identity-h.patch
          5 kB
          Atsuo Ishimoto

            People

              Assignee: Unassigned
              Reporter: Atsuo Ishimoto (aishimoto)