Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: Text extraction
    • Labels: None

      Description

      This is an update for PDFBOX-420 filed by Takashi Komatsubara.

      With this patch, if "Identity-H" is used as the font's encoding and the font doesn't supply a TO_UNICODE table, the encoding name is generated from the font's CID information (Registry and Ordering). The idea is borrowed from pdfminer[1], another PDF library, written in Python. I don't see any test failures with this patch.
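
      The fallback described above can be sketched roughly as follows. This is only an illustration, not the code in identity-h.patch: the class name IdentityHFallback, the method deriveEncodingName, and the choice to join Registry and Ordering with a hyphen are all assumptions made for this example, and no PDFBox classes are used.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of PDFBox or the patch.
public final class IdentityHFallback {
    private IdentityHFallback() {}

    /**
     * Returns a derived encoding name when the font uses Identity-H and has
     * no TO_UNICODE table; returns empty when the normal lookup should apply
     * or when the CID information is incomplete.
     */
    public static Optional<String> deriveEncodingName(
            String encoding, boolean hasToUnicode, String registry, String ordering) {
        if (!"Identity-H".equals(encoding) || hasToUnicode) {
            return Optional.empty(); // the regular encoding path handles this font
        }
        if (registry == null || registry.isEmpty()
                || ordering == null || ordering.isEmpty()) {
            return Optional.empty(); // not enough CID information to build a name
        }
        // Assumption: the name is Registry and Ordering joined with a hyphen.
        return Optional.of(registry + "-" + ordering);
    }

    public static void main(String[] args) {
        // Example values; "Adobe" / "Japan1" is a common CID system for Japanese fonts.
        System.out.println(deriveEncodingName("Identity-H", false, "Adobe", "Japan1"));
    }
}
```

      In this sketch, a font that already carries a TO_UNICODE table, or that uses an encoding other than Identity-H, is left to the existing lookup, so the fallback only affects fonts that previously had no usable mapping.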

      I published this patch last year[2] and got some good feedback from Japanese users[3].

      [1] http://www.unixuser.org/~euske/python/pdfminer/index.html
      [2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja,
      https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
      [3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539

        Attachments

        1. identity-h.patch
          5 kB
          Atsuo Ishimoto
        2. China.pdf
          243 kB
          Takashi Komatsubara

              People

              • Assignee:
                Unassigned
                Reporter:
                aishimoto Atsuo Ishimoto