  1. PDFBox
  2. PDFBOX-1652

TextPosition: Japanese alphabetic characters 30fc and 3005 treated as diacritics


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 1.8.1
    • Fix Version/s: None
    • Component/s: Text extraction

    Description

      For the purpose of determining text position, the Japanese characters U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) and U+3005 (IDEOGRAPHIC ITERATION MARK) are currently treated as "simple" diacritics. In terms of text positioning, however, they are full-fledged characters.

      As a result, text extraction can emit some characters in reversed order; in particular, ーン can come out as ンー.
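
To illustrate how such a misclassification could arise, here is a minimal sketch using only `java.lang.Character`. It is not the attached patch and not PDFBox code: the class `DiacriticCheck` and both methods are hypothetical, and the assumption that a diacritic test is based on Unicode general categories is mine, not something this report states. In the Unicode character database both U+30FC and U+3005 have the general category Lm (modifier letter), so a check that counts modifier letters as diacritics would match them, while a check restricted to combining marks (Mn, Mc, Me) would not.

```java
final class DiacriticCheck {

    // Hypothetical lenient test: treats modifier letters and modifier
    // symbols as diacritics in addition to non-spacing marks.
    static boolean isLenientDiacritic(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.MODIFIER_SYMBOL
                || type == Character.MODIFIER_LETTER;
    }

    // Hypothetical strict test: only true combining marks
    // (general categories Mn, Mc and Me) count as diacritics.
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        // U+30FC and U+3005 are the characters from this report;
        // U+3099 and U+0301 are combining marks added for comparison.
        char[] samples = {'\u30FC', '\u3005', '\u3099', '\u0301'};
        for (char c : samples) {
            System.out.printf("U+%04X lenient=%b strict=%b%n",
                    (int) c, isLenientDiacritic(c), isCombiningMark(c));
        }
    }
}
```

Under the strict test, ー and 々 keep their own position in the text run, whereas the combining voiced sound mark U+3099 is still recognized as a mark that attaches to its base character.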

      A patch to fix this is attached.

      Attachments

        1. PDFBOX-1652.patch
          1 kB
          Christian Kohlschütter

        Activity

          People

            Assignee: Unassigned
            Reporter: Christian Kohlschütter (ck@newsclub.de)
            Votes: 0
            Watchers: 3
