  1. PDFBox
  2. PDFBOX-3833

Characters in wrong order

Details

    Description

      The attached PDF file (page 3 of https://jp.mathworks.com/tagteam/89688_93050v00_JP_machine_learning_section1_ebook.pdf) shows multiple problems when read with PDFBox using default settings. This bug report is specifically about the Katakana ー being misplaced.

      In the text block on the left, the second line starts with ターン. PDFTextStripper.getText returns text starting with タ ンー, i.e., it adds a space after the first character and swaps the second and third characters. The same effect also occurs at other places in the complete file.
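
      To make the difference easier to see in a log or a test, the two strings can be dumped as code points. The following is a minimal sketch, not part of the original report: the class name CodePointDump and the method dump are hypothetical, and the inserted space is assumed to be a plain U+0020, as it appears in the text above.

```java
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of PDFBox.
class CodePointDump {
    // Formats each code point of s as U+XXXX, separated by single spaces.
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // ターン, the order shown on the page
        String expected = "\u30BF\u30FC\u30F3";
        // タ ンー, the order reported from getText (space assumed to be U+0020)
        String actual = "\u30BF\u0020\u30F3\u30FC";
        System.out.println("expected: " + dump(expected));
        System.out.println("actual:   " + dump(actual));
    }
}
```

      In the dump, U+30FC (ー) appears after U+30F3 (ン) in the extracted text, while it precedes it in the expected text.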

      At this point, the PDF itself contains [<03BB>43.9 <0294>156 <03EF>-24.5 ...]TJ, which lists the characters in the proper order. Copy and paste using Apple's Preview.app also preserves that order.
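
      The order of the glyph codes in the TJ operand can be checked independently of PDFBox. The sketch below is only an illustration: the class TjArrayOrder and its methods are hypothetical, the input is just the visible part of the array quoted above (the elided remainder is left out), and 2-byte glyph codes are assumed because each hex string shown has four digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox.
class TjArrayOrder {
    // Matches either a hex string such as <03BB> or a number such as -24.5.
    private static final Pattern TOKEN =
            Pattern.compile("<([0-9A-Fa-f]+)>|(-?\\d+(?:\\.\\d+)?)");

    // Returns the glyph codes in the order they appear in the TJ operand.
    // Assumption: every code is 2 bytes (4 hex digits).
    static List<Integer> glyphCodes(String tjOperand) {
        List<Integer> codes = new ArrayList<>();
        Matcher m = TOKEN.matcher(tjOperand);
        while (m.find()) {
            String hex = m.group(1);
            if (hex == null) {
                continue; // numeric adjustment, not a glyph
            }
            for (int i = 0; i + 4 <= hex.length(); i += 4) {
                codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
            }
        }
        return codes;
    }

    // Returns the numeric position adjustments in array order.
    static List<Double> adjustments(String tjOperand) {
        List<Double> values = new ArrayList<>();
        Matcher m = TOKEN.matcher(tjOperand);
        while (m.find()) {
            if (m.group(2) != null) {
                values.add(Double.parseDouble(m.group(2)));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Only the visible prefix of the array from the report.
        String operand = "[<03BB>43.9 <0294>156 <03EF>-24.5]";
        for (int code : glyphCodes(operand)) {
            System.out.printf("glyph code %04X%n", code);
        }
        System.out.println("adjustments: " + adjustments(operand));
    }
}
```

      The glyph codes come out as 03BB, 0294, 03EF, in the order in which they are written in the content stream. The numbers between them are position adjustments in thousandths of a text space unit, not characters.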

      Attachments

        1. ML_mathworks_unc2.pdf
          310 kB
          Christopher Creutzig
        2. PDFBOX-3833-reduced.pdf
          99 kB
          Tilman Hausherr

          People

            tilman Tilman Hausherr
            ccreutzig Christopher Creutzig
