PDFBox / PDFBOX-444

Incorrect Diacritic Merging/Placement

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels: None

    Description

      While looking at the spacing issue in PDFBOX-77, I found a separate problem with the placement of diacritic characters in the file 03_2_SSL.pdf, which I have attached here.
      The problem is that the file uses separate TextPositions to render a character and its diacritic. For example, with the -sort option enabled, the word Änderung is extracted as

      And¨ erung

      where the diacritic lands after the d instead of over the A. Without -sort, the extracted word looks like this:

      ¨Anderung

      This is still incorrect: the A and the diacritic should be merged so that together they take up one character's width. The problem occurs throughout the document.

      PDFBox does currently merge diacritic characters, but it assumes that the TextPosition for the diacritic comes after the TextPosition it should be merged with. In this file, the diacritic TextPosition comes first.
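
      To illustrate the order-independent merge described above, here is a minimal sketch. It is not the attached Diacritic_fix.diff and does not use the PDFBox API: the Glyph class is a hypothetical stand-in for TextPosition, the overlap test and the small diacritic table are assumptions made for illustration, and only adjacent glyphs are compared, so it covers the unsorted case (¨Anderung) rather than the -sort case.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class DiacriticMergeSketch {

    // Hypothetical stand-in for a TextPosition: the text plus its horizontal extent.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        // True when the two glyphs share some horizontal space.
        boolean overlaps(Glyph other) {
            return x < other.x + other.width && other.x < x + width;
        }
    }

    // Maps a spacing diacritic to its combining equivalent, or null if the
    // text is not a diacritic. Illustrative subset only.
    static String combiningFor(String text) {
        switch (text) {
            case "\u00A8": return "\u0308"; // diaeresis
            case "\u00B4": return "\u0301"; // acute
            case "\u0060": return "\u0300"; // grave
            case "\u02C6": return "\u0302"; // circumflex
            default: return null;
        }
    }

    static boolean isDiacritic(Glyph g) {
        return combiningFor(g.text) != null;
    }

    // Composes base + mark into one glyph that keeps the base's position and width.
    static Glyph merge(Glyph base, Glyph mark) {
        String composed = Normalizer.normalize(
                base.text + combiningFor(mark.text), Normalizer.Form.NFC);
        return new Glyph(composed, base.x, base.width);
    }

    // Merges a diacritic with an overlapping neighbour whether it arrives
    // before or after its base character.
    static String extract(List<Glyph> glyphs) {
        List<Glyph> out = new ArrayList<>();
        for (Glyph g : glyphs) {
            int last = out.size() - 1;
            Glyph prev = last >= 0 ? out.get(last) : null;
            if (prev != null && prev.overlaps(g)) {
                if (isDiacritic(g) && !isDiacritic(prev)) {
                    out.set(last, merge(prev, g)); // diacritic follows its base
                    continue;
                }
                if (isDiacritic(prev) && !isDiacritic(g)) {
                    out.set(last, merge(g, prev)); // diacritic precedes its base
                    continue;
                }
            }
            out.add(g);
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : out) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

      Because the merged glyph keeps the base character's position and width, the pair takes up one character's width, which is the behaviour the description asks for.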

      Attachments

        1. Diacritic_fix.diff
          24 kB
          Justin LeFebvre
        2. 03_2_SSL-unsorted.txt
          32 kB
          Justin LeFebvre
        3. 03_2_SSL-sorted.txt
          32 kB
          Justin LeFebvre
        4. 03_2_SSL.pdf
          185 kB
          Justin LeFebvre

            People

              Assignee: Unassigned
              Reporter: Justin LeFebvre (justinl)
              Votes: 0
              Watchers: 0
