PDFBox / PDFBOX-4481

Text extraction error with Thai combined glyph depending on space after it

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.14
    • Fix Version/s: None
    • Component/s: Text extraction

    Description

      In the first extracted line of the reduced file, the "accent" (somebody please correct me on what that thing is called) comes out separately. In the second line it is in the proper place. Content stream:

      BT
        1 0 0 1 67.3 756.98 Tm
        [ (\000\203\000\227\000q) ] TJ
        1 0 0 1 77.5 756.98 Tm
        [ (\000\003) ] TJ
        1 0 0 1 67.3 730 Tm
        [ (\000\203\000\227\000q\000\003) ] TJ
      ET
      

      The weird thing is that the "\003" is just a space. So when the space is part of the same string as the glyphs (second line), the extraction works, and when it is shown by its own TJ (first line), it doesn't.
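      To make the comparison between the two lines concrete, the strings from the content stream can be decoded into character codes. The following is only a sketch in plain Java (no PDFBox involved). It assumes the font uses two-byte character codes, which the \000 high bytes suggest but the snippet does not prove. The class and method names are made up for illustration, and only the escapes that occur in this report are handled.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class ContentStreamCodes {

    /** Decodes the body of a PDF literal string (without the enclosing parentheses) into raw bytes. */
    static byte[] decodeLiteral(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\') {
                out.write(c); // ordinary character, e.g. 'q'
                continue;
            }
            if (i >= body.length()) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char e = body.charAt(i);
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits, e.g. \203
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < body.length()
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else if (e == '(' || e == ')' || e == '\\') {
                out.write(e);
                i++;
            } else {
                // \n, \r, \t etc. are deliberately left out of this sketch
                throw new IllegalArgumentException("unsupported escape: \\" + e);
            }
        }
        return out.toByteArray();
    }

    /** Groups bytes into big-endian two-byte codes (assumption: the font uses two-byte codes). */
    static List<Integer> twoByteCodes(byte[] bytes) {
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of bytes");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < bytes.length; i += 2) {
            codes.add(((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF));
        }
        return codes;
    }

    public static void main(String[] args) {
        // The three strings from the content stream above, in the order they are shown.
        String[] strings = {
            "\\000\\203\\000\\227\\000q",
            "\\000\\003",
            "\\000\\203\\000\\227\\000q\\000\\003"
        };
        for (String s : strings) {
            StringBuilder sb = new StringBuilder();
            for (int code : twoByteCodes(decodeLiteral(s))) {
                sb.append(String.format("0x%04X ", code));
            }
            System.out.println(s + " -> " + sb.toString().trim());
        }
    }
}
```

      Under that assumption, both lines show the same three glyph codes (0x0083, 0x0097, 0x0071) followed by code 0x0003. The only difference visible in the content stream is whether 0x0003 sits in the same string or is shown by a separate TJ after its own Tm.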

      Attachments

        1. SO54981236-reduced.txt
          0.0 kB
          Tilman Hausherr
        2. SO54981236-reduced.pdf
          26 kB
          Tilman Hausherr
        3. SO54981236.pdf
          51 kB
          Tilman Hausherr

            People

              Assignee: Unassigned
              Reporter: Tilman Hausherr