  1. PDFBox
  2. PDFBOX-4481

Text extraction error with Thai combined glyph depending on space after it

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.14
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:

      Description

      In the first extracted line of the reduced file, the "accent" (somebody please correct me on what that thing is called) is extracted separately from its base character. On the second line it is in the proper place. Content stream:

      BT
        1 0 0 1 67.3 756.98 Tm
        [ (\000\203\000\227\000q) ] TJ
        1 0 0 1 77.5 756.98 Tm
        [ (\000\003) ] TJ
        1 0 0 1 67.3 730 Tm
        [ (\000\203\000\227\000q\000\003) ] TJ
      ET
      

      The weird thing is that "\000\003" is just a space. When the space is part of the same string as the combined glyph (second line), extraction works; when it is shown by a separate TJ after its own Tm (first line), it doesn't.
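
      To make the comparison concrete, the string operands from the content stream above can be decoded into character codes. The following is a minimal sketch, not PDFBox's own parsing code. It assumes the font uses two-byte character codes (as a composite font with an Identity-H encoding would), which is consistent with "\000\003" being a single space but is not confirmed in this issue. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoByteCodes {

    /** Decodes the body of a PDF literal string (without the enclosing parentheses) into byte values. */
    static List<Integer> literalToBytes(String body) {
        List<Integer> bytes = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\') {
                bytes.add(c & 0xFF);          // ordinary character, e.g. 'q'
                continue;
            }
            // Octal escape: a backslash followed by one to three octal digits.
            int value = 0;
            int digits = 0;
            while (i < body.length() && digits < 3
                    && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                value = value * 8 + (body.charAt(i) - '0');
                i++;
                digits++;
            }
            if (digits > 0) {
                bytes.add(value & 0xFF);
            } else if (i < body.length()) {
                // Simplification: \( \) and \\ are taken literally; \n, \r etc. are not handled.
                bytes.add(body.charAt(i++) & 0xFF);
            }
        }
        return bytes;
    }

    /** Pairs the bytes into 16-bit codes, high byte first (assumed two-byte encoding). */
    static List<Integer> toCodes(String body) {
        List<Integer> bytes = literalToBytes(body);
        if (bytes.size() % 2 != 0) {
            throw new IllegalArgumentException("odd number of bytes: " + bytes.size());
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < bytes.size(); i += 2) {
            codes.add((bytes.get(i) << 8) | bytes.get(i + 1));
        }
        return codes;
    }

    public static void main(String[] args) {
        // First line: glyphs and space are shown by two separate TJ operators.
        System.out.println(toCodes("\\000\\203\\000\\227\\000q"));        // [131, 151, 113]
        System.out.println(toCodes("\\000\\003"));                        // [3]
        // Second line: the same glyphs with the space in the same string.
        System.out.println(toCodes("\\000\\203\\000\\227\\000q\\000\\003")); // [131, 151, 113, 3]
    }
}
```

      Under that assumption both lines show the same code sequence 131, 151, 113, 3. The only difference is that the first line shows the final code 3 (the space) in its own TJ after a new Tm at x = 77.5, so the differing extraction result would have to come from how the operators are split rather than from the glyph codes themselves.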

        Attachments

        1. SO54981236.pdf
          51 kB
          Tilman Hausherr
        2. SO54981236-reduced.pdf
          26 kB
          Tilman Hausherr
        3. SO54981236-reduced.txt
          0.0 kB
          Tilman Hausherr

              People

              • Assignee:
                Unassigned
              • Reporter:
                Tilman Hausherr
              • Votes:
                0
              • Watchers:
                1