  1. PDFBox
  2. PDFBOX-1424

Wrong Persian glyph is used in extracted text instead of the original glyph in the PDF file

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.1
    • Fix Version/s: 1.8.0
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Windows XP, Java 1.6.0

      Description

      Hi
      I am very new to PDFBox and I am working with Persian PDF files. When I convert Persian PDF files using PDFBox-app, some Persian glyphs, such as م, come out wrong in the extracted text. For example, the word "هستم" is extracted as "هستن", "من" is extracted as "هن", and "سلام" is extracted as "سالم". I have also tested text extraction on a complete multi-page Persian PDF file, and the extracted text is completely wrong. Please let me know if I should upload an example Persian PDF file.
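
      To make the mismatches above easier to pin down, here is a minimal sketch that compares an expected word with the extracted word code point by code point. It is not part of PDFBox: the class name GlyphDiff and the method diff are hypothetical helpers, the strings are written as Unicode escapes for the three word pairs quoted in this report, and it assumes Java 8 or later (newer than the Java 1.6 listed under Environment).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of PDFBox.
public class GlyphDiff {

    /** Lists every index where the code points of the two strings differ. */
    static List<String> diff(String expected, String actual) {
        int[] e = expected.codePoints().toArray();
        int[] a = actual.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int n = Math.max(e.length, a.length);
        for (int i = 0; i < n; i++) {
            // Positions past the end of the shorter string are reported as "(none)".
            String es = i < e.length ? String.format("U+%04X", e[i]) : "(none)";
            String as = i < a.length ? String.format("U+%04X", a[i]) : "(none)";
            if (!es.equals(as)) {
                out.add(i + ": expected " + es + ", got " + as);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Word pairs from the report: expected word, then the extracted word.
        String[][] pairs = {
            {"\u0647\u0633\u062A\u0645", "\u0647\u0633\u062A\u0646"}, // final MEEM vs NOON
            {"\u0645\u0646", "\u0647\u0646"},                         // initial MEEM vs HEH
            {"\u0633\u0644\u0627\u0645", "\u0633\u0627\u0644\u0645"}, // LAM and ALEF swapped
        };
        for (String[] p : pairs) {
            System.out.println(diff(p[0], p[1]));
        }
    }
}
```

      Indices are in logical (stored) order, not visual right-to-left order. In the first two pairs a single code point is substituted (U+0645 MEEM is replaced), while in the third pair the code points are the same but U+0644 LAM and U+0627 ALEF are in swapped order.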

        Attachments

        1. PDFBOX1424-persian_test.html
          2 kB
          Andreas Lehmkühler
        2. persian_test.html
          1 kB
          Ali Majdzadeh Kohbanani
        3. persian_test.pdf
          28 kB
          Ali Majdzadeh Kohbanani

            People

            • Assignee:
              lehmi Andreas Lehmkühler
              Reporter:
              majdzadeh Ali Majdzadeh Kohbanani
            • Votes:
              0
              Watchers:
              2