PDFBox / PDFBOX-5487

Extra whitespace when extracting Arabic text

Details

    • Important

    Description

      I am trying to extract text from an Arabic PDF. Some whitespace characters are extracted in the wrong place, splitting words that are unbroken in the original.

      Example:
      Original word: العالمية
      Extracted word: العالمي ة

       

      The PDF is attached; the example word is on the first line.
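
      To make the symptom easier to check across many words, here is a minimal sketch in plain Java. It does not use PDFBox at all: `StraySpaceCheck` and `differsOnlyByWhitespace` are hypothetical names introduced for illustration, and the two strings are the example pair from this report. It only tests whether an extracted word matches the expected word once all whitespace is ignored.

```java
// Hypothetical helper, not part of PDFBox: detects extracted words that
// differ from the expected word only in their whitespace.
class StraySpaceCheck {

    // ASCII whitespace plus Unicode separator characters (e.g. no-break space).
    private static final String WS = "[\\s\\p{Z}]+";

    /**
     * Returns true when the two strings are not equal as given, but become
     * equal once every whitespace/separator character is removed from both.
     */
    static boolean differsOnlyByWhitespace(String expected, String extracted) {
        return !expected.equals(extracted)
                && expected.replaceAll(WS, "").equals(extracted.replaceAll(WS, ""));
    }

    public static void main(String[] args) {
        String original = "العالمية";   // word as it appears in the PDF
        String extracted = "العالمي ة"; // word as reported after extraction
        System.out.println(differsOnlyByWhitespace(original, extracted));
    }
}
```

      A check like this only confirms that the difference is limited to whitespace; it says nothing about why the extractor inserted the space.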

      Attachments

        1. arabtest.pdf
          33 kB
          Tilman Hausherr
        2. Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.pdf
          101 kB
          Fatemeh Elyasi
        3. Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.txt
          8 kB
          Mohamed M NourElDin
        4. Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR (withoutFixes).txt
          8 kB
          Mohamed M NourElDin
        5. meld1.png
          274 kB
          Mohamed M NourElDin
        6. meld2.png
          291 kB
          Mohamed M NourElDin
        7. meld3.png
          260 kB
          Mohamed M NourElDin
        8. PDFBOX-3774-reduced.pdf-sorted-diff.txt
          0.2 kB
          Tilman Hausherr
        9. PDFBOX-5487_ اعلامية.png
          32 kB
          Mohamed M NourElDin
        10. PDFBOX-5487_ وفضلا.png
          28 kB
          Mohamed M NourElDin
        11. PDFBOX-5487-arabic.pdf-sorted-diff.txt
          0.8 kB
          Tilman Hausherr
        12. screenshot-1.png
          11 kB
          Tilman Hausherr

            People

              tilman Tilman Hausherr
              Fatima_E Fatemeh Elyasi
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: