PDFBox / PDFBOX-939

Lost whitespaces when extracting Arabic text

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: Text extraction
    • Labels: None

    Description

      I tried to extract text from an Arabic PDF. The result looks good at first glance, but on closer inspection some of the whitespace is missing compared with text copied and pasted from the same PDF.

      Copy/pasted line from attached PDF:
      بعد ما اكتشف حقيقة المثلث الغامض

      Extracted text:
      بعد ما اكتشف حقيقةالمثلثالغامض
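
      The comparison above can also be checked mechanically. The following is a minimal sketch that uses only the Java standard library; the class WhitespaceDiff and its methods are hypothetical helpers for illustration and are not part of PDFBox. It assumes the reference line (copied from a PDF viewer) and the extracted line are already available as strings, and it reports how many whitespace characters the extraction dropped, provided the remaining letters match.

```java
// Hypothetical helper, not part of PDFBox: compares a reference line
// (copied from a PDF viewer) with the line produced by text extraction.
final class WhitespaceDiff {

    // Counts the characters that Character.isWhitespace reports as whitespace.
    static int countWhitespace(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Removes every whitespace character so that only the letters remain.
    static String stripWhitespace(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isWhitespace(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns the number of whitespace characters missing from the extracted
    // text, or -1 if the two strings differ in more than whitespace.
    static int missingWhitespace(String expected, String extracted) {
        if (!stripWhitespace(expected).equals(stripWhitespace(extracted))) {
            return -1;
        }
        return countWhitespace(expected) - countWhitespace(extracted);
    }

    public static void main(String[] args) {
        // The two lines quoted in this report.
        String expected = "بعد ما اكتشف حقيقة المثلث الغامض";
        String extracted = "بعد ما اكتشف حقيقةالمثلثالغامض";
        System.out.println("Missing whitespace characters: "
                + missingWhitespace(expected, extracted));
    }
}
```

      A check like this could serve as a regression test for the extraction output: a positive result means separators were lost, and -1 means the letters themselves differ, which would point to a different problem.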

      Attachments

        1. test.pdf
          28 kB
          Anton Stremoukhov
        2. extracted.txt
          0.4 kB
          Anton Stremoukhov

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Anton Stremoukhov (delson)