PDFBOX-5868

PDFBox does not extract text in non-Latin scripts (Tamil, Bengali) correctly, but Adobe Reader's "Save as Text" does

Details

    Description

      I downloaded the latest executable JAR of PDFBox (3.0.3) for testing and used the export:text command-line tool to produce the results:

      • multilingual_test.pdf is the original PDF I made to test multilingual text extraction.
      • pdfbox_out.txt is the text file produced by PDFBox.
      • adobe_out.txt is the text file created by Adobe Reader's "Save as Text" feature.

       

      Observation:

      As the attachments show, the text file produced by PDFBox contains unexpected Unicode characters for Tamil and Bengali. For Hindi the characters are extracted but not overlapped, and Japanese seems fine to me. In contrast, the text file produced by Adobe Reader's "Save as Text" feature looks correct, and copying and pasting the text from my document viewer (Evince) also works.
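
One way to make the "unexpected Unicode characters" observation measurable is to count, for each output file, how many code points fall into each Unicode script. The following is only a minimal sketch using the Java standard library: the class name ScriptHistogram and its bucket labels are hypothetical and are not part of PDFBox or Tika. It assumes the output files are UTF-8. A large PRIVATE_USE or REPLACEMENT_CHAR count in pdfbox_out.txt but not in adobe_out.txt would point towards a glyph-to-Unicode mapping problem rather than a problem in the PDF content itself.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox or Tika.
final class ScriptHistogram {

    /** Counts code points per Unicode script, with separate buckets for suspicious code points. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        text.codePoints().forEach(cp -> {
            String key;
            int type = Character.getType(cp);
            if (type == Character.PRIVATE_USE) {
                key = "PRIVATE_USE";            // often a sign of a missing Unicode mapping
            } else if (type == Character.UNASSIGNED) {
                key = "UNASSIGNED";
            } else if (cp == 0xFFFD) {
                key = "REPLACEMENT_CHAR";       // decoding failure marker
            } else {
                key = Character.UnicodeScript.of(cp).name();
            }
            counts.merge(key, 1, Integer::sum);
        });
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java ScriptHistogram <utf8-text-file>...");
            return;
        }
        // Print one histogram per file, e.g. pdfbox_out.txt and adobe_out.txt.
        for (String name : args) {
            String text = new String(Files.readAllBytes(Paths.get(name)), StandardCharsets.UTF_8);
            System.out.println(name + ": " + count(text));
        }
    }
}
```

Comparing the histograms of pdfbox_out.txt and adobe_out.txt side by side shows at a glance which scripts differ between the two extractions.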

      Questions:

      1. Why are the outputs from PDFBox and Adobe Reader different?
      2. What can I do to extract the text from a multilingual PDF correctly?
      3. Is there a way to apply pattern matching to the text in a PDF file and report matches without extracting the text first (for example, if the problem lies with fonts and glyphs)?

      My use case, FYI:

      I am trying to extract text from files and run pattern matching on it, using Apache Tika to parse the documents. I noticed a problem with the text extracted from PDFs (other file types parse fine). I used the executable PDFBox JAR to confirm that the problem is in PDFBox and not in Tika, and tested with Adobe Reader's text export to confirm that the problem is not in the PDF itself. I want to extract this multilingual text only to run pattern matching on it; I do not need to display the content, only to know whether the pattern is present or not (in case the problem lies with fonts and glyphs).
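
For the presence-only matching described above, a minimal sketch in plain Java might look like the following. The class name NormalizedMatcher is hypothetical. The sketch assumes the text has already been extracted, and it only removes differences between composed and decomposed forms of the same characters by applying NFC normalization to both the text and the search term. It cannot repair text in which the extractor produced the wrong code points in the first place.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Tika.
final class NormalizedMatcher {

    /** Returns true if the literal occurs in the text after both are normalized to NFC. */
    static boolean contains(String extractedText, String literal) {
        String haystack = Normalizer.normalize(extractedText, Normalizer.Form.NFC);
        String needle = Normalizer.normalize(literal, Normalizer.Form.NFC);
        // Quote the needle so it is matched literally; the flags make case folding Unicode-aware.
        Pattern pattern = Pattern.compile(
                Pattern.quote(needle),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return pattern.matcher(haystack).find();
    }

    public static void main(String[] args) {
        // Tamil KA followed by the two-part vowel sign O, written decomposed in the text
        // and precomposed in the search term.
        String decomposed = "\u0B95\u0BC6\u0BBE";
        String precomposed = "\u0B95\u0BCA";
        System.out.println(contains(decomposed, precomposed));
    }
}
```

Normalizing both sides keeps the comparison symmetric, so it does not matter which form the extractor or the pattern author happened to use.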

       

      Attachments

        1. adobe_out.txt
          4 kB
          Manish S N
        2. content_diffs_with_exceptions-ActualText.xlsx
          1.62 MB
          Tilman Hausherr
        3. EmptyActualText_poppler.txt
          2 kB
          Manish S N
        4. EmptyActualText_reduced_poppler.txt
          0.0 kB
          Manish S N
        5. image-2024-08-19-10-38-13-472.png
          10 kB
          Manish S N
        6. image-2024-08-30-17-55-41-423.png
          6 kB
          Manish S N
        7. Main.java
          0.7 kB
          Manish S N
        8. multilingual_test.pdf
          85 kB
          Manish S N
        9. okular_out.txt
          4 kB
          Manish S N
        10. page.pdf
          124 kB
          Manish S N
        11. pdfbox_out.txt
          4 kB
          Manish S N
        12. PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf
          315 kB
          Tilman Hausherr
        13. PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf
          9 kB
          Tilman Hausherr
        14. PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf
          123 kB
          Tilman Hausherr
        15. poppler_out.txt
          4 kB
          Manish S N
        16. screenshot-1.png
          73 kB
          Tilman Hausherr
        17. screenshot-2.png
          19 kB
          Tilman Hausherr
        18. suppressDuplicateOverlapping_out.txt
          4 kB
          Manish S N
        19. Tilman's_solution_out.txt
          4 kB
          Manish S N

          People

            tilman Tilman Hausherr
            manish003 Manish S N
