  1. PDFBox
  2. PDFBOX-2252

PDFTextStripper has problem with documents with mixed language directions


    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.8.6, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction
    • Labels:


      When the input document of PDFTextStripper combines right-to-left and left-to-right languages, the output characters of one language are reversed.
      A sample bilingual PDF document is attached.
      PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
      The class counts the number of RTL characters and decides from that whether the whole content should be reversed. This is incorrect: the decision must be made for each word, not for the whole document.
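
The difference between a single majority vote and a per-word decision can be illustrated with a small, JDK-only sketch. It is not the PDFBox code: the class DirectionSketch and its methods are hypothetical names invented for this illustration, the "whole text" variant only imitates the quoted rtlCount > ltrCount comparison, and the per-word variant leaves word order untouched.

```java
import java.util.StringJoiner;

// Hypothetical illustration only; none of these names exist in PDFBox.
class DirectionSketch {

    // True for characters with a strong right-to-left direction (Hebrew, Arabic).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // True for characters with a strong left-to-right direction (Latin, ...).
    static boolean isLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    // Majority vote over a span of text, mirroring the quoted comparison.
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (char c : text.toCharArray()) {
            if (isRtl(c)) {
                rtlCount++;
            } else if (isLtr(c)) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    // Reported behaviour (simplified): one decision for the whole text.
    static String reverseWholeText(String visual) {
        return isRtlDominant(visual)
                ? new StringBuilder(visual).reverse().toString()
                : visual;
    }

    // Suggested behaviour: one decision per space-separated word.
    static String reversePerWord(String visual) {
        StringJoiner out = new StringJoiner(" ");
        for (String word : visual.split(" ")) {
            out.add(isRtlDominant(word)
                    ? new StringBuilder(word).reverse().toString()
                    : word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two Hebrew words stored character-reversed (visual order) around "PDFBox".
        String visual = "\u05DD\u05DC\u05D5\u05E2 PDFBox \u05DD\u05D5\u05DC\u05E9";
        // Whole-text reversal fixes the Hebrew but turns "PDFBox" into "xoBFDP".
        System.out.println(reverseWholeText(visual));
        // Per-word reversal fixes the Hebrew and keeps "PDFBox" intact.
        System.out.println(reversePerWord(visual));
    }
}
```

Because the sample contains eight Hebrew letters and six Latin letters, the majority vote flags it as RTL-dominant and the Latin word is reversed along with everything else. Deciding per word avoids that, although a complete fix would also have to restore the logical order of the words.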


        1. content_diffs.xlsx
          88 kB
          Tim Allison
        2. pdfs_directionality3.xlsx
          25 kB
          Tim Allison
        3. pdfs_directionality.xlsx
          48 kB
          Tim Allison
        4. PDFTextStripper-201709272018.patch
          8 kB
          Maruan Sahyoun
        5. PDFTextStripper-201709271718.patch
          14 kB
          Maruan Sahyoun
        6. bugzilla867751.pdf
          6.92 MB
          Tilman Hausherr
        7. IsMirroredDeviations.txt
          9 kB
          Maruan Sahyoun
        8. BidiMirroring.txt
          24 kB
          Andreas Meier
        9. PDFTextStripper.java.patch
          17 kB
          Andreas Meier
        10. overlap.jpg
          13 kB
          Andreas Meier
        11. wikipedia_dl_lyric_test.pdf
          120 kB
          Andreas Meier
        12. atest.pdf
          21 kB
          Andreas Meier
        13. PDFTextStripper.java.patch
          19 kB
          Andreas Meier
        14. test.pdf
          43 kB

              • Assignee:
                msahyoun Maruan Sahyoun
                amirjadidi Amir
              • Votes:
                0
              • Watchers:
                9


                • Created: