  1. PDFBox
  2. PDFBOX-2252

PDFTextStripper has problem with documents with mixed language directions


    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.8.6, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction
    • Labels:


      When the input document of PDFTextStripper combines right-to-left and left-to-right languages, the output characters of one language are reversed.
      A sample bilingual PDF document is attached.
      PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
      The class counts the number of RTL characters and uses that count to decide whether the whole content should be reversed. This is incorrect: the direction must be determined for each word, not for the whole document.
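The difference between the two strategies can be sketched as follows. This is a minimal illustration, not the PDFBox code or the eventual fix: the class and method names are hypothetical, the input is assumed to be a single line of characters in visual (left-to-right on the page) order, and the per-run variant relies on the JDK's `java.text.Bidi` to find directional runs, which handles only simple cases.

```java
import java.text.Bidi;

// Hypothetical sketch: contrasts one reversal decision per string
// with one reversal decision per directional run.
class MixedDirectionSketch {

    // Strategy described in the report: count strong RTL and LTR characters
    // and reverse the entire string if RTL characters are in the majority.
    static String reverseIfRtlDominant(String visual) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < visual.length(); i++) {
            byte dir = Character.getDirectionality(visual.charAt(i));
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
        }
        boolean isRtlDominant = rtlCount > ltrCount;
        return isRtlDominant ? new StringBuilder(visual).reverse().toString() : visual;
    }

    // Alternative: split the string into directional runs and reverse only
    // the runs with an odd (right-to-left) embedding level.
    // Assumption: a left-to-right base direction for the line.
    static String reverseRtlRuns(String visual) {
        Bidi bidi = new Bidi(visual, Bidi.DIRECTION_LEFT_TO_RIGHT);
        StringBuilder out = new StringBuilder(visual.length());
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String run = visual.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            boolean rtlRun = (bidi.getRunLevel(i) & 1) == 1;
            out.append(rtlRun ? new StringBuilder(run).reverse().toString() : run);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "abc" followed by a Hebrew word stored in visual order.
        String visual = "abc \u05DD\u05D5\u05DC\u05E9";
        System.out.println(reverseIfRtlDominant(visual)); // Latin part is reversed too
        System.out.println(reverseRtlRuns(visual));       // only the Hebrew run is reversed
    }
}
```

With four Hebrew characters against three Latin ones, the whole-string strategy flips everything and turns "abc" into "cba", which matches the symptom reported here. The per-run variant leaves the Latin run untouched and restores only the Hebrew word to logical order.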


      Attachments

        1. atest.pdf (21 kB, Andreas Meier)
        2. BidiMirroring.txt (24 kB, Andreas Meier)
        3. bugzilla867751.pdf (6.92 MB, Tilman Hausherr)
        4. content_diffs.xlsx (88 kB, Tim Allison)
        5. IsMirroredDeviations.txt (9 kB, Maruan Sahyoun)
        6. overlap.jpg (13 kB, Andreas Meier)
        7. pdfs_directionality.xlsx (48 kB, Tim Allison)
        8. pdfs_directionality3.xlsx (25 kB, Tim Allison)
        9. PDFTextStripper.java.patch (17 kB, Andreas Meier)
        10. PDFTextStripper.java.patch (19 kB, Andreas Meier)
        11. PDFTextStripper-201709271718.patch (14 kB, Maruan Sahyoun)
        12. PDFTextStripper-201709272018.patch (8 kB, Maruan Sahyoun)
        13. test.pdf (43 kB)
        14. wikipedia_dl_lyric_test.pdf (120 kB, Andreas Meier)


              • Assignee: Maruan Sahyoun (msahyoun)
              • Reporter: Amir (amirjadidi)
              • Votes: 0
              • Watchers: 9


                • Created: