PDFBox / PDFBOX-2252

PDFTextStripper has problem with documents with mixed language directions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.8.6, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction
    • Labels:

      Description

      When the input document of PDFTextStripper combines right-to-left and left-to-right languages, the output characters of one language are reversed.
      A sample bilingual PDF document is attached.
      PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
      The class evidently counts the RTL characters and uses that count to decide whether the whole content should be reversed. This is incorrect: the decision must be made for each word, not for the whole document.
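
      To illustrate the difference, here is a minimal sketch that contrasts a single whole-text decision with a per-word one. It is not the PDFBox code and not the fix that was committed: the class PerWordDirection and its methods are hypothetical names, only the JDK's Character.getDirectionality is used, and the input is assumed to be a line of characters in visual (left-to-right page) order. It also only reverses characters inside each word and does not reorder the words of an RTL line, which a full solution would need to handle.

```java
// Hypothetical illustration only; not part of PDFBox.
class PerWordDirection {

    /** True if the code point has strong right-to-left directionality (e.g. Hebrew, Arabic). */
    static boolean isStrongRtl(int cp) {
        byte d = Character.getDirectionality(cp);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** True if the code point has strong left-to-right directionality (e.g. Latin). */
    static boolean isStrongLtr(int cp) {
        return Character.getDirectionality(cp) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    /** One decision for the whole text, in the spirit of "rtlCount > ltrCount". */
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isStrongRtl(cp)) {
                rtlCount++;
            } else if (isStrongLtr(cp)) {
                ltrCount++;
            }
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    /** Per-word decision: reverse only the words in which RTL characters dominate. */
    static String reverseRtlWords(String visualLine) {
        String[] words = visualLine.split(" ", -1); // limit -1 keeps empty tokens
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            if (isRtlDominant(word)) {
                // StringBuilder.reverse() keeps surrogate pairs intact.
                out.append(new StringBuilder(word).reverse());
            } else {
                out.append(word);
            }
        }
        return out.toString();
    }
}
```

      For the mixed input "Hello \u05DD\u05D5\u05DC\u05E9" (a Latin word followed by a Hebrew word stored in visual order), isRtlDominant returns false for the whole string, so a single global decision would leave the Hebrew word reversed, while reverseRtlWords restores only that word to logical order and leaves "Hello" untouched.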

        Attachments

        1. wikipedia_dl_lyric_test.pdf
          120 kB
          Andreas Meier
        2. test.pdf
          43 kB
          Amir
        3. PDFTextStripper-201709272018.patch
          8 kB
          Maruan Sahyoun
        4. PDFTextStripper-201709271718.patch
          14 kB
          Maruan Sahyoun
        5. PDFTextStripper.java.patch
          19 kB
          Andreas Meier
        6. PDFTextStripper.java.patch
          17 kB
          Andreas Meier
        7. pdfs_directionality3.xlsx
          25 kB
          Tim Allison
        8. pdfs_directionality.xlsx
          48 kB
          Tim Allison
        9. overlap.jpg
          13 kB
          Andreas Meier
        10. IsMirroredDeviations.txt
          9 kB
          Maruan Sahyoun
        11. content_diffs.xlsx
          88 kB
          Tim Allison
        12. bugzilla867751.pdf
          6.92 MB
          Tilman Hausherr
        13. BidiMirroring.txt
          24 kB
          Andreas Meier
        14. atest.pdf
          21 kB
          Andreas Meier

              People

              • Assignee: Maruan Sahyoun (msahyoun)
              • Reporter: Amir (amirjadidi)
              • Votes: 0
              • Watchers: 9
