PDFBox / PDFBOX-2252

PDFTextStripper has a problem with documents with mixed language directions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.8.6, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction
    • Labels:

      Description

      When the input document of PDFTextStripper combines right-to-left and left-to-right languages, the output characters of one language are reversed.
      A sample bilingual PDF document is attached.
      PDFTextStripper defines a variable "isRtlDominant" in its "writePage" method as follows: boolean isRtlDominant = rtlCount > ltrCount;
      The class counts the RTL and LTR characters and uses the result to decide whether the whole content should be reversed. That is incorrect: the decision must be made for each word, not for the whole document.
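
      To illustrate the difference, here is a minimal sketch of the two strategies. It is a simplified model of the behavior described above, not the PDFBox source: the class DirectionSketch and its methods fixWholeText and fixPerWord are hypothetical names, and the input is assumed to be a single line of text in visual order with words separated by spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it is not part of PDFBox.
final class DirectionSketch {

    /** Returns true when strong RTL characters outnumber strong LTR ones. */
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < text.length(); i++) {
            byte dir = Character.getDirectionality(text.charAt(i));
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            // Spaces, digits and punctuation are neutral or weak and are not counted.
        }
        return rtlCount > ltrCount;
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** One decision for the whole text: the behavior the report describes. */
    static String fixWholeText(String visual) {
        return isRtlDominant(visual) ? reverse(visual) : visual;
    }

    /** One decision per space-separated word: the reporter's suggestion. */
    static String fixPerWord(String visual) {
        List<String> words = new ArrayList<>();
        for (String word : visual.split(" ", -1)) {
            words.add(isRtlDominant(word) ? reverse(word) : word);
        }
        return String.join(" ", words);
    }
}
```

      For a line consisting of "Hello" followed by a four-letter Hebrew word stored in visual order, fixWholeText leaves the line unchanged because the LTR characters are in the majority, so the Hebrew word stays reversed. fixPerWord reverses only the Hebrew word. When the RTL characters are in the majority, fixWholeText reverses the Latin word as well. Note that the per-word sketch does not reorder the words of a longer RTL phrase; a complete solution would presumably follow the Unicode Bidirectional Algorithm, for example via java.text.Bidi.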

        Attachments

        1. test.pdf (43 kB, Amir)
        2. PDFTextStripper.java.patch (19 kB, Andreas Meier)
        3. atest.pdf (21 kB, Andreas Meier)
        4. wikipedia_dl_lyric_test.pdf (120 kB, Andreas Meier)
        5. overlap.jpg (13 kB, Andreas Meier)
        6. PDFTextStripper.java.patch (17 kB, Andreas Meier)
        7. BidiMirroring.txt (24 kB, Andreas Meier)
        8. IsMirroredDeviations.txt (9 kB, Maruan Sahyoun)
        9. bugzilla867751.pdf (6.92 MB, Tilman Hausherr)
        10. PDFTextStripper-201709271718.patch (14 kB, Maruan Sahyoun)
        11. PDFTextStripper-201709272018.patch (8 kB, Maruan Sahyoun)
        12. pdfs_directionality.xlsx (48 kB, Tim Allison)
        13. pdfs_directionality3.xlsx (25 kB, Tim Allison)
        14. content_diffs.xlsx (88 kB, Tim Allison)

              People

              • Assignee: Maruan Sahyoun (msahyoun)
              • Reporter: Amir (amirjadidi)
              • Votes: 0
              • Watchers: 9

                Dates

                • Created:
                  Updated:
                  Resolved: