PDFBox / PDFBOX-2252

PDFTextStripper has problem with documents with mixed language directions

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.8.6, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction

    Description

      When the input document of PDFTextStripper combines right-to-left and left-to-right languages, the output characters of one language are reversed.
      A sample bilingual PDF document is attached.
      PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
      This class evidently counts the number of RTL characters and decides whether the whole content should be reversed or not. This is incorrect: the decision must be made for each word, not for the whole document.
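
      To illustrate the difference, here is a minimal sketch. It is not the actual PDFTextStripper code and not the attached patch. The class DirectionSketch and its methods are hypothetical names, and the input is assumed to be a line of characters in visual (left-to-right on the page) order. The first method mimics the whole-text decision described above; the second decides per word. The per-word version only fixes the character order inside each word and does not reorder the words themselves.

```java
public class DirectionSketch {

    // True for characters with a strong right-to-left direction (e.g. Hebrew, Arabic).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Simplified model of the reported behavior: one decision for the whole text.
    static String wholeTextDecision(String visualText) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < visualText.length(); i++) {
            char c = visualText.charAt(i);
            if (isRtl(c)) {
                rtlCount++;
            } else if (Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
        }
        boolean isRtlDominant = rtlCount > ltrCount;
        return isRtlDominant ? new StringBuilder(visualText).reverse().toString() : visualText;
    }

    // Hypothetical alternative: decide for each space-separated word on its own.
    static String perWordDecision(String visualText) {
        String[] words = visualText.split(" ", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            boolean hasRtl = false;
            for (int j = 0; j < word.length(); j++) {
                if (isRtl(word.charAt(j))) {
                    hasRtl = true;
                    break;
                }
            }
            // Reverse only words that contain RTL characters; leave LTR words untouched.
            out.append(hasRtl ? new StringBuilder(word).reverse().toString() : word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hebrew word in visual order (final mem, vav, lamed, shin) followed by a Latin word.
        String line = "\u05DD\u05D5\u05DC\u05E9 hello";
        System.out.println(wholeTextDecision(line));
        System.out.println(perWordDecision(line));
    }
}
```

      With this input the Latin characters outnumber the Hebrew ones, so the whole-text decision leaves the Hebrew word in reversed order, while the per-word decision restores it. With a shorter Latin word (for example "hi") the whole-text decision reverses everything, including the Latin word.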

      Attachments

        1. test.pdf (43 kB, Amir)
        2. PDFTextStripper.java.patch (19 kB, Andreas Meier)
        3. atest.pdf (21 kB, Andreas Meier)
        4. wikipedia_dl_lyric_test.pdf (120 kB, Andreas Meier)
        5. overlap.jpg (13 kB, Andreas Meier)
        6. PDFTextStripper.java.patch (17 kB, Andreas Meier)
        7. BidiMirroring.txt (24 kB, Andreas Meier)
        8. IsMirroredDeviations.txt (9 kB, Maruan Sahyoun)
        9. bugzilla867751.pdf (6.92 MB, Tilman Hausherr)
        10. PDFTextStripper-201709271718.patch (14 kB, Maruan Sahyoun)
        11. PDFTextStripper-201709272018.patch (8 kB, Maruan Sahyoun)
        12. pdfs_directionality.xlsx (48 kB, Tim Allison)
        13. pdfs_directionality3.xlsx (25 kB, Tim Allison)
        14. content_diffs.xlsx (88 kB, Tim Allison)

            People

              Assignee: Maruan Sahyoun (msahyoun)
              Reporter: Amir (amirjadidi)