PDFBox / PDFBOX-1155

setSuppressDuplicateOverlappingText sometimes removes characters that it shouldn't

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.8.7, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction
    • Labels: None

    Description

      The duplicate detection (in PDFTextStripper.java) checks whether the
      same character was placed "nearish" to where we are about to place
      another and de-dups it if so; this is to catch documents that rewind
      and overwrite in order to bold word(s).

      But in some cases I see it removing valid characters that were not
      duplicates.
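
To make the failure mode concrete, here is a minimal, self-contained sketch of a tolerance-based duplicate check. It is not the code in PDFTextStripper.java: the class name `DedupSketch`, the `accept` method, the fixed tolerance of 3.0 units, and the coordinates are all assumptions made for illustration. It only shows how a "nearish" test could, in principle, drop a legitimate repeated character when two identical narrow glyphs sit closer together than the tolerance. Whether that is the actual cause for the attached PDF is not established here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of tolerance-based duplicate suppression; NOT PDFBox's actual code. */
public class DedupSketch {
    // Assumed fixed tolerance, in text-space units.
    private final float tolerance;
    // For each character, the positions at which it has already been kept.
    private final Map<String, List<float[]>> placed = new HashMap<>();

    public DedupSketch(float tolerance) {
        this.tolerance = tolerance;
    }

    /** Returns true if the glyph is kept, false if it is suppressed as a duplicate. */
    public boolean accept(String ch, float x, float y) {
        List<float[]> seen = placed.computeIfAbsent(ch, k -> new ArrayList<>());
        for (float[] p : seen) {
            // Same character already placed within the tolerance box: treat as overstrike.
            if (Math.abs(p[0] - x) < tolerance && Math.abs(p[1] - y) < tolerance) {
                return false;
            }
        }
        seen.add(new float[] {x, y});
        return true;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch(3.0f);
        // First "l" is kept.
        System.out.println(dedup.accept("l", 100.0f, 700.0f)); // true
        // Same "l" re-drawn with a tiny offset to fake bold: correctly suppressed.
        System.out.println(dedup.accept("l", 100.3f, 700.0f)); // false
        // A genuine second "l" (as in "ll") only 2.2 units away: wrongly suppressed.
        System.out.println(dedup.accept("l", 102.2f, 700.0f)); // false
    }
}
```

In this sketch the third call is the problem case: the character is valid, but it falls inside the tolerance box of an earlier identical character. One possible mitigation, again only as an assumption, would be to scale the tolerance by the glyph width rather than using a fixed value.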

      Attachments

        1. 000527.pdf
          61 kB
          Michael McCandless
        2. dedup.diffs.txt
          7 kB
          Michael McCandless

          People

            Assignee: lehmi Andreas Lehmkühler
            Reporter: mikemccand Michael McCandless
