PDFBox / PDFBOX-4141

Suppress control characters?

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Feedback Received
    • Affects Version/s: 2.0.8
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      At the moment PDFBox extracts every type of character.
      As a result, any control characters that occur in a document are extracted as well.

      Unfortunately, some of these control characters can garble the printed text,
      for example MESSAGE WAITING (U+0095, abbreviated MW).

      I have attached some files and a screenshot showing how the text is printed when MESSAGE WAITING is present.

      Should PDFBox handle this type of character? Maybe suppress them in PDFTextStripper?

      I know that PDFBox works correctly in this case. Still, an option to turn off or suppress special characters might produce better output than the default setting, unless some control characters are needed for further processing.

      Feedback appreciated.

      What other programs do:
      a) ignore control characters (Okular PDF Viewer - KDE)
      b) replace them (Adobe Reader wrote a dot "." in place of MW)
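
      The two behaviours listed above (ignoring control characters, or replacing them with a dot) could be sketched as a post-processing step on the extracted text. This is only an illustration, not an existing PDFBox feature: the class ControlCharFilter and its methods are hypothetical names, and the sketch assumes that tab, line feed and carriage return should be kept because they carry layout. The input string could, for instance, be the result of PDFTextStripper.getText(PDDocument).

```java
/**
 * Hypothetical helper (not part of PDFBox) that post-processes extracted text.
 * It treats every ISO control character (C0 and C1 ranges, which include
 * U+0095 MESSAGE WAITING) as unwanted, except tab, line feed and carriage return.
 */
final class ControlCharFilter {

    private ControlCharFilter() {
        // static helpers only
    }

    /** Variant a): drop unwanted control characters entirely. */
    static String suppress(String text) {
        return replace(text, "");
    }

    /** Variant b): substitute each unwanted control character, e.g. with ".". */
    static String replace(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isUnwantedControl(cp)) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    /** Assumption: tab, LF and CR are layout whitespace and must survive. */
    static boolean isUnwantedControl(int codePoint) {
        boolean layoutWhitespace =
                codePoint == '\t' || codePoint == '\n' || codePoint == '\r';
        return Character.isISOControl(codePoint) && !layoutWhitespace;
    }
}
```

      With this sketch, ControlCharFilter.suppress(text) corresponds to behaviour a) and ControlCharFilter.replace(text, ".") to behaviour b). Whether filtering after extraction is acceptable depends on whether any downstream processing relies on the control characters.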

      Regards

      Andreas

      Attachments

        1. Test_without_MW.txt
          0.1 kB
          Andreas Meier
        2. Test_with_MW.txt
          0.1 kB
          Andreas Meier
        3. Test_with_MW.pdf
          13 kB
          Andreas Meier
        4. Test_with_MW_linux.jpg
          10 kB
          Andreas Meier
        5. Test_with_MW_AdobeReader_export.txt
          0.1 kB
          Andreas Meier
        6. Mapping_default_to_adobe.csv
          0.7 kB
          Andreas Meier
        7. 000016.pdf
          147 kB
          Tilman Hausherr

            People

              Assignee: Unassigned
              Reporter: Andreas Meier
              Votes: 0
              Watchers: 3
