Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Feedback Received
    • Affects Version/s: 2.0.8
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

      Description

      At the moment, PDFBox extracts every character it encounters,
      so any control characters in the text are extracted as well.

      Unfortunately, some of these control characters can distort the extracted text.
      One example is 'MESSAGE WAITING' (U+0095) [MW].

      I have attached some files and a screenshot showing how the text is printed when MESSAGE WAITING is present.

      Should PDFBox handle this kind of character? Perhaps PDFTextStripper could suppress them?

      I know that PDFBox works correctly in this case. Still, an option to turn off or suppress special characters might produce better output than the default setting, unless some control characters are needed for further processing.

      Feedback appreciated.

      What other programs do:
      a) ignore control characters (Okular, the KDE PDF viewer)
      b) replace them (Adobe Reader wrote a dot "." in place of MW)
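
Both behaviours listed above can be approximated as a post-processing step on the already extracted string. The following is only a minimal sketch, not an existing PDFBox option: the class name ControlCharFilter and its methods are hypothetical, and it assumes that tab, line feed and carriage return should be kept because the extracted text relies on them for layout.

```java
// Hypothetical helper, not part of PDFBox: filters ISO control characters
// (U+0000..U+001F and U+007F..U+009F) out of already extracted text.
final class ControlCharFilter {

    private ControlCharFilter() {
    }

    // Assumption: tab, line feed and carriage return carry layout
    // information and must survive; every other control character is unwanted.
    static boolean isUnwantedControl(int codePoint) {
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return false;
        }
        return Character.isISOControl(codePoint);
    }

    // Replaces each unwanted control character with the given replacement.
    // An empty replacement drops the character (behaviour a),
    // a replacement of "." substitutes a dot (behaviour b).
    static String filter(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isUnwantedControl(cp)) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

A caller could apply it to the result of PDFTextStripper.getText(document), for example ControlCharFilter.filter(text, "") to drop MW or ControlCharFilter.filter(text, ".") to substitute a dot. Filtering after extraction leaves the default PDFBox behaviour untouched for anyone who needs the control characters for further processing.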

      Regards

      Andreas

        Attachments

        1. 000016.pdf
          147 kB
          Tilman Hausherr
        2. Mapping_default_to_adobe.csv
          0.7 kB
          Andreas Meier
        3. Test_with_MW_AdobeReader_export.txt
          0.1 kB
          Andreas Meier
        4. Test_with_MW_linux.jpg
          10 kB
          Andreas Meier
        5. Test_with_MW.pdf
          13 kB
          Andreas Meier
        6. Test_with_MW.txt
          0.1 kB
          Andreas Meier
        7. Test_without_MW.txt
          0.1 kB
          Andreas Meier

              People

              • Assignee: Unassigned
              • Reporter: Andreas Meier
              • Votes: 0
              • Watchers: 3