Resolution: Feedback Received
Affects Version/s: 2.0.8
Fix Version/s: None
Component/s: Text extraction
At the moment, PDFBox extracts every type of character.
As a result, any control characters that occur in the text are extracted as well.
Unfortunately, some of these control characters can distort the extracted text.
One example is 'MESSAGE WAITING' (U+0095) [MW].
I have attached some files and a screenshot showing how the text is printed when MESSAGE WAITING is present.
Should PDFBox handle this type of character? Maybe suppress it in PDFTextStripper?
I know that PDFBox behaves correctly in this case. Still, an option to turn off or suppress special characters might produce better output than the default setting, unless some control characters are needed for further processing.
What other programs do:
a) ignore control characters (Okular PDF Viewer - KDE)
b) replace them (Adobe Reader wrote a dot "." in place of MW)
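
As a rough illustration of options a) and b), the following is a minimal sketch in plain Java. It is not part of PDFBox; the class name ControlCharFilter and its methods strip and replace are hypothetical. It assumes that tab, line feed and carriage return should be kept, and that every other ISO control character (including the C1 range, which contains U+0095) is unwanted. Such a filter could presumably be applied to the string returned by PDFTextStripper.getText(), or inside a subclass that overrides writeString().

```java
/**
 * Hypothetical helper, not part of PDFBox: removes or replaces control
 * characters in already extracted text.
 */
final class ControlCharFilter {

    private ControlCharFilter() {
        // static helpers only
    }

    /**
     * Assumption: tab, line feed and carriage return carry layout and are kept;
     * all other ISO control characters (C0 and C1, e.g. U+0095) are unwanted.
     */
    static boolean isUnwantedControl(int codePoint) {
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return false;
        }
        return Character.isISOControl(codePoint);
    }

    /** Option a): drop unwanted control characters entirely. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!isUnwantedControl(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** Option b): substitute each unwanted control character with a placeholder. */
    static String replace(String text, char replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            out.appendCodePoint(isUnwantedControl(cp) ? replacement : cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String extracted = "MESSAGE\u0095WAITING";
        System.out.println(strip(extracted));        // MESSAGEWAITING
        System.out.println(replace(extracted, '.')); // MESSAGE.WAITING
    }
}
```

Whether such filtering should be the default or an opt-in switch is the open question of this issue; the sketch only shows that both behaviours are cheap to implement on top of the extracted text.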