Description
ExtractText (and perhaps also PDFTextStripper) should default to UTF-8, which is what most people expect. Two long-standing open issues, PDFBOX-755 and PDFBOX-970, stem from not having a good default.
I've escalated this to a bug; see the first comment.
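To illustrate what an explicit UTF-8 default would mean in practice, here is a minimal sketch using only the Java standard library. It is not the actual ExtractText code: the class `Utf8OutputSketch` and the helper `writeUtf8` are hypothetical names, and the input string stands in for text that has already been extracted. The point is that the charset is passed explicitly to the `OutputStreamWriter`, so the output does not depend on the platform default encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8OutputSketch {
    // Hypothetical helper: writes already-extracted text to the stream as UTF-8,
    // regardless of the platform default charset.
    static void writeUtf8(String text, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        // Flush rather than close, so the caller keeps ownership of the stream.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Sample text with non-ASCII characters, written as Unicode escapes.
        String sample = "Gr\u00fc\u00dfe, \u65e5\u672c\u8a9e";
        writeUtf8(sample, buffer);

        // Decode the bytes as UTF-8 again and compare with the original text.
        String roundTrip = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(sample));
    }
}
```

A `Writer` built this way could presumably be handed to `PDFTextStripper.writeText(PDDocument, Writer)`, which would keep the encoding decision in one place.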
Issue Links
- is related to PDFBOX-3281 HTML output wrongly specifies UTF-16 in header (Closed)