PDFBox / PDFBOX-3281

HTML output wrongly specifies UTF-16 in header


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 3.0.0 PDFBox
    • Fix Version/s: 2.0.1, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      OS X 10.11.4, Java 1.8.0_73-b02

      Description

      When running the command-line ExtractText tool with the -html flag, the output file always contains the following meta tag, which specifies UTF-16 regardless of the actual output encoding:

      <meta http-equiv="Content-Type" content="text/html; charset="UTF-16">
      

      As a result, editors that respect the meta tag (Emacs, etc.) decode the file with the wrong encoding and garble its content.
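
      To illustrate the mismatch described above, here is a minimal sketch in plain Java that derives the meta tag from the charset actually used to write the file, rather than from a fixed "UTF-16" string. It is not PDFBox code and not the fix that was committed; the class MetaCharsetSketch and the helpers metaTag and render are hypothetical names introduced only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: none of these names exist in PDFBox.
class MetaCharsetSketch {

    // Hypothetical helper: builds the meta tag from the charset that is
    // actually used for output, so the declaration cannot drift from it.
    static String metaTag(Charset outputCharset) {
        return "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + outputCharset.name() + "\">";
    }

    // Hypothetical helper: wraps the body in a minimal HTML document and
    // encodes it with the same charset that the meta tag declares.
    static byte[] render(String body, Charset outputCharset) {
        String html = "<html><head>" + metaTag(outputCharset) + "</head><body>"
                + body + "</body></html>";
        return html.getBytes(outputCharset);
    }

    public static void main(String[] args) {
        Charset charset = StandardCharsets.UTF_8;
        byte[] bytes = render("caf\u00e9", charset);
        // Decoding with the declared charset recovers the original text.
        System.out.println(new String(bytes, charset));
    }
}
```

      Passing a single Charset value to both the header and the encoder is the design point here: a reader that trusts the meta tag then decodes the bytes with the encoding they were written in.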

        Attachments

        1. testdoc.pdf
          8 kB
          Aaron Madlon-Kay
        2. testdoc.html
          0.3 kB
          Aaron Madlon-Kay


            People

            • Assignee: lehmi Andreas Lehmkühler
            • Reporter: amake Aaron Madlon-Kay
