PDFBox / PDFBOX-3281

HTML output wrongly specifies UTF-16 in header

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 3.0.0 PDFBox
    • Fix Version/s: 2.0.1, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Labels: None
    • Environment: OS X 10.11.4, Java 1.8.0_73-b02

    Description

      When running the command line ExtractText with the -html flag, the output file always has the following meta tag specifying UTF-16 regardless of the actual output encoding:

      <meta http-equiv="Content-Type" content="text/html; charset="UTF-16">
      

      Editors that respect the meta tag (Emacs, etc.) then decode the file with the wrong encoding and garble its content.
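
      The mismatch described above can be illustrated with a minimal Java sketch. It is not the PDFBox code or the fix that was applied; the class MetaCharsetSketch and the helpers metaTag and misdecode are hypothetical names introduced only for illustration. The sketch assumes the declared charset should be derived from the charset actually used to write the file, and it shows what happens when UTF-8 bytes are decoded as UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MetaCharsetSketch {

    // Hypothetical helper (not a PDFBox API): build the Content-Type meta tag
    // from the charset that is actually used to write the output file,
    // instead of hard-coding UTF-16.
    static String metaTag(Charset outputCharset) {
        return "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + outputCharset.name() + "\">";
    }

    // Hypothetical helper: decode text with a different charset than the one
    // it was encoded with, which is what an editor does when it trusts a
    // wrong meta tag.
    static String misdecode(String text, Charset actual, Charset declared) {
        return new String(text.getBytes(actual), declared);
    }

    public static void main(String[] args) {
        // The tag now follows the real output encoding.
        System.out.println(metaTag(StandardCharsets.UTF_8));

        // UTF-8 bytes read as UTF-16 no longer match the original text.
        String garbled = misdecode("abcd", StandardCharsets.UTF_8, StandardCharsets.UTF_16);
        System.out.println("round-trips correctly: " + garbled.equals("abcd"));
    }
}
```

      Deriving the tag from the writer's charset keeps the declaration and the bytes in agreement whatever encoding is selected for the output.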

      Attachments

        1. testdoc.html
          0.3 kB
          Aaron Madlon-Kay
        2. testdoc.pdf
          8 kB
          Aaron Madlon-Kay

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Aaron Madlon-Kay (amake)
              Votes: 0
              Watchers: 4