PDFBox / PDFBOX-3281

HTML output wrongly specifies UTF-16 in header

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 3.0.0 PDFBox
    • Fix Version/s: 2.0.1, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      OS X 10.11.4, Java 1.8.0_73-b02

      Description

      When running the command-line ExtractText tool with the -html flag, the output file always contains the following meta tag, which specifies UTF-16 regardless of the actual output encoding:

      <meta http-equiv="Content-Type" content="text/html; charset="UTF-16">
      

      This causes editors that respect the meta tag (Emacs, etc.) to garble the file content.
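
      To illustrate the mismatch described above, here is a minimal Java sketch in which the declared charset is derived from the same Charset object that encodes the bytes, so the two cannot disagree. This is not the actual PDFBox code or the committed fix; the class MetaCharsetSketch and the helpers metaTag and writeHtml are hypothetical names used only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MetaCharsetSketch {

    // Hypothetical helper (not PDFBox API): builds the meta tag from the
    // charset that is actually used for output instead of a fixed "UTF-16".
    static String metaTag(Charset charset) {
        return "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + charset.name() + "\">";
    }

    // Hypothetical helper: encodes a small HTML page with the given charset
    // and declares that same charset in the header.
    static byte[] writeHtml(String body, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write("<html><head>" + metaTag(charset) + "</head><body>"
                    + body + "</body></html>");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = writeHtml("caf\u00e9", StandardCharsets.UTF_8);
        // Decoding with the declared charset round-trips the content.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

      Passing a different Charset (for example StandardCharsets.ISO_8859_1) changes both the byte encoding and the declared value together, which is the property the reported output lacked.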

        Attachments

        1. testdoc.html
          0.3 kB
          Aaron Madlon-Kay
        2. testdoc.pdf
          8 kB
          Aaron Madlon-Kay

              People

              • Assignee:
                lehmi Andreas Lehmkühler
                Reporter:
                amake Aaron Madlon-Kay
              • Votes:
                0
                Watchers:
                4
