[PDFBOX-3281] HTML output wrongly specifies UTF-16 in header - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0, 3.0.0 PDFBox
Fix Version/s: 2.0.1, 3.0.0 PDFBox
Component/s: Text extraction
Labels: None
Environment: OS X 10.11.4, Java 1.8.0_73-b02

Description

When running the command-line ExtractText tool with the -html flag, the output file always contains the following meta tag, which specifies UTF-16 regardless of the actual output encoding:

<meta http-equiv="Content-Type" content="text/html; charset="UTF-16">

This causes editors that respect the meta tag (Emacs, for example) to garble the file content.
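To illustrate the kind of fix the report implies, here is a minimal sketch that derives the meta tag from the charset actually used for output instead of hard-coding UTF-16. The class `MetaCharsetSketch` and the helper `metaTag` are hypothetical names chosen for this example; they are not PDFBox API and do not claim to reproduce the change that was committed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MetaCharsetSketch {

    // Hypothetical helper (not part of PDFBox): builds the Content-Type
    // meta tag from the charset the output writer really uses, so the
    // declared encoding and the actual encoding cannot drift apart.
    static String metaTag(Charset outputCharset) {
        return "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + outputCharset.name() + "\">";
    }

    public static void main(String[] args) {
        // Assumption for the example: the HTML file is written as UTF-8.
        System.out.println(metaTag(StandardCharsets.UTF_8));
    }
}
```

Note that the sketch also emits a single pair of quotes around the `content` attribute, whereas the tag quoted above contains an extra quote before `UTF-16`.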

Attachments

testdoc.html, 22/Mar/16 04:37, 0.3 kB, Aaron Madlon-Kay
testdoc.pdf, 22/Mar/16 04:37, 8 kB, Aaron Madlon-Kay

Issue Links

relates to: PDFBOX-2384 ExtractText should default to UTF-8 (Closed)

People

Assignee: Andreas Lehmkühler

Reporter: Aaron Madlon-Kay

Votes: 0

Watchers: 4

Dates

Created: 22/Mar/16 04:36

Updated: 25/Mar/17 18:12

Resolved: 20/Apr/16 15:59