[ODFTOOLKIT-400] Unable to obtain the charset encoding of an odt document - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: odfdom
Labels:
None
Environment:
linux - ubuntu 14.04

Description

I'm trying to convert ODT to HTML. During the conversion I want to obtain the charset encoding of the ODT document so that I can set the appropriate value on the HTML side. However, I always get a 'null' value when reading the charset.

        // 'is' is an InputStream over the .odt file
        OdfTextDocument odfDoc = OdfTextDocument.loadDocument(is);
        System.out.println(odfDoc.getContentDom().getXmlEncoding());

For the attached test document I expect to get UTF-8 but always see 'null'. This happens with other documents as well.

Is there a better way to obtain the charset encoding of an odt document?

Attachments


- 400-part1-pom_xml-FromJava1_5To1_6ForStAX.patch (09/Aug/15 13:53, 0.5 kB, Nimarukan)
- 400-part2-test-OdfFileDom_xmlDeclTest.patch (09/Aug/15 13:53, 6 kB, Nimarukan)
- 400-part3-main-OdfFileDom_initXmlDecl.patch (09/Aug/15 13:53, 4 kB, Nimarukan)
- testOdt.odt (29/Jul/15 01:44, 53 kB, Joshua)

Activity


Nimarukan added a comment - 09/Aug/15 13:53 - edited

Diagnosis: No XML declaration fields of the DOM document are currently set because the file is parsed with a SAX parser, and SAX does not reveal the XML declaration to SAX handlers (org.xml.sax).

Approach: Parse the beginning bytes with a StAX parser (javax.xml.stream), which is included in Java 6 and later.
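As a rough illustration of this approach (not the attached patch), the sketch below reads the XML declaration with the standard javax.xml.stream API. The class name XmlDeclarationProbe and the method readDeclaration are hypothetical names chosen for this example; in a real .odt the input would be the content.xml stream from the package rather than an in-memory string.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of ODFDOM.
class XmlDeclarationProbe {

    /**
     * Returns {version, declared encoding, standalone} taken from the XML
     * declaration. An entry is null when the declaration does not state it.
     */
    static String[] readDeclaration(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // A freshly created reader is positioned at START_DOCUMENT,
            // so the declaration values are available without calling next().
            String version = reader.getVersion();
            String encoding = reader.getCharacterEncodingScheme();
            String standalone = reader.standaloneSet()
                    ? String.valueOf(reader.isStandalone())
                    : null;
            return new String[] {version, encoding, standalone};
        } finally {
            // Closing the reader does not close the underlying stream.
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        String[] decl = readDeclaration(new ByteArrayInputStream(xml));
        System.out.println("version=" + decl[0] + ", encoding=" + decl[1]
                + ", standalone=" + decl[2]);
    }
}
```

Note that getCharacterEncodingScheme() reports the encoding named in the declaration, while getEncoding() reports the input encoding if the parser knows it; the two can differ.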

Attached are odfdom patches for

- the pom.xml Java version change,
- the test case, and
- the fix.

Details

POM: The source and target JDK versions are increased from JDK 1.5 to JDK 1.6 so that StAX (javax.xml.stream) will be available. The test case also uses java.nio.charset.Charset from Java 6.

Test: The test case includes tests for the XML declaration fields: xmlVersion, xmlEncoding, and xmlStandalone.

Fix: Change OdfFileDom.initialize() to use a StAX parser to read the XML declaration, and initialize the XML declaration fields.

(The XML declaration is parsed during initialization and not later because after the DOM is created, bytes are generated from the DOM, not the original file. For low overhead, the same internal-document byte array is used for both the StAX parser and SAX parser input streams. The StAX parser is closed immediately after the XML declaration fields are extracted and it does not read the rest of the stream.)
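The two-pass idea described above might look roughly like the following sketch, assuming the document bytes are already in memory. This is not the code from the attached patch: the class TwoPassParse and its fields are hypothetical, and the SAX handler here only records the root element instead of building a DOM.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; OdfFileDom itself is not reproduced here.
class TwoPassParse {
    String xmlVersion;
    String xmlEncoding;
    boolean xmlStandalone;
    String rootLocalName;

    static TwoPassParse parse(byte[] bytes) throws Exception {
        final TwoPassParse result = new TwoPassParse();

        // Pass 1: StAX reads only the XML declaration, then is closed
        // without consuming the rest of the document.
        XMLStreamReader stax = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(bytes));
        try {
            result.xmlVersion = stax.getVersion();
            result.xmlEncoding = stax.getCharacterEncodingScheme();
            result.xmlStandalone = stax.isStandalone();
        } finally {
            stax.close();
        }

        // Pass 2: SAX parses the whole document from a second stream
        // over the same byte array, so no bytes are copied.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new ByteArrayInputStream(bytes), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                if (result.rootLocalName == null) {
                    result.rootLocalName = localName;
                }
            }
        });
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<office:document-content"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\"/>")
                .getBytes(StandardCharsets.UTF_8);
        TwoPassParse r = parse(xml);
        System.out.println(r.xmlVersion + " " + r.xmlEncoding + " " + r.rootLocalName);
    }
}
```

One design point worth noting: org.w3c.dom.Document offers setXmlVersion and setXmlStandalone but no setter for the encoding, so a DOM implementation that wants getXmlEncoding() to return a value has to store it itself.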

Apply each patch with:

patch -p 1 -i 400-partN-xxx.patch

(note: OdfFileDom.java currently has a mix of '\n' and '\r\n' line terminators.)


Michael Stahl added a comment - 14/Sep/15 10:16

it's a bit of a mystery to me what the point of this is.

the whole point of using an XML parser is so that you don't have to care about things like what charset and encoding the source document uses - it's all abstracted away and you only have to deal with nice and uniform Unicode text in your application.

specifically, the Java XML parsers all create java.lang.Strings which are always UTF-16 encoded Unicode.

so if you want to export UTF-8 encoded HTML, just do it by encoding the strings at the point when you write them into the generated file.
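A minimal sketch of that suggestion, assuming the text has already been extracted from the document as a java.lang.String: the output encoding is chosen when the writer is created, independently of how the source file was encoded. The class HtmlUtf8Writer and its methods are hypothetical names for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of encoding at write time.
class HtmlUtf8Writer {

    /** Writes a tiny HTML page as UTF-8 bytes and declares that charset in the page. */
    static void writeHtml(String bodyText, OutputStream out) throws IOException {
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write("<!DOCTYPE html>\n<html><head><meta charset=\"UTF-8\"></head><body><p>");
            writer.write(escape(bodyText));
            writer.write("</p></body></html>\n");
        }
    }

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHtml("Gr\u00fc\u00dfe & more", out);
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

With this approach the meta charset value is whatever the writer uses, so the encoding of the original .odt never needs to be queried.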

