Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing

| Field | Value |
|---|---|
| Summary | com.sun.xml.parser.SAXParserFactoryImpl does not support language codes with more than two characters |
| Product | utilities |
| Component | code |
| Status | ACCEPTED --- |
| Severity | Trivial |
| Priority | P3 |
| Version | OOo 1.1.1 |
| Target Milestone | OOo Later |
| Hardware | All |
| OS | All |
| Issue Type | DEFECT |
| Developer Difficulty | --- |
| Reporter | davidfraser <davidf> |
| Assignee | AOO issues mailing list <issues> |
| QA Contact | issues@tools <issues> |
| CC | andreas.bille, gerry, hans-joachim.lankenau, issues, joerg.barfurth, lars.oppermann, ooo, pavel |
| Keywords | needhelp |
| Latest Confirmation in | --- |
| Issue Depends on | 30380 |
| Issue Blocks | |
| Attachments | |

Description
davidfraser
2004-04-19 11:20:10 UTC
Make error log from Prof. Dr. Eduard Werner / Edward Wornar. See http://l10n.openoffice.org/servlets/BrowseList?list=dev&by=thread&from=178070

-------------+
validating and creating a locale independent file
mkdir -p ../../unxlngi4.pro/misc/registry/data/org/openoffice/Office/
java -classpath /home/edi/projekty/oo_1.1.1_src/solver/645/unxlngi4.pro/bin/jaxp.jar:/home/edi/projekty/oo_1.1.1_src/solver/645/unxlngi4.pro/bin/parser.jar:../../unxlngi4.pro/class/cfgimport.jar -Djavax.xml.parsers.SAXParserFactory=com.sun.xml.parser.SAXParserFactoryImpl org.openoffice.configuration.Inspector org/openoffice/Office/Common.xcu
** Start validating: file:/home/edi/projekty/oo_1.1.1_src/officecfg/registry/data/org/openoffice/Office/Common.xcu
** Parsing error, line 86, uri file:/home/edi/projekty/oo_1.1.1_src/officecfg/registry/data/org/openoffice/Office/Common.xcu
Illegal xml:lang value "hsb".
dmake: Error code 1, while making '../../unxlngi4.pro/misc/registry/data/org/openoffice/Office/Common.xcu'
---* TG_SLO.MK *---
ERROR: Error 65280 occurred while making /home/edi/projekty/oo_1.1.1_src/officecfg/registry/data
dmake: Error code 1, while making 'build_all'

Note: other related issues were reported in http://www.openoffice.org/issues/show_bug.cgi?id=19335 but have been fixed.

Added myself to the cc list, trying to stay informed...

What makes you think that this doesn't work for any ISO639-2 code? I suspect the problem is elsewhere (although it may amount to the same effect). The xml:lang value must be a valid locale. I suppose a Java parser would use class java.util.Locale to check this. IOW: this only supports languages that are known to the Java being used. A cursory glance at the Java API documentation raises doubts as to how well ISO639-2 languages are supported. It appears that Java 1.5 might bring improvements here (at least their reference document now lists many languages that only have ISO639-2 codes). Nevertheless the documentation still talks a lot about two-letter codes.
For more details or more authoritative information, you should start looking from java.sun.com. To check whether a particular version of Java knows your locale, you could look through the list of locales returned by java.util.Locale.getAvailableLocales(). BTW: Does your system support the hsb locale?

As a workaround, you could try switching to xsltproc-based processing in officecfg (see the .IF $(SOLAR_JAVA) blocks in officecfg/util/makefile.pmk). If that resolves your problem, I can check whether this can be made the standard processing.

(Setting to confirmed, even though I haven't verified this myself)

The reason I suspect it doesn't work for any ISO-639-2 language code is that it does work for *any* two-character language code, so it seems like it's only checking the length, not the validity of the code. I'm trying to investigate, but it seems quite difficult to get hold of the jaxp source code.

Joerg, Does your comment imply that we're not able to properly support locales that a particular Java version used during build time doesn't know about? This is ridiculous. How can we circumvent this problem, giving at least the build a chance to not bail out, and what would be the consequences if the configuration entries in question weren't localized? Eike

Reassigned.

jb: the java.util.Locale class supports any value for a locale string; see http://java.sun.com/j2se/1.4.2/docs/api/java/util/Locale.html "Because a Locale object is just an identifier for a region, no validity check is performed when you construct a Locale." I have verified this with a test program ("en", "de", "xy", "nso", "hsb", "xyz" are all accepted). So it is pretty clear this is an xml-specific problem.

Eike, Generally we cannot easily support locales that XML processing tools (particularly validating parsers) used during build time don't recognize as valid. How locale names are validated probably depends on the parser used.
For Java parsers it seems likely that this is indeed determined by the Java implementation being used for the build. For other parsers it might be the OS being used to build or the parser version itself. None of this is confirmed. David is trying to find out more about how a Java parser (or our particular parser) behaves. For any replacement we would have to find out the behavior again before we can move to it.

Recently some people have added support for building officecfg without Java. Sadly no one contacted me, so there was no evaluation of whether it would be possible to use that solution (xsltproc) for all builds (I am not even sure on what platforms and with which additional requirements their solution works). If we can do that, I would gladly do it. But then we need to check first how xsltproc behaves wrt locale validation. The problem of course is that, if only one platform still requires the old parser, we can't merge the localized strings into the source, because it would break that platform for all locales.

In the given case, not localizing the data would mean that menu items under 'File - New' and 'File - Auto Pilot', as well as some other UI strings (e.g. names of OLE objects for our own document types), would not be localized. Joerg

I refuse to be the owner of this. I don't even know which parser is used for what reason and who introduced it or why it doesn't accept any language string conforming to RFC3066 there. Reassigning to Joerg.

I hate those radio buttons..

I will look into this. I can use any help with identifying why/where exactly this fails. Information about the behavior of other xml processors is also appreciated. Thanks Joerg

I have been investigating the details. I have narrowed this down to a minimal test case .xcu file and Java class (based on Inspector.java) that uses JAXP to parse it and produces the error, and will attach them.

Created attachment 14656 [details]
Test-case java code to generate JAXP error on 3-char xml:lang value
Created attachment 14657 [details]
Test-case xml file to generate JAXP error on 3-char xml:lang value
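
For readers without access to the attachments, here is a minimal sketch of what such a JAXP check could look like. It is not the attached test case: the class name `XmlLangCheck` and the method `tryLang` are hypothetical, and it uses whatever SAX parser the running JDK provides by default rather than the `com.sun.xml.parser.SAXParserFactoryImpl` selected in the build. The parser is left non-validating, so only well-formedness errors are reported.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not the attached test case.
class XmlLangCheck {

    /**
     * Parses a one-element document carrying the given xml:lang value with the
     * JDK's default (non-validating) SAX parser.
     * Returns "ok" if the parse completes, otherwise "error: " plus the parser's message.
     */
    static String tryLang(String lang) {
        String xml = "<test xml:lang=\"" + lang + "\"> </test>";
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler());
            return "ok";
        } catch (Exception e) {
            // Fatal (well-formedness) errors surface here as SAXParseException.
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Two-letter and three-letter codes mentioned in this issue.
        for (String lang : new String[] {"en", "xy", "hsb", "xyz"}) {
            System.out.println(lang + " -> " + tryLang(lang));
        }
    }
}
```

Running this against different Java versions (or with a different parser on the classpath) shows whether that particular parser rejects three-letter xml:lang values; the result depends entirely on the parser in use.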
According to http://www.w3.org/International/O-HTML-tags.html 3-character codes should be valid in XML. Also note that the parser doesn't fail for a random two-character code like "xy", so it is not necessarily doing any real validation of the locale.

It seems that it is using the JAXP implementation included in external/common/jaxp.jar. This seems to be specification version 1.0.0; it seems really difficult to get hold of the source code, and there are lots of versions floating around... Inspector.java gets run by dmake with a classpath that includes it, and a -Djavax.xml.parsers.SAXParserFactory=com.sun.xml.parser.SAXParserFactoryImpl switch. But the test case program I wrote generates an error whether I run it with or without the classpath and -D. Thus it would seem there is the same problem in the j2sdk 1.4.2 version of jaxp (it is the first release to include jaxp). See http://external.openoffice.org/forms/jaxp.html for information on including jaxp. Andreas Bille is listed as the OpenOffice.org contact, should he be added to Cc?

OK, tracking down the parsing problem further ... it even gives you a problem if you don't use the validating parser (comment out factory.setValidating(true)). This means you can get the error on the following simple XML: <test xml:lang="xyz"> </test>

OK, have found the jaxp source, and it seems to include a binary copy of Xerces, which I suspect is where the error is coming from. This is because in the source, the error message "Invalid xml:lang value" is only found in src/xml-xerces/java/src/org/apache/xerces/impl/msg/XMLMessages.properties, where it is the value of XMLLangInvalid.
The only other file containing XMLLangInvalid is src/xml-xalan/java/bin/xercesImpl.jar (dated Jun 3 2002). So the following URL seems relevant: http://sources.redhat.com/ml/xsl-list/2000-10/msg00985.html At least in 2000, Xerces did not support xml:lang with more than two characters (although see the next message saying it should). The README next to the jar says it is Xerces 2, so from http://xml.apache.org/dist/xerces-j/old_xerces2/ it might be 2.0.1. I can't find anything in that source or any other version of Xerces. Not a Java expert, so basically giving up here on working out why the parser does that; hopefully this info is useful to someone.

I'll attach a file with useful URLs, as it took me ages to find some of them. Actually, will include the URLs here:

- openoffice page: http://external.openoffice.org/forms/jaxp.html
- SUN JAXP info:
  - homepage: http://java.sun.com/xml/jaxp/
  - docs: http://www.java.sun.com/xml/jaxp/docs.html
  - tutorial: http://www.java.sun.com/j2ee/1.4/docs/tutorial/doc/JAXPIntro.html
  - faq: http://java.sun.com/xml/jaxp/faq.html - has links to implementations
  - bug page: http://developer.java.sun.com/developer/bugParade/index.jshtml
  - community source (have to sign up): http://wwws.sun.com/software/communitysource/jaxp/index.html
    - download: http://wwws.sun.com/software/communitysource/jaxp/download.html
- complaints about lack of source:
  - http://forum.java.sun.com/thread.jsp?forum=34&thread=69199
  - http://forum.java.sun.com/thread.jsp?forum=34&thread=145059
  - http://forum.java.sun.com/thread.jsp?forum=34&thread=68754 - find it in JAVA_HOME/src.zip of jdk1.4 and the source distribution of xerces and xalan
- implementations:
  - GNU classpath implementation: http://www.gnu.org/software/classpathx/jaxp/
  - Xerces:
    - xerces Java-2 homepage: http://xml.apache.org/xerces2-j/index.html
    - download: http://xml.apache.org/xerces2-j/download.cgi
    - xerces java 1 homepage: http://xml.apache.org/xerces-j/index.html
- message about similar problem:
http://sources.redhat.com/ml/xsl-list/2000-10/msg00985.html

XML schema definition says you can have 3-char codes: http://www.w3.org/TR/2004/PER-xmlschema-1-20040318/ http://www.w3.org/TR/2004/PER-xmlschema-2-20040318/datatypes.html#language

David, Thanks a lot for spending time to track this down! Adding Andreas Bille to Cc. Btw: XML schema definition not only allows 3 character codes, it allows the full range of identifiers as defined by RFC3066. So please let's get rid of this stupid parser validation. Eike

I submitted a bug report to Sun, not sure if that will do anything... It is ID 255115, at http://developer.java.sun.com/developer/bugParade/index.jshtml But they say it will take them three weeks to process, and it won't be visible until then. Anyway the link may be helpful in future.

This really is a problem of (the history of) the XML specification. The original specification document mandated rejecting anything but 2-letter language codes; there even is a grammar production for this (see http://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag). As this is part of the grammar, it is a well-formedness constraint, so a violation is an error even if the parser is not validating. At least the Crimson parser (which used to be in JAXP and whose sources I find in my installation of JDK 1.4.x) and some versions of Xerces2 apparently implemented this faithfully. The 2nd edition takes a step towards relaxing this (by removing the grammar production) but still refers to IETF RFC 1766, which only knows ISO639 two-letter codes (see http://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag). The 3rd edition then changes this reference to refer to IETF RFC 3066 and thereby finally makes ISO639-2 codes valid for xml:lang values. This change was released only on Oct. 30th 2003. (See http://www.w3.org/TR/2003/PER-xml-20031030/#sec-lang-tag). The latest edition of the recommendation adds yet another twist: now empty values are allowed.
(See http://www.w3.org/TR/2004/REC-xml-20040204/#sec-lang-tag). Looking closer, the change that made ISO639-2 codes valid in XML even happened without any change in the XML recommendation: Edition 2000-10-06 already says that RFC1766 or its successor determines the valid values. In Jan. 2001 RFC3066 became the successor of RFC1766, and at that date 3-letter language codes became valid XML!

For any such change (one that extends the definition of which documents are well-formed) it will take some time to spread to all XML parsers and to products (or build environments) built on top of such parsers. Certainly lack of support for RFC3066 in an XML parser has to be considered a bug nowadays. But I don't think you can really blame parsers released before the 2003-10-30 spec change for this.

That said, we need to get a newer parser into our build environment to support building for ISO639-2 languages. Whether that can be a more recent Java parser or a different one (e.g. libxml) remains to be evaluated. We also need to check carefully whether adding ISO639-2 languages to the source will result in any xml:lang="<3-letters>" attributes appearing in installed files or documents, because this would mean that such documents or files could not be processed with any older (and some newer) XML processing tools. Last, but certainly not least, we need to check whether the internal XML parser of OOo can cope with this.

Got this response from Sun:

I have reviewed your bug report and could reproduce the error in JDK 1.4.2. However I tried this in JDK 1.5.0-beta1 and saw that this issue has been fixed. Please try using JDK 1.5.0-beta1 available from: http://java.sun.com/j2se/1.5.0/download.jsp

So I guess a fix would be to ship the JAXP included with JDK 1.5.0 with the openoffice source...

I'll tentatively target this issue for 2.0. This depends on libxslt/xsltproc becoming available in the environment in that timeframe.
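
The difference between the two rule sets discussed earlier in this thread can be illustrated with a small sketch. The regular expressions below are my own transcription of the LanguageID grammar production from the 1998 XML recommendation and of the Language-Tag syntax from RFC 3066; they are an assumption for illustration, not code taken from any parser, and the class name `XmlLangGrammar` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the patterns are a hand transcription of the specs.
class XmlLangGrammar {

    // XML 1.0 first edition (1998), LanguageID production:
    //   LanguageID ::= Langcode ('-' Subcode)*
    //   Langcode   ::= ISO639Code | IanaCode | UserCode
    // where ISO639Code is exactly two letters, IanaCode is "i-" plus letters,
    // UserCode is "x-" plus letters, and Subcode is one or more letters.
    static final Pattern FIRST_EDITION = Pattern.compile(
        "(?:[A-Za-z]{2}|[iI]-[A-Za-z]+|[xX]-[A-Za-z]+)(?:-[A-Za-z]+)*");

    // RFC 3066:
    //   Language-Tag   = Primary-subtag *( "-" Subtag )
    //   Primary-subtag = 1*8ALPHA
    //   Subtag         = 1*8(ALPHA / DIGIT)
    static final Pattern RFC_3066 = Pattern.compile(
        "[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*");

    static boolean firstEditionAccepts(String lang) {
        return FIRST_EDITION.matcher(lang).matches();
    }

    static boolean rfc3066Accepts(String lang) {
        return RFC_3066.matcher(lang).matches();
    }

    public static void main(String[] args) {
        for (String lang : new String[] {"en", "xy", "hsb", "nso", "en-US"}) {
            System.out.printf("%-6s first edition: %-5b RFC 3066: %b%n",
                lang, firstEditionAccepts(lang), rfc3066Accepts(lang));
        }
    }
}
```

Note that neither pattern checks whether a code is actually registered, which matches the observation in this issue that an arbitrary two-letter code like "xy" is accepted while "hsb" is rejected by a parser following the 1998 grammar.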
Changing target to 'Office later', as it is unclear by what time the necessary xsltproc support will be available (legal review, etc.), and this is not release relevant for OOo 2.0 for most languages.

Note 1: I will still try to address this as soon as xsltproc is available - even .

Note 2: Maybe it will be possible to find a temporary solution for building ISO639-2 locales with a manually configured xsltproc (using the current no-Java rules), when CWS mergebuild is integrated, so that data for such locales won't be processed unless explicitly requested.