Issue 27964

Summary: com.sun.xml.parser.SAXParserFactoryImpl does not support language codes with more than two characters
Product: utilities Reporter: davidfraser <davidf>
Component: code Assignee: AOO issues mailing list <issues>
Status: ACCEPTED --- QA Contact: issues@tools <issues>
Severity: Trivial    
Priority: P3 CC: andreas.bille, gerry, hans-joachim.lankenau, issues, joerg.barfurth, lars.oppermann, ooo, pavel
Version: OOo 1.1.1 Keywords: needhelp
Target Milestone: OOo Later   
Hardware: All   
OS: All   
Issue Type: DEFECT Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on: 30380    
Issue Blocks:    
Attachments:
Description Flags
Test-case java code to generate JAXP error on 3-char xml:lang value none
Test-case xml file to generate JAXP error on 3-char xml:lang value none

Description davidfraser 2004-04-19 11:20:10 UTC
When trying to build OpenOffice.org with an ISO 639-2 language code that has more
than two characters, the XML parser bombs out on the xcu files.
It does not matter what the code is, it seems to require a two-character code.
This prevents using the longer codes for localization.
Comment 1 davidfraser 2004-04-19 11:24:16 UTC
Make error log from Prof. Dr. Eduard Werner / Edward Wornar
See http://l10n.openoffice.org/servlets/BrowseList?list=dev&by=thread&from=178070

-------------+ validating and creating a locale independent file
mkdir -p ../../unxlngi4.pro/misc/registry/data/org/openoffice/Office/
java -classpath 
/home/edi/projekty/oo_1.1.1_src/solver/645/unxlngi4.pro/bin/jaxp.jar:/home/edi/projekty/oo_1.1.1_src/solver/645/unxlngi4.pro/bin/parser.jar:../../unxlngi4.pro/class/cfgimport.jar

-Djavax.xml.parsers.SAXParserFactory=com.sun.xml.parser.SAXParserFactoryImpl 
org.openoffice.configuration.Inspector org/openoffice/Office/Common.xcu
** Start validating: 
file:/home/edi/projekty/oo_1.1.1_src/officecfg/registry/data/org/openoffice/Office/Common.xcu

** Parsing error, line 86, uri 
file:/home/edi/projekty/oo_1.1.1_src/officecfg/registry/data/org/openoffice/Office/Common.xcu
   Illegal xml:lang value "hsb".
dmake:  Error code 1, while making 
'../../unxlngi4.pro/misc/registry/data/org/openoffice/Office/Common.xcu'
---* TG_SLO.MK *---

ERROR: Error 65280 occurred while making 
/home/edi/projekty/oo_1.1.1_src/officecfg/registry/data
dmake:  Error code 1, while making 'build_all'
Comment 2 davidfraser 2004-04-19 11:26:52 UTC
Note: other related issues were in
http://www.openoffice.org/issues/show_bug.cgi?id=19335 but have been fixed
Comment 3 grsingleton 2004-04-19 16:36:17 UTC
Added myself to cc list
Comment 4 hjs 2004-04-19 17:12:11 UTC
trying to stay informed...
Comment 5 joerg.barfurth 2004-04-20 10:54:26 UTC
What makes you think that this doesn't work for any ISO639-2 code? I suspect the
problem is elsewhere (although it may amount to the same effect).

The xml:lang value must be a valid locale. I suppose a java parser would use
class java.util.Locale to check this. IOW: this only supports languages that are
known to the Java being used.

A cursory glance at the Java API documentation raises doubts as to how well
ISO639-2 languages are supported. It appears that Java 1.5 might bring
improvements here (at least their reference document now lists many languages
that only have ISO639-2 codes). Nevertheless the documentation still talks a lot
about two-letter codes. For more details or more authoritative information, you
should look starting from java.sun.com.

To check whether a particular version of Java knows your locale, you could look
through the list of locales returned by java.util.Locale.getAvailableLocales().
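
A rough sketch of such a check follows. The class name LocaleCheck and the
helper isKnownLanguage are made up for illustration and are not part of the
OOo build. Note that Locale lives in the java.util package. The answer depends
entirely on the Java runtime that executes it.

```java
import java.util.Locale;

// Hypothetical helper for illustration only, not part of the OOo sources.
class LocaleCheck {
    // True if the running JVM lists at least one locale with this language code.
    static boolean isKnownLanguage(String code) {
        Locale[] available = Locale.getAvailableLocales();
        for (int i = 0; i < available.length; i++) {
            if (available[i].getLanguage().equals(code)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Default to the code from this issue; pass another code as argument.
        String code = args.length > 0 ? args[0] : "hsb";
        System.out.println(code + " known to this JVM: " + isKnownLanguage(code));
    }
}
```

A "false" here only says that this JVM ships no locale data for the code. It
does not by itself show that an XML parser would reject the code.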

BTW: Does your system support the hsb locale?

As a workaround, you could try switching to xsltproc based processing in
officecfg (see the .IF $(SOLAR_JAVA) blocks in officecfg/util/makefile.pmk). If
that resolves your problem, I can check if this can be made the standard processing.

(Setting to confirmed, even though I haven't verified this myself)
Comment 6 davidfraser 2004-04-20 11:30:42 UTC
The reason I suspect it doesn't work for any ISO-639-2 language code is that it
does work for *any* two-character language code, so it seems like it's only
checking the length, not the validity of the code.
I'm trying to investigate but it seems quite difficult to get hold of the jaxp
source code.
Comment 7 ooo 2004-04-20 13:41:40 UTC
Joerg,

Does your comment imply that we're not able to properly support locales that a
particular Java  version used during build time doesn't know about? This is
ridiculous.

How to circumvent this problem, giving at least the build a chance to not bail
out, and what would be the consequences if those configuration entries in
question weren't localized?

Eike
Comment 8 hennes.rohling 2004-04-20 14:14:00 UTC
Reassigned.
Comment 9 davidfraser 2004-04-20 14:36:01 UTC
jb: the java.util.Locale class supports any value for a locale string; see 
http://java.sun.com/j2se/1.4.2/docs/api/java/util/Locale.html
"Because a Locale object is just an identifier for a region, no validity check
is performed when you construct a Locale."
I have verified this with a test program ("en", "de", "xy", "nso", "hsb", "xyz"
are all accepted).
So it is pretty clear this is an xml-specific problem.
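
For reference, a check along these lines can be sketched as follows. This is
not the test program mentioned above, only an assumed equivalent; the class
name LocaleNoValidation and the helper languageOf are invented for the example.

```java
import java.util.Locale;

// Hypothetical reconstruction for illustration; not the original test program.
class LocaleNoValidation {
    // Builds a Locale from a bare language code and returns what it stored.
    // The constructor performs no validity check on the code.
    @SuppressWarnings("deprecation") // the constructor is deprecated in recent JDKs
    static String languageOf(String code) {
        return new Locale(code).getLanguage();
    }

    public static void main(String[] args) {
        String[] codes = {"en", "de", "xy", "nso", "hsb", "xyz"};
        for (int i = 0; i < codes.length; i++) {
            System.out.println(codes[i] + " -> " + languageOf(codes[i]));
        }
    }
}
```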
Comment 10 joerg.barfurth 2004-04-20 14:36:40 UTC
Eike,

Generally we cannot easily support locales that XML processing tools
(particularly validating parsers) used during build time don't recognize as valid. 

It probably depends on the parser used how they validate locale names. For Java
parsers it seems likely that this is indeed determined by the Java
implementation being used for the build. For other parsers it might be the
OS being used to build or the parser version itself.

None of this is confirmed. David is trying to find out more about how a Java
parser (or our particular parser) behaves. For any replacement we would have to
find out the behavior again, before we can move to it.

Recently some people have added support for building officecfg without Java.
Sadly no one contacted me, so there was no evaluation of whether it would be
possible to use that solution (xsltproc) for all builds (I am not even sure on
what platforms and with which additional requirements their solution works). If
we can do that I would gladly do it. But then we need to check first how
xsltproc behaves wrt locale validation.

The problem of course is that, if only one platform still requires the old
parser, we can't merge the localized strings into the source - because it would
break that platform for all locales.

In the given case, not localizing the data would mean that Menu items under
'File - New' and 'File - Auto Pilot', as well as some other UI strings (e.g.
names of OLE objects for our own document types) would not be localized.

Joerg
Comment 11 ooo 2004-04-20 15:01:41 UTC
I refuse to be the owner of this. I don't even know which parser is used for
what reason and who introduced it or why it doesn't accept any language string
conforming to RFC3066 there. Reassigning to Joerg.
Comment 12 ooo 2004-04-20 15:02:45 UTC
I hate those radio buttons..
Comment 13 joerg.barfurth 2004-04-20 15:12:13 UTC
I will look into this.

I can use any help with identifying why/where exactly this fails.
Information about the behavior of other xml processors is also appreciated.

Comment 14 davidfraser 2004-04-20 15:22:58 UTC
Thanks Joerg
I have been investigating the details
I have narrowed down a minimal test case .xcu file and java class (based on
Inspector.java) that tests using JAXP to parse it and produces the error, and
will attach them.
Comment 15 davidfraser 2004-04-20 15:24:59 UTC
Created attachment 14656 [details]
Test-case java code to generate JAXP error on 3-char xml:lang value
Comment 16 davidfraser 2004-04-20 15:27:42 UTC
Created attachment 14657 [details]
Test-case xml file to generate JAXP error on 3-char xml:lang value
Comment 17 davidfraser 2004-04-20 15:29:24 UTC
According to http://www.w3.org/International/O-HTML-tags.html 3-character codes
should be valid in XML.
Also note that the parser doesn't fail for a random two-character code like "xy"
so it is not necessarily doing any real validation of the locale.
Comment 18 davidfraser 2004-04-20 15:39:40 UTC
It seems that it is using the JAXP implementation included in
external/common/jaxp.jar
This seems to be specification version 1.0.0, it seems really difficult to get
hold of the source code and there are lots of versions floating around...
The Inspector.java gets run by dmake with a classpath to include it, and a
-Djavax.xml.parsers.SAXParserFactory=com.sun.xml.parser.SAXParserFactoryImpl switch.
But the test case program I wrote seems to generate an error if I run with or
without the classpath and -D
Thus it would seem there is the same problem in the j2sdk 1.4.2 version of jaxp
(it is the first release to include jaxp)
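
One way to see which implementation a given classpath and -D combination
actually selects is to ask the factory for its class name. A minimal sketch,
with the class name WhichSaxFactory invented for the example:

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic for illustration; not part of the OOo build.
class WhichSaxFactory {
    // Name of the SAXParserFactory implementation that JAXP resolves to.
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The system property is null unless it was set, e.g. with -D.
        System.out.println("javax.xml.parsers.SAXParserFactory property: "
                + System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println("factory in use: " + factoryClassName());
    }
}
```

Running it once with the dmake classpath and -D switch and once without should
show whether the two runs really use different parsers.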
Comment 19 davidfraser 2004-04-20 15:52:57 UTC
See http://external.openoffice.org/forms/jaxp.html for information on including
jaxp - Andreas Bille is listed as the OpenOffice.org contact, should he be added
to Cc?
Comment 20 davidfraser 2004-04-20 16:25:37 UTC
OK, tracking down the parsing problem further ... it even gives you a problem if
you don't use the validating parser (comment out factory.setValidating(true)).
This means you can get the error on the following simple XML:

<test xml:lang="xyz">
</test>
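
A self-contained way to try this against whatever JAXP parser is on the
classpath is sketched below. It is not the attached test case; the class name
XmlLangParseCheck and the helper parseError are assumptions made for the
example. Whether the 3-letter value is rejected depends on the parser version,
so no particular output is claimed here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch for illustration; not the attached test case.
class XmlLangParseCheck {
    // Returns null if the document parses, otherwise the parser's error message.
    static String parseError(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false); // non-validating, as described above
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (Exception e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        String[] langs = {"en", "xy", "xyz", "hsb"};
        for (int i = 0; i < langs.length; i++) {
            String xml = "<test xml:lang=\"" + langs[i] + "\">\n</test>";
            String error = parseError(xml);
            System.out.println(langs[i] + ": " + (error == null ? "parsed" : error));
        }
    }
}
```") != null) throw new AssertionError("well-formed document with 2-letter code should parse");
if (XmlLangParseCheck.parseError("<test>") == null) throw new AssertionError("unclosed element should be reported");
</test>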

Comment 21 davidfraser 2004-04-20 16:48:59 UTC
OK, have found the jaxp source, and it seems to include a binary copy of
Xerces, which I suspect is where the error is coming from.
This is because in the source, the error message "Invalid xml:lang value" is
only found in
src/xml-xerces/java/src/org/apache/xerces/impl/msg/XMLMessages.properties where
it is the value of XMLLangInvalid. The only other file containing XMLLangInvalid
is src/xml-xalan/java/bin/xercesImpl.jar (dated Jun 3 2002)
So the following URL seems relevant:
http://sources.redhat.com/ml/xsl-list/2000-10/msg00985.html
At least in 2000, Xerces did not support xml:lang with more than two characters
(although see the next message saying it should)
The README next to the jar says it is Xerces 2. So from
http://xml.apache.org/dist/xerces-j/old_xerces2/ it might be 2.0.1
I can't find anything in that source or any other version of Xerces

Not a Java expert so basically giving up here on working out why the parser does
that, hopefully this info is useful to someone
I'll attach a file with useful URLs as it took me ages to find some of them
Comment 22 davidfraser 2004-04-20 16:51:36 UTC
actually will include URLs here:
openoffice page:
  http://external.openoffice.org/forms/jaxp.html

SUN JAXP info:
  homepage:
    http://java.sun.com/xml/jaxp/
  docs:
    http://www.java.sun.com/xml/jaxp/docs.html
  tutorial:
    http://www.java.sun.com/j2ee/1.4/docs/tutorial/doc/JAXPIntro.html
  faq:
    http://java.sun.com/xml/jaxp/faq.html
    - has links to implementations
  bug page:
    http://developer.java.sun.com/developer/bugParade/index.jshtml
  community source (have to sign up):
    http://wwws.sun.com/software/communitysource/jaxp/index.html
    download:
    * http://wwws.sun.com/software/communitysource/jaxp/download.html

complaints about lack of source:
  http://forum.java.sun.com/thread.jsp?forum=34&thread=69199
  http://forum.java.sun.com/thread.jsp?forum=34&thread=145059
  http://forum.java.sun.com/thread.jsp?forum=34&thread=68754
  - find it in JAVA_HOME/src.zip of jdk1.4 and the source distribution of xerces
and xalan
implementations:
  GNU classpath implementation:
    http://www.gnu.org/software/classpathx/jaxp/

Xerces:
  xerces Java-2 homepage:
    http://xml.apache.org/xerces2-j/index.html
  download:
    http://xml.apache.org/xerces2-j/download.cgi
  xerces java 1 homepage:
    http://xml.apache.org/xerces-j/index.html
  message about similar problem:
    http://sources.redhat.com/ml/xsl-list/2000-10/msg00985.html

XML schema definition says you can have 3-char codes
  http://www.w3.org/TR/2004/PER-xmlschema-1-20040318/
  http://www.w3.org/TR/2004/PER-xmlschema-2-20040318/datatypes.html#language
Comment 23 ooo 2004-04-20 17:01:05 UTC
David,

Thanks a lot for spending time to track this down!

Adding Andreas Bille to Cc.

Btw: XML schema definition not only allows 3 character codes, it allows the full
range of identifiers as defined by RFC3066. So please let's get rid of this
stupid parser validation.

Eike
Comment 24 davidfraser 2004-04-20 21:34:39 UTC
I submitted a bug report to Sun, not sure if that will do anything...
It is ID 255115, at http://developer.java.sun.com/developer/bugParade/index.jshtml
But they say it will take them three weeks to process, and won't be visible
until then. Anyway the link may be helpful in future.
Comment 25 joerg.barfurth 2004-04-22 15:09:21 UTC
This really is a problem of (the history of) the XML specification.

The original specification document mandated rejecting anything but 2-letter
language codes, there even is a grammar production for this (see
http://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag). As this is part of the
grammar, it is a well-formedness constraint, so a violation is an error even if
the parser is not validating.

At least the Crimson parser (which used to be in JAXP and whose sources I find
in my installation of JDK 1.4.x) and some versions of Xerces2 apparently
implemented this faithfully.

The 2nd edition takes a step towards relaxing this (by removing the grammar
production) but still refers to IETF RFC 1766, which only knows ISO639
two-letter codes (see http://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag).

The 3rd edition then changes this reference to refer to IETF RFC 3066 and
thereby finally makes ISO639-2 codes valid for xml:lang values. This change was
released only Oct. 30th 2003. (See
http://www.w3.org/TR/2003/PER-xml-20031030/#sec-lang-tag).

The latest edition of the recommendation adds yet another twist: now empty
values are allowed.  (See http://www.w3.org/TR/2004/REC-xml-20040204/#sec-lang-tag).

Looking closer, the change that ISO639-2 codes became valid in XML even happened
without any change in the XML recommendation: Edition 2000-10-06 already says
that RFC1766 or its successor determine the valid values. In Jan. 2001 RFC3066
became the successor of RFC1766 and at that date 3-letter language codes became
valid XML!
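
The difference between the two rule sets can be illustrated with two regular
expressions. This is a sketch based on my reading of productions [33] to [38]
of the 1998 recommendation and of the Language-Tag syntax in RFC 3066. The
class and method names are invented, and neither pattern is taken from any
parser's source.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the two grammars; not code from any parser.
class LangTagGrammar {
    // XML 1.0 first edition: LanguageID ::= Langcode ('-' Subcode)*
    // Langcode is a 2-letter ISO639Code, or "i-"/"x-" followed by letters.
    static final Pattern XML_1998 = Pattern.compile(
            "(?:[A-Za-z]{2}|[iI]-[A-Za-z]+|[xX]-[A-Za-z]+)(?:-[A-Za-z]+)*");

    // RFC 3066: primary subtag of 1 to 8 letters, then optional subtags of
    // 1 to 8 letters or digits, separated by '-'.
    static final Pattern RFC_3066 = Pattern.compile(
            "[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*");

    static boolean validIn1998(String tag) {
        return XML_1998.matcher(tag).matches();
    }

    static boolean validInRfc3066(String tag) {
        return RFC_3066.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String[] tags = {"en", "xy", "en-US", "hsb", "nso"};
        for (int i = 0; i < tags.length; i++) {
            System.out.println(tags[i] + ": 1998=" + validIn1998(tags[i])
                    + " rfc3066=" + validInRfc3066(tags[i]));
        }
    }
}
```

Under the 1998 grammar a made-up code like "xy" passes while "hsb" does not,
which matches the behavior reported earlier in this issue: the check is purely
syntactic and never consults a list of known languages.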

For any such change (that extends the definition which documents are
well-formed) it will take some time to spread to all XML parsers and to products
(or build environments) built on top of such parsers.

Certainly lack of support for RFC3066 in an XML parser has to be considered a
bug nowadays. But I don't think you can really blame parsers released before the
2003-10-30 spec change for this.

That said we need to get a newer parser into our build environment to support
building for ISO639-2 languages. Whether that can be a more recent Java parser
or a different one (e.g. libxml) remains to be evaluated. 

We also need to have a careful look, if adding ISO639-2 languages to the source
will result in any xml:lang="<3-letters>" attributes appearing in installed
files or documents. Because this would mean that such documents or files could
not be processed with any older (and some newer) XML processing tools. 
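
A first pass over generated files could be a plain text scan for xml:lang
values whose primary tag is longer than two letters. The sketch below is an
assumption about how such a check might look, with invented names
(XmlLangScan, longLanguageTags). It works on text rather than parsed XML, so it
can also match inside comments or CDATA, and it only covers the literal
attribute name xml:lang.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scan helper for illustration; not part of the OOo build.
class XmlLangScan {
    // Matches xml:lang="..." or xml:lang='...' and captures the value.
    private static final Pattern XML_LANG =
            Pattern.compile("xml:lang\\s*=\\s*([\"'])([^\"']*)\\1");

    // Returns every xml:lang value whose primary tag has more than two characters.
    static List<String> longLanguageTags(String text) {
        List<String> hits = new ArrayList<String>();
        Matcher m = XML_LANG.matcher(text);
        while (m.find()) {
            String tag = m.group(2);
            int dash = tag.indexOf('-');
            String primary = dash < 0 ? tag : tag.substring(0, dash);
            if (primary.length() > 2) {
                hits.add(tag);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "<a xml:lang=\"hsb\"/><b xml:lang='en-US'/>";
        System.out.println(longLanguageTags(sample));
    }
}
```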

Last, but certainly not least, we need to check whether the internal XML parser
of OOo can cope with this.


Comment 26 davidfraser 2004-04-22 19:52:55 UTC
Got this response from Sun:

I have reviewed your bug report and could reproduce the error in JDK 1.4.2.
However I tried this in JDK 1.5.0-beta1 and saw that this issue has been fixed.
Please try using JDK 1.5.0-beta1 available from: -

http://java.sun.com/j2se/1.5.0/download.jsp

So I guess a fix would be to include the JAXP included with JDK 1.5.0 with the
openoffice source...
Comment 27 joerg.barfurth 2004-06-04 12:07:11 UTC
I'll tentatively target this issue for 2.0. This depends on libxslt/xsltproc
becoming available in the environment in that timeframe.
Comment 28 joerg.barfurth 2004-06-23 14:41:24 UTC
Changing target to 'Office later', as it is unclear by what time the necessary
xsltproc support will be available (legal review, etc.), and this is not release
relevant for OOo 2.0 for most languages.

Note 1: I will still try to address this as soon as xsltproc is available - even .

Note 2: Maybe it will be possible to find a temporary solution for building
ISO639-2 locales with a manually configured xsltproc (using the current no-Java
rules), when CWS mergebuild is integrated, so that data for such locales won't
be processed unless explicitly requested.