Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
2.1.0
-
None
-
None
-
Operating System: All
Platform: All
-
15787
Description
The IANA charset names list is stored inefficiently. It alone takes up 200 kB
in the Xerces library.
internal/IANAEncodings.hpp contains const XMLCh gEncodingArray[791][128]. This
uses sizeof(XMLCh)*791*128 bytes, or about 200,000 bytes with a 2-byte XMLCh.
Most of the names are shorter than 15 or so characters, and only ASCII
characters are ever used in IANA charset names. The names should therefore be
stored as ASCII bytes, using only as many bytes per name as necessary.
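To illustrate the idea, here is a minimal sketch of storing the names as ASCII
C strings and comparing them against a 16-bit name at lookup time. It is not
the Xerces code: XMLCh16, kCharsetNames, equalsAscii, isListedCharset and
compactTableBytes are hypothetical names, only a handful of charset names are
listed, XMLCh is assumed to be 16 bits wide, and the match is assumed to be
exact (case-sensitive).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for Xerces' XMLCh, assumed here to be 16 bits wide.
typedef unsigned short XMLCh16;

// A few illustrative charset names stored as plain ASCII C strings.
// Each entry costs one pointer plus strlen(name) + 1 bytes,
// instead of 128 XMLCh units per name.
static const char* const kCharsetNames[] = {
    "US-ASCII",
    "ISO-8859-1",
    "UTF-8",
    "UTF-16",
    "Shift_JIS",
    "EUC-JP"
};
static const std::size_t kCharsetNameCount =
    sizeof(kCharsetNames) / sizeof(kCharsetNames[0]);

// Compare a NUL-terminated 16-bit string with an ASCII string, unit by unit.
static bool equalsAscii(const XMLCh16* wide, const char* ascii) {
    assert(wide != 0 && ascii != 0);
    while (*wide != 0 && *ascii != 0) {
        if (*wide != static_cast<unsigned char>(*ascii)) {
            return false;
        }
        ++wide;
        ++ascii;
    }
    // Equal only if both strings ended at the same position.
    return *wide == 0 && *ascii == 0;
}

// Returns true if the 16-bit name exactly matches one of the listed names.
bool isListedCharset(const XMLCh16* name) {
    for (std::size_t i = 0; i < kCharsetNameCount; ++i) {
        if (equalsAscii(name, kCharsetNames[i])) {
            return true;
        }
    }
    return false;
}

// Total footprint of this table: the pointer array plus the string bytes.
std::size_t compactTableBytes() {
    std::size_t total = sizeof(kCharsetNames);
    for (std::size_t i = 0; i < kCharsetNameCount; ++i) {
        total += std::strlen(kCharsetNames[i]) + 1;  // +1 for the NUL
    }
    return total;
}
```

With this layout the cost per name is proportional to its length, so short
names no longer pay for 128 code units each.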
A simpler way to make this array smaller: the IANA charset registration
imposes an upper limit of 40 characters on charset names. Only two registered
names violate this (I think), and they could safely be omitted. Adding space
for the terminating NUL gives 41 characters per name; 128 characters per name
is overkill.
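The size arithmetic can be checked with a small sketch. The count of 791 names
and the widths of 128 and 40 + 1 come from this report; the 2-byte unit size
for XMLCh is an assumption, and tableBytes and the constant names are
hypothetical.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kNameCount = 791;          // entries in gEncodingArray
const std::size_t kCurrentWidth = 128;       // units per name today
const std::size_t kMaxIanaNameLength = 40;   // IANA registration limit
const std::size_t kProposedWidth = kMaxIanaNameLength + 1;  // plus the NUL

// Size in bytes of a fixed-width table of names.
// unitSize is the size of one character unit (assumed 2 for XMLCh, 1 for char).
std::size_t tableBytes(std::size_t unitSize, std::size_t names,
                       std::size_t width) {
    assert(unitSize > 0);
    return unitSize * names * width;
}
```

Under these assumptions, tableBytes(2, 791, 128) is 202496 bytes for the
current layout, tableBytes(2, 791, 41) is 64862 bytes with 41 XMLCh per name,
and tableBytes(1, 791, 41) is 32431 bytes with 41 ASCII bytes per name.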
I also wonder whether this list is useful at all. Xerces only verifies that a
name exists in the list. It does not verify that it has a converter for that
name (other than by failing to open the converter, which does not use this
list). Nor can it verify that the charset the XML document declares actually
matches the converter that Xerces is going to open for this name (e.g.,
mismatches between Shift-JIS variants on Windows/Unix/mainframe, see the W3C
Japanese profile for XML).
I suggest adding a compile-time option (#ifdef) to remove the IANA charset name
list (#ifdef out the use of EncodingValidator in util/TransService.cpp).
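A rough sketch of what such a compile-time switch could look like follows. It
does not reproduce util/TransService.cpp or EncodingValidator: the macro
NO_IANA_NAME_CHECK, the table kKnownNames and the function
encodingNameAccepted are all hypothetical, and the list is cut down to two
names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical macro: define it at build time to compile out the name list.
// #define NO_IANA_NAME_CHECK

#ifndef NO_IANA_NAME_CHECK
// Stand-in for the IANA name list; only present when the check is enabled.
static const char* const kKnownNames[] = { "UTF-8", "ISO-8859-1" };
static const std::size_t kKnownNameCount =
    sizeof(kKnownNames) / sizeof(kKnownNames[0]);
#endif

// Stand-in for the validation step performed before a converter is opened.
bool encodingNameAccepted(const char* name) {
    assert(name != 0);
#ifdef NO_IANA_NAME_CHECK
    // List compiled out: accept any name and rely on the later attempt to
    // open a converter to reject names that are not supported.
    (void)name;
    return true;
#else
    for (std::size_t i = 0; i < kKnownNameCount; ++i) {
        if (std::strcmp(name, kKnownNames[i]) == 0) {
            return true;
        }
    }
    return false;
#endif
}
```

With the macro defined, neither the table nor the lookup loop is compiled, so
the list adds nothing to the library size.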
Note that ICU4C 2.2+ has data structures and APIs for dealing with charset
names associated with various standards (like IANA) and platforms. ICU4C does
not have a complete list of IANA names, but this is a matter of adding them to
its convrtrs.txt, not a real implementation issue.
Best regards,
markus