Xerces-C++ / XERCESC-770

IANA charset names list inefficient; useful?

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Utilities
    • Labels: None
    • Environment: Operating System: All
      Platform: All
    • Bugzilla Id: 15787

    Description

      The IANA charset names list is stored inefficiently. On its own it takes up
      about 200 kB of the Xerces library.

      internal/IANAEncodings.hpp contains const XMLCh gEncodingArray[791][128]. This
      uses sizeof(XMLCh)*791*128 bytes, or about 200000 bytes with a 2-byte XMLCh.
      Most of the names are shorter than 15 or so characters, and IANA charset names
      only ever use ASCII characters. The names should therefore be stored as ASCII
      bytes, with only as many bytes per name as necessary.
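
      A minimal sketch of what such compact storage could look like, not a patch
      against the actual sources. The table excerpt, the names kIanaNames and
      isKnownIanaName, and the use of char16_t as a stand-in for a 16-bit XMLCh
      are all assumptions made for illustration; the case-insensitive comparison
      is likewise a choice of this sketch, not a statement about what Xerces does.

```cpp
#include <cstddef>

// Hypothetical excerpt of the list, stored as plain ASCII string literals.
// Each entry costs only its own length plus a NUL (and one pointer).
static const char* const kIanaNames[] = {
    "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16", "Shift_JIS", "EUC-JP"
};
static const std::size_t kIanaNameCount =
    sizeof(kIanaNames) / sizeof(kIanaNames[0]);

// Fold ASCII lowercase letters to uppercase; leave everything else alone.
static char16_t foldAscii(char16_t c) {
    return (c >= u'a' && c <= u'z')
        ? static_cast<char16_t>(c - (u'a' - u'A'))
        : c;
}

// Compare a 16-bit string against an ASCII string without widening a copy.
static bool equalsAsciiNoCase(const char16_t* wide, const char* ascii) {
    for (;; ++wide, ++ascii) {
        const char16_t a = foldAscii(*wide);
        const char16_t b = foldAscii(static_cast<unsigned char>(*ascii));
        if (a != b) return false;   // mismatch, including different lengths
        if (a == 0) return true;    // both strings ended together
    }
}

// Linear lookup of a 16-bit name in the ASCII table.
bool isKnownIanaName(const char16_t* name) {
    for (std::size_t i = 0; i < kIanaNameCount; ++i) {
        if (equalsAsciiNoCase(name, kIanaNames[i])) return true;
    }
    return false;
}
```

      The lookup compares the 16-bit name directly against the ASCII entries, so
      no widened copy of the table is needed at run time.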

      A simpler way to make this array smaller: the IANA charset registration
      imposes an upper limit of 40 characters for charset names. There are only two
      registered names that violate this (I think), and they could safely be
      omitted. Adding space for the NUL gives 41 characters per name; 128 is way
      overkill.
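
      To put numbers on this, the sizes can be worked out at compile time. This is
      only a sketch of the arithmetic: the constant names are made up here, and
      the 2-byte XMLCh is an assumption that matches the "about 200000 bytes"
      figure above.

```cpp
#include <cstddef>

// Figures taken from the description above.
constexpr std::size_t kNameCount      = 791;  // entries in gEncodingArray
constexpr std::size_t kCurrentWidth   = 128;  // XMLCh units per entry today
constexpr std::size_t kProposedWidth  = 41;   // 40 characters plus the NUL

// Assumption: XMLCh is a 2-byte code unit.
constexpr std::size_t kXmlChSize = 2;

// Current layout: XMLCh[791][128].
constexpr std::size_t kCurrentBytes =
    kXmlChSize * kNameCount * kCurrentWidth;

// Same element type, but only 41 units per entry: XMLCh[791][41].
constexpr std::size_t kWide41Bytes =
    kXmlChSize * kNameCount * kProposedWidth;

// ASCII bytes, 41 per entry: char[791][41].
constexpr std::size_t kAscii41Bytes =
    1 * kNameCount * kProposedWidth;

static_assert(kCurrentBytes == 202496, "current table size");
static_assert(kWide41Bytes  == 64862,  "41-unit XMLCh table size");
static_assert(kAscii41Bytes == 32431,  "41-byte ASCII table size");
```

      Under these assumptions, shortening the rows alone cuts the table to under
      a third of its size, and storing ASCII bytes halves it again.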

      I also wonder whether this list is useful at all. Xerces only verifies that a
      name exists in the list. It does not verify that it has a converter for it
      (other than failing to open it, which does not use this list). It also cannot
      verify that the charset an XML document declares actually matches the
      converter that Xerces is going to open for this name (e.g., the Shift-JIS
      mismatches among Windows, Unix, and mainframe platforms; see the W3C Japanese
      profile for XML).

      I suggest adding a compile-time option (#ifdef) to remove the IANA charset
      name list (that is, #ifdef out the use of EncodingValidator in
      util/TransService.cpp).
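
      Roughly, such a switch could look like the sketch below. The macro name
      XERCES_NO_IANA_LIST and both functions are hypothetical, invented for this
      illustration; the real check goes through EncodingValidator in
      util/TransService.cpp, and that code is not reproduced here.

```cpp
#include <string>

#ifndef XERCES_NO_IANA_LIST   // hypothetical macro name
// Hypothetical stand-in for the lookup in the IANA names list.
static bool nameIsInIanaList(const std::string& name) {
    return name == "UTF-8" || name == "ISO-8859-1";
}
#endif

// Decide whether an encoding name passes the up-front name check.
bool encodingNameAccepted(const std::string& name) {
#ifdef XERCES_NO_IANA_LIST
    // List compiled out: accept the name here and leave it to the
    // converter lookup to fail later if nothing can be opened for it.
    (void)name;
    return true;
#else
    return nameIsInIanaList(name);
#endif
}
```

      With the macro defined, neither the lookup nor the table behind it is
      compiled in, and an unknown name is rejected only when opening a converter
      for it fails.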

      Note that ICU4C 2.2+ has data structures and APIs for dealing with charset
      names associated with various standards (like IANA) and platforms. ICU4C does
      not have a complete list of IANA names, but this is a matter of adding them to
      its convrtrs.txt, not a real implementation issue.

      Best regards,
      markus

          People

            Assignee: Unassigned
            Reporter: Markus Scherer (markus.scherer@jtcsv.com)