  1. Xerces-C++
  2. XERCESC-1166

Xerces cannot open file whose name includes UTF8 characters


Details

    • Type: Bug
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.5.0
    • Component/s: Utilities
    • Labels: None
    • Environment: Operating System: Other
      Platform: Macintosh
    • Bugzilla Id: 27270

    Description

      I originally wrote about this in the e-mail attached below.
      James Berry asked me to file the bug report; see his e-mail below as well.

      On Feb 25, 2004, at 5:31 PM, Mark Goldstein wrote:

      Hello,

      Using Xalan/Xerces I tried to transform a file with a name that included an "e"
      with accent. Your mailer might show it:
      féébad.xml

      The command line call (using Mac OS-X copy/paste which converts the characters
      to octal constants) looks like this:
      mark$ ./Xalan -o foo.out fe\314\201e\314\201bad.xml foo.xsl

      And it results in this error:
      Fatal Error at (unknown file , line 0 , column 0): An exception occurred!
      Type:RuntimeException,
      Message:The primary document entity could not be opened. Id=féébad.xml
      SAXParseException: An exception occurred! Type:RuntimeException,
      Message:The primary document entity could not be opened. Id=féébad.xml (, line
      0, column 0)

      Is this a known bug? Is there a work-around?

      This isn't a known bug, but, having done a bit of snooping, I do believe that it
      is a bug.

      Here's what I think is going on:

      Xerces creates a transcoder that converts from the local code page to Unicode
      (the LCP Transcoder). On Mac OS, it assumes the local code page is whatever the
      default system script encoding is, which is often MacRoman. This LCP Transcoder
      is used whenever an XMLString is created from a char*. That is done, for
      instance, as part of taking a file name off the command line and creating a
      parser from it.

      The problem in your case is that the characters coming off the Mac OS X command
      line are actually UTF-8, not MacRoman (or whatever the system script encoding
      is). They're being converted to UTF-16 as if they were MacRoman. And all hell
      breaks loose, including the unfortunate fact that the file can't be opened.
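
      To make the mismatch concrete, here is a minimal standalone sketch (not
      Xerces code) that decodes the bytes from the command line above as if they
      were MacRoman. The helper names are hypothetical, and the MacRoman table is
      deliberately partial: it maps only the two high bytes that occur in this
      file name (0xCC and 0x81) and substitutes U+FFFD for any other.

```cpp
#include <string>

// Partial MacRoman table, for illustration only: just the two high bytes
// that occur in the file name from this report are mapped. Any other byte
// >= 0x80 becomes U+FFFD (the sketch does not cover it).
char16_t macRomanToUtf16(unsigned char b) {
    if (b < 0x80) return static_cast<char16_t>(b);  // ASCII range is identical
    switch (b) {
        case 0x81: return 0x00C5;  // LATIN CAPITAL LETTER A WITH RING ABOVE
        case 0xCC: return 0x00C3;  // LATIN CAPITAL LETTER A WITH TILDE
        default:   return 0xFFFD;  // not covered by this sketch
    }
}

// Treat every byte as one MacRoman character, which is what a single-byte
// local-code-page conversion does to UTF-8 input.
std::u16string decodeAsMacRoman(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) out.push_back(macRomanToUtf16(b));
    return out;
}

// The bytes the shell passes for the file name: 'e' followed by the UTF-8
// encoding of U+0301 COMBINING ACUTE ACCENT (octal 314 201 = hex CC 81).
const std::string kNameFromShell = "fe\314\201e\314\201bad.xml";

// The UTF-16 name that would actually match the file on disk.
const std::u16string kIntendedName = u"fe\u0301e\u0301bad.xml";
```

      For this file name, the 14 input bytes become 14 UTF-16 code units, and
      each combining accent (bytes 0xCC 0x81) turns into the pair U+00C3 U+00C5,
      so decodeAsMacRoman(kNameFromShell) no longer equals the 12-code-unit
      kIntendedName.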

      This is a bit of a no-win situation. We could simply make the LCP Transcoder
      assume the LCP is always UTF-8, but that would require a major re-architecting
      of the transcoder, since we rely on the lower level Unicode converter, which
      can't transcode between Unicode encodings, only to and from them. It also may
      not be quite the right answer either, since it just fixes the situation for the
      command line and ignores the fact that there are a number of other LCP encodings
      being used, which this decision could affect.

      There are probably a number of workarounds, but they all basically boil down to
      not relying on the LCP transcoder to convert the UTF-8 string from the command
      line into Unicode in the first place. For instance, you could explicitly call
      the intrinsic UTF-8 transcoder through TransService, or cheat and call
      TranscodeUTF8ToUniChars, which is buried down in MacOSPlatformUtils. There are
      probably better solutions, but it's getting late for me now. Once you have the
      filename in UTF-16, pass that directly into the parser.
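
      As a rough sketch of that workaround, the conversion step could look like
      the standalone decoder below. It is a stand-in for the Xerces UTF-8
      transcoder, not the library's own code; the function name utf8ToUtf16 is
      hypothetical, and the sketch assumes the parser's character type holds
      16-bit UTF-16 code units. It checks for truncated and malformed sequences
      but does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string (e.g. argv[n] on Mac OS X) into UTF-16 code
// units, bypassing any local-code-page conversion.
// Throws std::invalid_argument on truncated or malformed input.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

      Applied to the file name from the command line above, this yields
      'f', 'e', U+0301, 'e', U+0301 followed by "bad.xml", which is the form
      that would then be handed to a parser entry point that accepts a
      wide-character path instead of a char*.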

      There may be other simpler workarounds, which might include simply changing the
      encoding of text in the terminal to MacRoman, or whatever. But it's making my
      head hurt to understand the interactions that would occur in that case...your
      file wouldn't list correctly in that case, I would think.

      Please let me know how it goes, and if you could write a bug report that would
      help as well.

      James.

          People

            jberry@apache.org James Berry
            kvetch@att.net Mark Goldstein