Uploaded image for project: 'Subversion'
  1. Subversion
  2. SVN-2487

mod_dav_svn and locales fail to play nicely together

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: trunk
    • Fix Version/s: 1.8.0
    • Component/s: mod_dav_svn
    • Labels:

      Description

      This problem can manifest itself in a number of ways, but the underlying root
      issue is that httpd runs with a locale of C, the default POSIX locale, which
      uses a 7 bit ASCII character set.  Internally in various places in Subversion we
      attempt to convert from our internal utf8 strings into native encoded strings,
      which fails spectacularly under DAV when the utf8 string includes multibyte
      characters because those characters cannot be expressed in 7 bit ASCII encoding.
      
      I brought this problem up on the HTTPD dev list, and the consensus is that httpd
      runs in the C locale because using the system locale (i.e. respecting the LANG
      or LC_ALL environment variables via a call to setlocale(LC_ALL, "") like svn
      does for all its command line programs) results in unpredictable behavior for
      various functions that depend on the locale.  Apparently some modules take
      matters into their own hands and use setlocale to set the locale manually, but
      this is very much not a recommended practice because it's a global setting and
      the results are not easily predicted.
      
      What is the end result of this?
      
      Well, you can't use 'svn lock' on a file that has multibyte characters in its
      path, because when mod_dav_svn tries to call the pre-lock hook script it needs
      to pass the filename, which it tries to translate into native encoding first. 
      This case is even weirder, since the hook script actually runs in an empty
      environment, so it has no way to know what its locale actually is because it
      can't access the appropriate environment variables from the parent httpd process.
      
      Perhaps more disturbing is that if a repository has multibyte characters in its
      path (i.e. it's just in a directory that's got multibyte charactes in its name)
      you can't do anything at all with it, even browsing the repository results in
      errors about not being able to open the repository, the underlying errors that
      show up in error_log are predictably about translating from utf8 -> native.
      
      What's the fix?  I have no idea.  We can't just trust that converting to
      "native" encoding is the correct thing to do in all cases, but unfortunately
      we've got an awful lot of code in svn that assumes that's what it should be
      doing when paths are passed from svn's internals into the outside world. 
      Additionally in the case of hook scripts it's not clear that converting to
      native is even desireable, since the script doesn't have any way to tell what
      native encoding actually is.
      

      1. 1_subversion-1.4.3-svn_locale_charset.patch
        3 kB
        Subversion Importer
      2. 2_subversion-1.6.5-svn_locale_charset.patch
        3 kB
        Subversion Importer
      3. 3_assumed-native-charset.diff
        2 kB
        Stefan Sperling
      4. 4_force-utf8.diff
        2 kB
        Stefan Sperling
      5. 5_mod_dav_svn.diff
        1 kB
        Stefan Sperling
      6. 6_no-env.diff
        7 kB
        Stefan Sperling

        Activity

        Hide
        subversion-importer Subversion Importer added a comment -

        I just stumbled over this issue and had to work around it somehow.
        
        C's locale concept (and, thus, most Unix command line programs used in hook 
        scripts) assume that there is only one native character set. I think that SVN 
        does the right thing when it converts all input from and all output to the 
        native character set.
        
        The problem is that mod_dav_svn relies on C's locale functions (by using 
        APR_LOCALE_CHARSET which causes a call to nl_langinfo(CODESET) which in turn 
        refers to the current C locale), but C's locale functions are hardly usable in 
        a server environment. Maybe httpd could call setlocale(LC_CTYPE,"") to only 
        change the character set. That's less disruptive but it might still be to much 
        in a server application.
        
        An alternative is to acquire the native charset name by some other means and 
        use it instead of APR_LOCALE_CHARSET. In theory it should be possible to launch 
        an external program that calls setlocale + nl_langinfo and writes the result to 
        stdout, avoiding the setlocale call in the server, but I haven't tried that yet 
        in practice because it's rather complicated.
        
        An enviroment variable is much simpler albeit ugly because the server 
        administrator has to manually replicate information. The patch I'll attach 
        allows overriding APR's APR_LOCALE_CHARSET by setting SVN_LOCALE_CHARSET in 
        httpd's enviroment. Just put "SVN_LOCALE_CHARSET=UTF-8 ; export 
        SVN_LOCALE_CHARSET" in httpd's start script.
        
        The patch only adresses the "Can't convert string from 'UTF-8' to native 
        encoding" problem. The enviroment of the hook scripts is still empty and lacks 
        the LANG / LC_* variables.
        
        

        Original comment by jegrp

        Show
        subversion-importer Subversion Importer added a comment - I just stumbled over this issue and had to work around it somehow. C's locale concept (and, thus, most Unix command line programs used in hook scripts) assume that there is only one native character set. I think that SVN does the right thing when it converts all input from and all output to the native character set. The problem is that mod_dav_svn relies on C's locale functions (by using APR_LOCALE_CHARSET which causes a call to nl_langinfo(CODESET) which in turn refers to the current C locale), but C's locale functions are hardly usable in a server environment. Maybe httpd could call setlocale(LC_CTYPE,"") to only change the character set. That's less disruptive but it might still be to much in a server application. An alternative is to acquire the native charset name by some other means and use it instead of APR_LOCALE_CHARSET. In theory it should be possible to launch an external program that calls setlocale + nl_langinfo and writes the result to stdout, avoiding the setlocale call in the server, but I haven't tried that yet in practice because it's rather complicated. An enviroment variable is much simpler albeit ugly because the server administrator has to manually replicate information. The patch I'll attach allows overriding APR's APR_LOCALE_CHARSET by setting SVN_LOCALE_CHARSET in httpd's enviroment. Just put "SVN_LOCALE_CHARSET=UTF-8 ; export SVN_LOCALE_CHARSET" in httpd's start script. The patch only adresses the "Can't convert string from 'UTF-8' to native encoding" problem. The enviroment of the hook scripts is still empty and lacks the LANG / LC_* variables. Original comment by jegrp
        Hide
        subversion-importer Subversion Importer added a comment -

        Created an attachment (id=636)
        SVN_LOCALE_CHARSET workaround
        
        

        Original comment by jegrp

        Show
        subversion-importer Subversion Importer added a comment - Created an attachment (id=636) SVN_LOCALE_CHARSET workaround Original comment by jegrp
        Hide
        subversion-importer Subversion Importer added a comment -

        Attachment 1_subversion-1.4.3-svn_locale_charset.patch has been added with description: SVN_LOCALE_CHARSET workaround

        Show
        subversion-importer Subversion Importer added a comment - Attachment 1_subversion-1.4.3-svn_locale_charset.patch has been added with description: SVN_LOCALE_CHARSET workaround
        Hide
        ehuelsmann Erik Huelsmann added a comment -

        As far as repositories with non-ascii characters in them, there's an easy
        workaround: don't do that.
        
        As far as the pre-lock hook goes: should we at all be translating the path to
        it's local representation? The charactersets supported on clients shouldn't be
        automatically assumed to be supported on the server...
        

        Show
        ehuelsmann Erik Huelsmann added a comment - As far as repositories with non-ascii characters in them, there's an easy workaround: don't do that. As far as the pre-lock hook goes: should we at all be translating the path to it's local representation? The charactersets supported on clients shouldn't be automatically assumed to be supported on the server...
        Hide
        subversion-importer Subversion Importer added a comment -

        There aren't any non-ASCII characters in the source file names here, but we're
        also using Subversion for localized documentation.  Restricting those file names
        to ASCII can be nasty, and it's not necessary.
        
        The server's charset only needs to be a superset of the characters actually
        used, not a superset of all clients' charsets, and on a server with some form of
        Unicode as its native charset (quite common today), there is no restriction at all.
        
        If a hook script receives a string in a charset other than the native charset,
        it can only treat it as an opaque byte sequence.  Pattern matching on those
        strings with shell built-ins or tools like grep may not work properly.  This
        makes passing non-native strings to shell scripts pretty much useless.  You
        would always need to write hook scripts in a more capable programming language
        that can handle UTF-8 regardless of the native charset.
        
        

        Original comment by jegrp

        Show
        subversion-importer Subversion Importer added a comment - There aren't any non-ASCII characters in the source file names here, but we're also using Subversion for localized documentation. Restricting those file names to ASCII can be nasty, and it's not necessary. The server's charset only needs to be a superset of the characters actually used, not a superset of all clients' charsets, and on a server with some form of Unicode as its native charset (quite common today), there is no restriction at all. If a hook script receives a string in a charset other than the native charset, it can only treat it as an opaque byte sequence. Pattern matching on those strings with shell built-ins or tools like grep may not work properly. This makes passing non-native strings to shell scripts pretty much useless. You would always need to write hook scripts in a more capable programming language that can handle UTF-8 regardless of the native charset. Original comment by jegrp
        Hide
        subversion-importer Subversion Importer added a comment -

        Created an attachment (id=1041)
        SVN_LOCALE_CHARSET workaround (updated for Subversion 1.6.5)
        
        

        Original comment by jegrp

        Show
        subversion-importer Subversion Importer added a comment - Created an attachment (id=1041) SVN_LOCALE_CHARSET workaround (updated for Subversion 1.6.5) Original comment by jegrp
        Hide
        subversion-importer Subversion Importer added a comment -

        Attachment 2_subversion-1.6.5-svn_locale_charset.patch has been added with description: SVN_LOCALE_CHARSET workaround (updated for Subversion 1.6.5)

        Show
        subversion-importer Subversion Importer added a comment - Attachment 2_subversion-1.6.5-svn_locale_charset.patch has been added with description: SVN_LOCALE_CHARSET workaround (updated for Subversion 1.6.5)
        Hide
        cmpilato C. Michael Pilato added a comment -

        Another possible fix:  a mod_dav_svn directive which allows admins to specify
        the "Subversion locale character set".
        
        Also, see more discussion here: http://svn.haxx.se/dev/archive-2010-01/0008.shtml
        

        Show
        cmpilato C. Michael Pilato added a comment - Another possible fix: a mod_dav_svn directive which allows admins to specify the "Subversion locale character set". Also, see more discussion here: http://svn.haxx.se/dev/archive-2010-01/0008.shtml
        Hide
        cmpilato C. Michael Pilato added a comment -

        Add 'patch' keyword to issues which have an attachment of the 'patch' type.
        

        Show
        cmpilato C. Michael Pilato added a comment - Add 'patch' keyword to issues which have an attachment of the 'patch' type.
        Hide
        subversion-importer Subversion Importer added a comment -

        A workaround and detailed tests & discussion are available at
        
        http://svn.apache.org/repos/asf/subversion/trunk/contrib/server-side/mod_setlocale/
        (first committed in r1232059)
        
        It is a simplistic apache httpd module that globally forces a given locale on
        the httpd process (and all its modules) -- a "quick hack" by danielsh that arose
        from an IRC discussion on #svn-dev.
        

        Original comment by neels

        Show
        subversion-importer Subversion Importer added a comment - A workaround and detailed tests & discussion are available at http://svn.apache.org/repos/asf/subversion/trunk/contrib/server-side/mod_setlocale/ (first committed in r1232059) It is a simplistic apache httpd module that globally forces a given locale on the httpd process (and all its modules) -- a "quick hack" by danielsh that arose from an IRC discussion on #svn-dev. Original comment by neels
        Hide
        subversion-importer Subversion Importer added a comment -

        *EDIT* above post:
        mod_setlocale is no longer in contrib, it is now available as "mod-setlocale" at
        apache-extras.org.
        
        SVN:
        http://svn.codespot.com/a/apache-extras.org/mod-setlocale/trunk
        
        ( http://apache-extras.org/p/mod-setlocale/ )
        
        

        Original comment by neels

        Show
        subversion-importer Subversion Importer added a comment - *EDIT* above post: mod_setlocale is no longer in contrib, it is now available as "mod-setlocale" at apache-extras.org. SVN: http://svn.codespot.com/a/apache-extras.org/mod-setlocale/trunk ( http://apache-extras.org/p/mod-setlocale/ ) Original comment by neels
        Hide
        stsp Stefan Sperling added a comment -

        Created an attachment (id=1206)
        simpler alternative to Jan Echternach's patch
        
        

        Show
        stsp Stefan Sperling added a comment - Created an attachment (id=1206) simpler alternative to Jan Echternach's patch
        Hide
        stsp Stefan Sperling added a comment -

        Attachment 3_assumed-native-charset.diff has been added with description: simpler alternative to Jan Echternach's patch

        Show
        stsp Stefan Sperling added a comment - Attachment 3_assumed-native-charset.diff has been added with description: simpler alternative to Jan Echternach's patch
        Hide
        stsp Stefan Sperling added a comment -

        The above patch is shorter than Jan's patch but has the same effect.
        
        I am not sure if we really want to do this. With locales, setlocale() will fail
        if the system does not support the requested character set. That is a useful
        safety check.
        
        With this patch we'd allow any character set to be specified without any safety
        check. Users could specify invalid character sets (i.e. not supported by the
        underlying iconv() implementation), or character sets that cause input to be
        mangled and lead to assertions. E.g. specifying UTF-16 can lead to assertion
        failures from functions like svn_dirent_is_absolute().
        

        Show
        stsp Stefan Sperling added a comment - The above patch is shorter than Jan's patch but has the same effect. I am not sure if we really want to do this. With locales, setlocale() will fail if the system does not support the requested character set. That is a useful safety check. With this patch we'd allow any character set to be specified without any safety check. Users could specify invalid character sets (i.e. not supported by the underlying iconv() implementation), or character sets that cause input to be mangled and lead to assertions. E.g. specifying UTF-16 can lead to assertion failures from functions like svn_dirent_is_absolute().
        Hide
        stsp Stefan Sperling added a comment -

        Given the inherent shoot-your-own-foot problem described above, how about we
        provide an override that causes svn to assume the native encoding is UTF-8? This
        is a lot safer because our code is guaranteed to work fine with UTF-8 (as
        opposed to, say, UTF-16).
        
        mod_dav_svn could then use this override, possibly only if a config file switch
        activates this new behaviour.
        
        Users who need to use multi-byte characters in repository names or hook scripts
        would then need to use UTF-8. Since UTF-8 is virtually universally supported at
        this point this should not be a problem.
        

        Show
        stsp Stefan Sperling added a comment - Given the inherent shoot-your-own-foot problem described above, how about we provide an override that causes svn to assume the native encoding is UTF-8? This is a lot safer because our code is guaranteed to work fine with UTF-8 (as opposed to, say, UTF-16). mod_dav_svn could then use this override, possibly only if a config file switch activates this new behaviour. Users who need to use multi-byte characters in repository names or hook scripts would then need to use UTF-8. Since UTF-8 is virtually universally supported at this point this should not be a problem.
        Hide
        stsp Stefan Sperling added a comment -

        Created an attachment (id=1207)
        Patch that implements the above suggestion.
        
        

        Show
        stsp Stefan Sperling added a comment - Created an attachment (id=1207) Patch that implements the above suggestion.
        Hide
        stsp Stefan Sperling added a comment -

        Attachment 4_force-utf8.diff has been added with description: Patch that implements the above suggestion.

        Show
        stsp Stefan Sperling added a comment - Attachment 4_force-utf8.diff has been added with description: Patch that implements the above suggestion.
        Hide
        stsp Stefan Sperling added a comment -

        Created an attachment (id=1208)
        mod_dav_svn part for above diff
        
        

        Show
        stsp Stefan Sperling added a comment - Created an attachment (id=1208) mod_dav_svn part for above diff
        Hide
        stsp Stefan Sperling added a comment -

        Attachment 5_mod_dav_svn.diff has been added with description: mod_dav_svn part for above diff

        Show
        stsp Stefan Sperling added a comment - Attachment 5_mod_dav_svn.diff has been added with description: mod_dav_svn part for above diff
        Hide
        stsp Stefan Sperling added a comment -

        Created an attachment (id=1209)
        Patch that does not use an environment variable
        
        

        Show
        stsp Stefan Sperling added a comment - Created an attachment (id=1209) Patch that does not use an environment variable
        Hide
        stsp Stefan Sperling added a comment -

        Attachment 6_no-env.diff has been added with description: Patch that does not use an environment variable

        Show
        stsp Stefan Sperling added a comment - Attachment 6_no-env.diff has been added with description: Patch that does not use an environment variable
        Hide
        stsp Stefan Sperling added a comment -

        The above patch implements the same fix but keeps its impact entirely within
        mod_dav_svn. The only way to trigger the new behaviour is via a SVNUseUTF8
        config switch in httpd.conf.
        

        Show
        stsp Stefan Sperling added a comment - The above patch implements the same fix but keeps its impact entirely within mod_dav_svn. The only way to trigger the new behaviour is via a SVNUseUTF8 config switch in httpd.conf.
        Hide
        stsp Stefan Sperling added a comment -

        More discussion at http://svn.haxx.se/dev/archive-2012-01/0486.shtml
        

        Show
        stsp Stefan Sperling added a comment - More discussion at http://svn.haxx.se/dev/archive-2012-01/0486.shtml
        Hide
        stsp Stefan Sperling added a comment -

        Previous patch committed in r1239203. Closing as FIXED for 1.8.
        

        Show
        stsp Stefan Sperling added a comment - Previous patch committed in r1239203. Closing as FIXED for 1.8.

          People

          • Assignee:
            Unassigned
            Reporter:
            rooneg Garrett Rooney
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development