I've pulled together some documentation on how different languages handle non-ASCII characters in identifiers. You'll see that languages vary greatly in which non-ASCII characters are allowed in identifiers, whether or not identifiers are normalized, and how they are normalized when they are.
One of the goals of Avro is to support specifications that interoperate well across languages. Given all the variability in how different languages handle non-ASCII characters, I stand by what I said earlier: handling Unicode well in Avro is a lot of work, and doing it poorly (as we do now) just creates nasty interop problems.
The Unicode consortium has published a recommendation for defining Unicode identifiers:
C# follows it almost exactly; Python follows it mostly; Java kind of follows it, but not really; C/C++ ignore it; and, as far as I can tell, neither Ruby nor PHP has given Unicode identifiers much thought at all.
Regarding Python: Python 2.x allowed only ASCII characters in identifiers. It wasn't until Python 3.x that Unicode characters were allowed. Python 3.x follows Unicode TR31. However, while Python calls for NFKC normalization, it does not use the "modified" NFKC normalization recommended in TR31.
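To make the Python 3 behavior concrete, here is a minimal sketch (the identifier and the variable names are made up for illustration). It assigns through a name spelled with the single-character ligature U+FB01 and then lists what actually ended up in the namespace:

```python
import unicodedata

# U+FB01 is the single-character ligature "fi"; NFKC folds it to "f" + "i".
ligature_name = "\ufb01le"
folded = unicodedata.normalize("NFKC", ligature_name)

# Assign through the ligature spelling, then see which name was stored.
namespace = {}
exec(ligature_name + " = 1", namespace)

# exec() also adds "__builtins__" to the dict, so filter that out.
stored_names = [name for name in namespace if not name.startswith("__")]
print(folded, stored_names)
```

The stored name should be the plain ASCII spelling, which is the kind of silent renaming that another language reading the same schema would not necessarily perform.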
C# follows Unicode TR31 exactly (except that it allows identifiers to start with an underscore). Thus, C#'s handling of non-ASCII identifiers is similar to Python's, except that C# calls for NFC rather than NFKC. Also, C# requires that its input arrives in normal form, and states that "The behavior when encountering an identifier not in Normalization Form C is implementation-defined; however, a diagnostic is not required" (presumably a diagnostic would be allowed). Python, on the other hand, says that "identifiers are converted into the normal form NFKC while parsing."
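The NFC versus NFKC difference is easy to see with Python's `unicodedata` module. This is only a sketch: `check_nfc` is a hypothetical helper standing in for the kind of "is the input already in NFC?" check a C# implementation could choose to make, and the sample identifiers are invented.

```python
import unicodedata

def check_nfc(identifier):
    """Return True if the identifier is already in Normalization Form C.

    Hypothetical helper; it mimics the check a C#-style compiler might
    perform on its input, not any real compiler's behavior.
    """
    return unicodedata.normalize("NFC", identifier) == identifier

decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01le"       # starts with U+FB01 LATIN SMALL LIGATURE FI

# NFC composes the accent but leaves the compatibility ligature alone.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
nfc_ligature = unicodedata.normalize("NFC", ligature)

# NFKC additionally folds the ligature into plain "f" + "i".
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(check_nfc(decomposed), check_nfc(ligature))
print(nfc_ligature == ligature, nfkc_ligature)
```

So an identifier that is one name under NFKC rules can be a different name under NFC rules.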
Java makes no reference to TR31, but it does seem to have been inspired by it. However, it's more restrictive than TR31 (and thus C# and Python). For example, while Python (and TR31) allow non-spacing marks, Java does not. Also, unlike TR31/C#/Python, the Java language does not call for normalization, and is rather explicit about this: "Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers."
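Here is a rough illustration of what that means for name lookup, using the two spellings from the quote. The set names are made up; the first set compares raw code points the way the quoted Java rule describes, and the second applies NFC before comparing:

```python
import unicodedata

composed = "\u00c1"      # LATIN CAPITAL LETTER A WITH ACUTE
decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# Without normalization the two spellings are two distinct names.
raw_names = {composed, decomposed}

# With NFC applied first, they collapse into a single name.
normalized_names = {unicodedata.normalize("NFC", name) for name in raw_names}

print(len(raw_names), len(normalized_names))
```

A schema with both spellings as field names would therefore be fine for a consumer that compares raw code points and would have a name collision for one that normalizes.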
C/C++ do not come close to TR31 and are still very restrictive. The specification lists just a few sets of non-ASCII letters that can be in an identifier (http://www.kuzbass.ru:8086/docs/isocpp/extendid.html#extendid). These exclude many other Unicode letters that are allowed by C#, Python and Java, and exclude other non-letter characters (such as connecting punctuation) allowed in those languages. Also, while TR31/C#/Java/Python allow non-Arabic digits in identifiers (e.g., Ethiopic digits), C/C++ do not.
PHP defines a letter as follows: "a letter is a-z, A-Z, and the bytes from 127 through 255 (0x7f-0xff)." It says nothing about Unicode, including anything about normalization. Since much of the time input is presumably in UTF-8, the 0x7f-0xff range implicitly captures everything in Unicode that isn't in the Basic Latin block – this goes way beyond what's allowed by the languages discussed above. In short, they just haven't thought about the problem.
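A quick sketch of why the byte-range rule is so permissive. Both helpers are hypothetical and are not PHP's actual lexer; they assume the source is UTF-8 and that a name is a letter or underscore followed by letters, digits, or underscores, with "letter" defined as in the quote above.

```python
def is_php_letter(byte):
    """Byte-level 'letter' test from the quoted rule: a-z, A-Z, 0x7f-0xff."""
    return (0x41 <= byte <= 0x5A) or (0x61 <= byte <= 0x7A) or (0x7F <= byte <= 0xFF)

def looks_like_php_name(text):
    """Apply the byte rule to the UTF-8 encoding of text (assumed encoding)."""
    data = text.encode("utf-8")
    if not data:
        return False
    first, rest = data[0], data[1:]
    if not (is_php_letter(first) or first == 0x5F):   # 0x5F is "_"
        return False
    return all(is_php_letter(b) or b == 0x5F or 0x30 <= b <= 0x39 for b in rest)

samples = {
    "accented letter": "caf\u00e9",
    "multiplication sign": "\u00d7",
    "snowman": "\u2603",
    "no-break space": "\u00a0x",
    "ascii space": "a b",
}
for label, text in samples.items():
    print(label, looks_like_php_name(text))
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so symbols and even non-ASCII whitespace pass the test; only the ASCII space is rejected.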
I can't find a language spec for Ruby or much discussion on Unicode variables in that language. More generally, it looks like Ruby's support for Unicode was bad prior to 1.9 (Jan 2009). Here's a discussion of how 1.9 makes it better: http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html But there isn't any discussion of variable names.
Here's some summary info on support for Unicode variable-names in many different languages: