Changes from Andi's version:
- Changed the name of the class to ASCIIFoldingFilter
- Added the Unicode character descriptions to comments on each character
- Added a test class
- Added several other Unicode blocks from which characters are converted to their ASCII equivalents. Added characters include digits and punctuation.
I did not provide mappings for characters from the Math block - flattening circled plus, for example, didn't seem appropriate.
I did provide mappings for IPA and two other phonetic character blocks, and I'm not sure whether this is appropriate. I was following what seemed to me to be the logic of Andi's mappings, and of those provided by Latin1AccentFilter: convert characters to the ASCII characters they look like. As a result, e.g., the character described as "LATIN SMALL LETTER TURNED M" (U+026F) from the IPA block is mapped to "m", regardless of its actual phonetic value.
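To illustrate the "map to whatever it looks like in ASCII" logic described above, here is a minimal sketch in Python. It is not the filter's code: FOLD_TABLE and fold_to_ascii are hypothetical names, and the four entries are illustrative examples of a letter, an IPA character, a digit and a punctuation mark rather than an excerpt from the patch.

```python
# Hypothetical fold table: code point -> ASCII look-alike.
# The entries are illustrative only; the real patch has far more mappings.
FOLD_TABLE = {
    0x00E9: "e",  # LATIN SMALL LETTER E WITH ACUTE
    0x026F: "m",  # LATIN SMALL LETTER TURNED M (IPA Extensions block)
    0x2460: "1",  # CIRCLED DIGIT ONE
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
}


def fold_to_ascii(text):
    """Replace each mapped character with its ASCII look-alike.

    Characters without an entry (e.g. U+2295 CIRCLED PLUS) pass through
    unchanged, mirroring the decision not to flatten the Math block.
    """
    return text.translate(FOLD_TABLE)


print(fold_to_ascii("caf\u00e9"))
```

Keeping the Unicode character name next to each entry is what makes a table like this reviewable by eye.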
There are lots of mappings in there now. I generated the mappings by Perl scripting over the contents of the Unicode 5.1 version of UnicodeData.txt from Unicode.org, after grep'ing for character names containing, e.g., "LATIN" and "LETTER" or "DIGRAPH", etc., and then moved things around to the appropriate places by hand. I guess this is one weakness of this patch: it's large enough that manual verification is tough. It's my hope that adding the Unicode character descriptions will make verification at least somewhat easier.
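The name-based selection step can be sketched roughly as follows. This is a Python stand-in for the Perl scripting, not the script that was actually used. The function name latin_candidates is hypothetical, and the filter condition is one reading of the grep terms mentioned above. The sample lines use the semicolon-separated layout of UnicodeData.txt but are abbreviated to the first three fields (code point, name, general category).

```python
def latin_candidates(lines):
    """Yield (code_point, name) for entries whose name looks like a Latin
    letter or digraph. Assumes semicolon-separated UnicodeData.txt-style
    lines with the code point in field 0 and the character name in field 1.
    """
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        code_point = int(fields[0], 16)
        name = fields[1]
        # Assumed condition: "LATIN" together with "LETTER" or "DIGRAPH".
        if "LATIN" in name and ("LETTER" in name or "DIGRAPH" in name):
            yield code_point, name


# Abbreviated sample lines (first three fields only), for illustration.
SAMPLE = [
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll",
    "026F;LATIN SMALL LETTER TURNED M;Ll",
    "02A3;LATIN SMALL LETTER DZ DIGRAPH;Ll",
    "2295;CIRCLED PLUS;Sm",
]

for code_point, name in latin_candidates(SAMPLE):
    # Emit the code point with its description, ready to paste as a comment.
    print(f"U+{code_point:04X}  {name}")
```

A pass like this only produces candidates; deciding the ASCII target for each one and grouping the entries was still done by hand.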