Yonik, or anyone else, please let me know your thoughts on the following:
I don't see a real back compat issue... I can't imagine anyone relying on the fact that >BMP chars wouldn't be lowercased. To rely on that would also be relying on undocumented behavior.
Ah, OK. Actually it just occurred to me that this would also require reindexing, otherwise queries that hit documents in the past would mysteriously start missing them (for text outside the BMP).
What should be our approach here wrt index back compat?
For the issues mentioned here, I can't possibly see >BMP working currently for anyone, but you are right, it will change results.
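To make the change in results concrete, here is a minimal Java sketch of the difference between lowercasing one char at a time and lowercasing by code point. The class and method names are hypothetical and are not Lucene APIs. It only assumes the standard `java.lang.Character` methods and uses U+10400 (DESERET CAPITAL LETTER LONG I) as a sample character outside the BMP.

```java
// Hypothetical demo class; the names here are illustrative, not Lucene APIs.
final class SupplementaryLowerCaseDemo {

    // Lowercases one UTF-16 code unit at a time. A supplementary character is
    // stored as a surrogate pair, and each surrogate passes through unchanged,
    // so text outside the BMP is never lowercased.
    static String lowerCaseByChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Lowercases one code point at a time, so surrogate pairs are decoded
    // before the case mapping is applied.
    static String lowerCaseByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400));
        // The two terms differ, which is why an index built with the
        // char-based filter would stop matching after the fix.
        System.out.println(lowerCaseByChar(upper).equals(lowerCaseByCodePoint(upper)));
    }
}
```

A document indexed with the char-based version keeps the uppercase term, while a query analyzed with the code-point version produces the lowercase term, so the two no longer match without reindexing.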
I don't want to break index back compat, just wanted to mention that introducing Unicode 4 support, still with API back compat, with no performance degradation, is going to be somewhat challenging already.
If we want to somehow support the "broken" analysis components for index back compat, then we have to also have a broken implementation available on top of the correct impl (probably using Version to handle this).
In my opinion, this would introduce a lot of complexity. I will help do it, though, if we must.
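As a rough sketch of what "a broken implementation on top of the correct impl" could look like, the Java below switches between the two behaviors on a compatibility setting. The `CompatVersion` enum and `VersionedLowerCaser` class are hypothetical stand-ins invented for illustration; they are not Lucene's actual Version class or any existing filter.

```java
// Hypothetical sketch; CompatVersion and VersionedLowerCaser are made-up names.
final class VersionedLowerCaser {

    // Stand-in for a version constant: LEGACY reproduces the old char-based
    // behavior, UNICODE_4 applies code-point-aware case mapping.
    enum CompatVersion { LEGACY, UNICODE_4 }

    private final CompatVersion version;

    VersionedLowerCaser(CompatVersion version) {
        this.version = version;
    }

    String lowerCase(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        if (version == CompatVersion.LEGACY) {
            // Old behavior kept for existing indexes: surrogates are left as-is.
            for (int i = 0; i < s.length(); i++) {
                sb.append(Character.toLowerCase(s.charAt(i)));
            }
        } else {
            // Correct behavior: decode each code point before lowercasing.
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                sb.appendCodePoint(Character.toLowerCase(cp));
                i += Character.charCount(cp);
            }
        }
        return sb.toString();
    }
}
```

Even in this toy form, every affected component carries two code paths, which is the complexity cost mentioned above. The two paths agree on BMP-only text and differ only for supplementary characters.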