A couple of points to consider:
1. " If we change the default that would change the results current users are getting." I am in absolute agreement here. I do not wish to change the default, the horse has left the gate. If I have referenced that in my previous posts, it's in the context of other ideas etc. My issue with the current default isn't the choice of what it is, as much as there is no way to change it on a case by base basis.
2. Paragraph 2: Those are solid points based on risks to the ongoing maintenance of Hive. That said, most Relational Databases, including MySQL have the option to change collation for given databases. True it may not be global setting (although the collation default is a global setting), but rather a per database setting. I.e. in MySQL you can have one database use a case sensitive collation and another use a case insensitive collation. Perhaps hive-site isn't the place for this, but metadata is (Set it as a per database or table setting?)
3. For the most part, we model functionality in hive based on what mysql does - Because of Point 1, this is moot, but by default, MySQL is case insensitive.
4. Most users do not want to have to heavily test before upgrade, they want consistent behaviour between versions. - This is exactly why I think we need an option for administrators of Hive to be able to set the case sensitivity on database/table level, and the default case sensitivity at a global level. Most users do not test. From a philosophical (data philosopher?) point of view, users will not test, therefore if failure is to happen because of lack of testing, please fail with pomp and circumstance.
I.e. Both lines of users (MySQL/MSSQL Migrating to Hive and Oracle/Postgres Migrating to Hive) are going to make assumptions when they write their queries that will cause a failure to happen. If LIKE is Case Sensitive as it is, Oracle/Postgres users will assume correctly and all will be well. MySQL/Users will assume wrong, but they won't KNOW they assumed wrong, they will just assume no results (given a lack of testing). The converse is actually the preferred model: If LIKE is not case sensitive, then Oracle/Postgres users will assume incorrectly, they will run a query, and they will get their expected results, but will also get extra results allowing them to understand they assumed wrong. MySQL/MSSQL users will assume correctly and get their results.
Going back to Point 1, the horse has left the building, we can't change default because of what it would do to the current user base, thus the next best option is to allow administrators to set that so users don't have to deal with it, document it so good administrators can handle it out of the gate, and to find a way to do it like '%MysQL%' (i.e. similar to collation settings on databases rather then a global setting as described previously).
I do see the challenges with a global setting affecting how a function works; no precedence for that, and thus it could introduce risk, but I still hold that the current risk, especially on a system that is touted as modeled after MySQL (that's how it was explained to me, thus I assumed LIKE was case insensitive), to users is high. As I was typing this novelette, I realized I tossed out an idea related to per database settings. Thoughts on that? That is similar to how MySQL handles it, and thus follows the models mentioned while avoiding a global setting that affects the behavior of a UDF. (The setting doesn't change the UDF, the "collation equivalent" setting on the databases does, and thus the global setting is just the default collation equivalent.