Hive
  1. Hive
  2. HIVE-4070

Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Trivial Trivial
    • Resolution: Unresolved
    • Affects Version/s: 0.10.0
    • Fix Version/s: None
    • Component/s: UDF
    • Labels:
      None

      Description

      Hive's like operator seems to be case sensitive.
      See https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164

      However, MySQL's like operator is case insensitive. I don't have other DB's (like PostgreSQL) installed and handy but I am guessing their LIKE is case insensitive as well.

        Activity

        Hide
        Sean Mackrory added a comment -

        Though admittedly it was a question of mine trying to clarify HiveQL that spawned this thought, I don't necessarily think HiveQL's LIKE operator should be case insensitive. It would certainly make HiveQL more consistent with SQL if it was, but when running extremely long queries over certain types of data, I could see it being a nice feature that LIKE gives you the flexibility of wild-cards without the overhead of case insensitivity. There's always the option of using RLIKE when one is looking for a very flexible match, but I think this is a toss-up between compliance with SQL's behavior and being able to limit the overhead of the query when possible.

        Show
        Sean Mackrory added a comment - Though admittedly it was a question of mine trying to clarify HiveQL that spawned this thought, I don't necessarily think HiveQL's LIKE operator should be case insensitive. It would certainly make HiveQL more consistent with SQL if it was, but when running extremely long queries over certain types of data, I could see it being a nice feature that LIKE gives you the flexibility of wild-cards without the overhead of case insensitivity. There's always the option of using RLIKE when one is looking for a very flexible match, but I think this is a toss-up between compliance with SQL's behavior and being able to limit the overhead of the query when possible.
        Hide
        Gwen Shapira added a comment -

        Oracle's "LIKE" (as well as any other char/varchar comparison) is case sensitive.
        No matter how HiveQL behaves it can't be consistent with every SQL implementation out there.

        Show
        Gwen Shapira added a comment - Oracle's "LIKE" (as well as any other char/varchar comparison) is case sensitive. No matter how HiveQL behaves it can't be consistent with every SQL implementation out there.
        Hide
        Mark Grover added a comment -

        Thanks Gwen. In that case, I am ok with documenting it and resolving this JIRA as won't fix.

        Is that ok with you as well, Sean Mackrory?

        Show
        Mark Grover added a comment - Thanks Gwen. In that case, I am ok with documenting it and resolving this JIRA as won't fix. Is that ok with you as well, Sean Mackrory ?
        Hide
        John Omernik added a comment -

        What about this? Leave it case sensitive as it is, but provide a HIVE Variable that allows it to be set globally. That way, an administrator can set the behaivior of LIKE based on what their users are migrating in from. I.e. if MSSQL or MYSQL then set it to be case insensitive, else use the default of case sensitive. The issue here is one of transitioning and potential false negatives because of the assumption. Correct, you can't be consistent with every SQL implementation out there, however, I think there is some precedence for being LIKE '%mySql%'(see what I did there) which is case insensitive by default.

        Show
        John Omernik added a comment - What about this? Leave it case sensitive as it is, but provide a HIVE Variable that allows it to be set globally. That way, an administrator can set the behaivior of LIKE based on what their users are migrating in from. I.e. if MSSQL or MYSQL then set it to be case insensitive, else use the default of case sensitive. The issue here is one of transitioning and potential false negatives because of the assumption. Correct, you can't be consistent with every SQL implementation out there, however, I think there is some precedence for being LIKE '%mySql%'(see what I did there) which is case insensitive by default.
        Hide
        Edward Capriolo added a comment -

        Global conf variables that affect something like this are a bit scary. (It is also not common in how hive works now, global vars only effect performance usually not results. In the end 'like' is a UDF, I think the best solution is to create mlike 'mysql like' and implement that to work as desired.

        Show
        Edward Capriolo added a comment - Global conf variables that affect something like this are a bit scary. (It is also not common in how hive works now, global vars only effect performance usually not results. In the end 'like' is a UDF, I think the best solution is to create mlike 'mysql like' and implement that to work as desired.
        Hide
        John Omernik added a comment -

        They may be a bit scary, but from what perspective? The programming of the system or the users/operators of the data warehouse that need consistency in results? The scary thing about the LIKE operator and it not being able to be controlled in it's behavior is ensuring that assumptions by users don't lead to false negative hits. I know personally, I came from MySQl/MSSQL and I had that happened to me until I figured out the difference. It isn't well documented, I can't control it, it bit me. Ok, no problem, how do I fix it now in my environment, most of my users are from that same background, so now I have to do training to change this? How effective will that be? Will people forget? Will I get false negatives?

        A "mlike" may be an option, however, I am not sure; it still requires the user to remember there is a change, is mlike better than other work arounds? Not if the user forgets to use it and in the case of like, the end result is false negatives(user forgets, no results, moves on not realizing they are missing something).

        If a global option isn't available, could like be case insensitive and an olike function (oracle like) be added? I will say that while false positives are a pain, at least you can look at the results and say "wait, why isn't this case sensitive?" and you know something is different than what you are used to (i.e. Oracle, hence olike). You see and then understand there is a discrepancy. With mlike and case sensitivity being the default, you don't have any indicator, because unexpected case sensitivity produces false negatives, unexpected case insensitivity produces false positives. In analytics false positives are far preferred than false negatives.

        Just brainstorming here, what if we implemented mlike AND olike (my/ms SQL like and oracle like) (or cslike and cilike (case sensitive and case insensitive)) Then we have the like function just be a link to mlike or olike. Since Hive started out with case sensitivity, we could have like link to olike, but it can be changed to mlike in the hive-site. The point being, absence of specification, default to normal (LIKE = OLIKE Case sensitive) hive.like.insensitive = true make it MLIKE. Simple, doesn't change the default hive setting/behavior (out of the box), gives the option of global setting, and from a programing standpoint we are doing an if test on what LIKE links to, and we'd have olike and mlike available as UDFs when a user wanted to do the opposite of what the default setting is.

        Show
        John Omernik added a comment - They may be a bit scary, but from what perspective? The programming of the system or the users/operators of the data warehouse that need consistency in results? The scary thing about the LIKE operator and it not being able to be controlled in it's behavior is ensuring that assumptions by users don't lead to false negative hits. I know personally, I came from MySQl/MSSQL and I had that happened to me until I figured out the difference. It isn't well documented, I can't control it, it bit me. Ok, no problem, how do I fix it now in my environment, most of my users are from that same background, so now I have to do training to change this? How effective will that be? Will people forget? Will I get false negatives? A "mlike" may be an option, however, I am not sure; it still requires the user to remember there is a change, is mlike better than other work arounds? Not if the user forgets to use it and in the case of like, the end result is false negatives(user forgets, no results, moves on not realizing they are missing something). If a global option isn't available, could like be case insensitive and an olike function (oracle like) be added? I will say that while false positives are a pain, at least you can look at the results and say "wait, why isn't this case sensitive?" and you know something is different than what you are used to (i.e. Oracle, hence olike). You see and then understand there is a discrepancy. With mlike and case sensitivity being the default, you don't have any indicator, because unexpected case sensitivity produces false negatives, unexpected case insensitivity produces false positives. In analytics false positives are far preferred than false negatives. Just brainstorming here, what if we implemented mlike AND olike (my/ms SQL like and oracle like) (or cslike and cilike (case sensitive and case insensitive)) Then we have the like function just be a link to mlike or olike. Since Hive started out with case sensitivity, we could have like link to olike, but it can be changed to mlike in the hive-site. The point being, absence of specification, default to normal (LIKE = OLIKE Case sensitive) hive.like.insensitive = true make it MLIKE. Simple, doesn't change the default hive setting/behavior (out of the box), gives the option of global setting, and from a programing standpoint we are doing an if test on what LIKE links to, and we'd have olike and mlike available as UDFs when a user wanted to do the opposite of what the default setting is.
        Hide
        Mark Grover added a comment -

        I agree with Edward that having a property like that is risky. I also like John's idea of having two different like UDFs. We can only change the behavior (if at all) in a major release. I think in this case, we are just going to have to pick one kind of like and implement another one. Yeah, people coming from MySQL land may get confused but that's the best we can do is document it and ask them to use the other like.

        John, it would be great if you could create a JIRA for the case-insensitive like. Thanks!

        Show
        Mark Grover added a comment - I agree with Edward that having a property like that is risky. I also like John's idea of having two different like UDFs. We can only change the behavior (if at all) in a major release. I think in this case, we are just going to have to pick one kind of like and implement another one. Yeah, people coming from MySQL land may get confused but that's the best we can do is document it and ask them to use the other like. John, it would be great if you could create a JIRA for the case-insensitive like. Thanks!
        Hide
        John Omernik added a comment -

        Edward and Mark: Can you please explain risky (from a programming PoV)in the situations you are talking about? I am not a programmer, so I am looking at this from both and end-user and from the point of view of an organization that is going to implement Hive to assist with business decisions.

        When I see the two risks, i.e. my understanding of the programming risk (Which I am admitting I have very little understanding, hence the asking for the explanation) vs. the implementation risk i.e. organizations that are migrating to Hive, I worry that those decisions, which are based on false negatives because of the current choices (case sensitivity), could be very bad from a business/implementation risk standpoint.

        Please do not take my persistent questioning as being combative, I am just trying to look at this purely from a user/organization point of view. I.e. training people to do things different, the results from poor training or employee turnover, etc causing these false negatives on data. There has to be an elegant way to give the organization control over this... even documenting such a setting for administrators is nice, but it still relies on user training/choices to ensure business decisions are not based on false negatives. If we had a setting we could change, we could point out the difference, and give users/org and option that fits with each org's use case.

        As I said, just trying to understand about the risk on the programming side, I see the risk, as a power user, administrator etc as having a high likelihood of occurring, and the impact of said risk (false negatives) as being high as well. That said, I defer to you guys on the programming side.

        Show
        John Omernik added a comment - Edward and Mark: Can you please explain risky (from a programming PoV)in the situations you are talking about? I am not a programmer, so I am looking at this from both and end-user and from the point of view of an organization that is going to implement Hive to assist with business decisions. When I see the two risks, i.e. my understanding of the programming risk (Which I am admitting I have very little understanding, hence the asking for the explanation) vs. the implementation risk i.e. organizations that are migrating to Hive, I worry that those decisions, which are based on false negatives because of the current choices (case sensitivity), could be very bad from a business/implementation risk standpoint. Please do not take my persistent questioning as being combative, I am just trying to look at this purely from a user/organization point of view. I.e. training people to do things different, the results from poor training or employee turnover, etc causing these false negatives on data. There has to be an elegant way to give the organization control over this... even documenting such a setting for administrators is nice, but it still relies on user training/choices to ensure business decisions are not based on false negatives. If we had a setting we could change, we could point out the difference, and give users/org and option that fits with each org's use case. As I said, just trying to understand about the risk on the programming side, I see the risk, as a power user, administrator etc as having a high likelihood of occurring, and the impact of said risk (false negatives) as being high as well. That said, I defer to you guys on the programming side.
        Hide
        Edward Capriolo added a comment -

        The risk, as I see it, is that there are already a large number of people depending on the current behaviour. If we change the default that would change the results current users are getting. Better that the new users learn how hive works, since they are learning anyway, then break assumptions of current users. Most users do not want to have to heavily test before upgrade, they want consistent behaviour between versions.

        You suggestion to have a global or session level property is a good one. There are some cases where I have thought about doing this. In general, it is not ideal because no other component in hive works this way. Thus having a one-off configuration for handling how like statements work is odd. Also the query is no longer self documenting. Based on how some parameter outside the query is set, the system functions differently. Imagine if we had 10 such parameters could the same query produce 100 different results based on permutations of properties?

        For the most part, we model functionality in hive based on what mysql does. You will find a lot of compatibility in how UDFs work and other language features.

        There are many ways this can be dealt with, hive has 'like' and 'rlike'. If there is an sql standard on how like must work that might be ammo for the argument of changing the default, but basing a change solely on how mysql does something just for new users is not attractive. MySQL has made its own bad choices over the years (non standard things like enum) (non standard date/time types) (non standard ways to specify indexes).

        I am guessing that hive's like is the way it is because hive initially only supported java's UTf8 strings and that comparison is by default case sensitive.

        Show
        Edward Capriolo added a comment - The risk, as I see it, is that there are already a large number of people depending on the current behaviour. If we change the default that would change the results current users are getting. Better that the new users learn how hive works, since they are learning anyway, then break assumptions of current users. Most users do not want to have to heavily test before upgrade, they want consistent behaviour between versions. You suggestion to have a global or session level property is a good one. There are some cases where I have thought about doing this. In general, it is not ideal because no other component in hive works this way. Thus having a one-off configuration for handling how like statements work is odd. Also the query is no longer self documenting. Based on how some parameter outside the query is set, the system functions differently. Imagine if we had 10 such parameters could the same query produce 100 different results based on permutations of properties? For the most part, we model functionality in hive based on what mysql does. You will find a lot of compatibility in how UDFs work and other language features. There are many ways this can be dealt with, hive has 'like' and 'rlike'. If there is an sql standard on how like must work that might be ammo for the argument of changing the default, but basing a change solely on how mysql does something just for new users is not attractive. MySQL has made its own bad choices over the years (non standard things like enum) (non standard date/time types) (non standard ways to specify indexes). I am guessing that hive's like is the way it is because hive initially only supported java's UTf8 strings and that comparison is by default case sensitive.
        Hide
        John Omernik added a comment -

        A couple of points to consider:
        1. " If we change the default that would change the results current users are getting." I am in absolute agreement here. I do not wish to change the default, the horse has left the gate. If I have referenced that in my previous posts, it's in the context of other ideas etc. My issue with the current default isn't the choice of what it is, as much as there is no way to change it on a case by base basis.

        2. Paragraph 2: Those are solid points based on risks to the ongoing maintenance of Hive. That said, most Relational Databases, including MySQL have the option to change collation for given databases. True it may not be global setting (although the collation default is a global setting), but rather a per database setting. I.e. in MySQL you can have one database use a case sensitive collation and another use a case insensitive collation. Perhaps hive-site isn't the place for this, but metadata is (Set it as a per database or table setting?)

        3. For the most part, we model functionality in hive based on what mysql does - Because of Point 1, this is moot, but by default, MySQL is case insensitive.

        4. Most users do not want to have to heavily test before upgrade, they want consistent behaviour between versions. - This is exactly why I think we need an option for administrators of Hive to be able to set the case sensitivity on database/table level, and the default case sensitivity at a global level. Most users do not test. From a philosophical (data philosopher?) point of view, users will not test, therefore if failure is to happen because of lack of testing, please fail with pomp and circumstance.

        I.e. Both lines of users (MySQL/MSSQL Migrating to Hive and Oracle/Postgres Migrating to Hive) are going to make assumptions when they write their queries that will cause a failure to happen. If LIKE is Case Sensitive as it is, Oracle/Postgres users will assume correctly and all will be well. MySQL/Users will assume wrong, but they won't KNOW they assumed wrong, they will just assume no results (given a lack of testing). The converse is actually the preferred model: If LIKE is not case sensitive, then Oracle/Postgres users will assume incorrectly, they will run a query, and they will get their expected results, but will also get extra results allowing them to understand they assumed wrong. MySQL/MSSQL users will assume correctly and get their results.

        Going back to Point 1, the horse has left the building, we can't change default because of what it would do to the current user base, thus the next best option is to allow administrators to set that so users don't have to deal with it, document it so good administrators can handle it out of the gate, and to find a way to do it like '%MysQL%' (i.e. similar to collation settings on databases rather then a global setting as described previously).

        I do see the challenges with a global setting affecting how a function works; no precedence for that, and thus it could introduce risk, but I still hold that the current risk, especially on a system that is touted as modeled after MySQL (that's how it was explained to me, thus I assumed LIKE was case insensitive), to users is high. As I was typing this novelette, I realized I tossed out an idea related to per database settings. Thoughts on that? That is similar to how MySQL handles it, and thus follows the models mentioned while avoiding a global setting that affects the behavior of a UDF. (The setting doesn't change the UDF, the "collation equivalent" setting on the databases does, and thus the global setting is just the default collation equivalent.

        Show
        John Omernik added a comment - A couple of points to consider: 1. " If we change the default that would change the results current users are getting." I am in absolute agreement here. I do not wish to change the default, the horse has left the gate. If I have referenced that in my previous posts, it's in the context of other ideas etc. My issue with the current default isn't the choice of what it is, as much as there is no way to change it on a case by base basis. 2. Paragraph 2: Those are solid points based on risks to the ongoing maintenance of Hive. That said, most Relational Databases, including MySQL have the option to change collation for given databases. True it may not be global setting (although the collation default is a global setting), but rather a per database setting. I.e. in MySQL you can have one database use a case sensitive collation and another use a case insensitive collation. Perhaps hive-site isn't the place for this, but metadata is (Set it as a per database or table setting?) 3. For the most part, we model functionality in hive based on what mysql does - Because of Point 1, this is moot, but by default, MySQL is case insensitive. 4. Most users do not want to have to heavily test before upgrade, they want consistent behaviour between versions. - This is exactly why I think we need an option for administrators of Hive to be able to set the case sensitivity on database/table level, and the default case sensitivity at a global level. Most users do not test. From a philosophical (data philosopher?) point of view, users will not test, therefore if failure is to happen because of lack of testing, please fail with pomp and circumstance. I.e. Both lines of users (MySQL/MSSQL Migrating to Hive and Oracle/Postgres Migrating to Hive) are going to make assumptions when they write their queries that will cause a failure to happen. If LIKE is Case Sensitive as it is, Oracle/Postgres users will assume correctly and all will be well. MySQL/Users will assume wrong, but they won't KNOW they assumed wrong, they will just assume no results (given a lack of testing). The converse is actually the preferred model: If LIKE is not case sensitive, then Oracle/Postgres users will assume incorrectly, they will run a query, and they will get their expected results, but will also get extra results allowing them to understand they assumed wrong. MySQL/MSSQL users will assume correctly and get their results. Going back to Point 1, the horse has left the building, we can't change default because of what it would do to the current user base, thus the next best option is to allow administrators to set that so users don't have to deal with it, document it so good administrators can handle it out of the gate, and to find a way to do it like '%MysQL%' (i.e. similar to collation settings on databases rather then a global setting as described previously). I do see the challenges with a global setting affecting how a function works; no precedence for that, and thus it could introduce risk, but I still hold that the current risk, especially on a system that is touted as modeled after MySQL (that's how it was explained to me, thus I assumed LIKE was case insensitive), to users is high. As I was typing this novelette, I realized I tossed out an idea related to per database settings. Thoughts on that? That is similar to how MySQL handles it, and thus follows the models mentioned while avoiding a global setting that affects the behavior of a UDF. (The setting doesn't change the UDF, the "collation equivalent" setting on the databases does, and thus the global setting is just the default collation equivalent.
        Hide
        Eric Hanson added a comment -

        There's an effort to add vectorized query execution to Hive, and this JIRA is to speed up the most common cases for LIKE in vectorized mode: https://issues.apache.org/jira/browse/HIVE-4548.

        Making the performance as fast as in the HIVE-4548 JIRA will not be possible if LIKE is case-insensitive. That makes things quite a bit harder. So I am glad the sentiment here is to keep the default behavior case-sensitive.

        Show
        Eric Hanson added a comment - There's an effort to add vectorized query execution to Hive, and this JIRA is to speed up the most common cases for LIKE in vectorized mode: https://issues.apache.org/jira/browse/HIVE-4548 . Making the performance as fast as in the HIVE-4548 JIRA will not be possible if LIKE is case-insensitive. That makes things quite a bit harder. So I am glad the sentiment here is to keep the default behavior case-sensitive.

          People

          • Assignee:
            Mark Grover
            Reporter:
            Mark Grover
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:

              Development