Hive / HIVE-1287

Struct datatype should not use field names for type equivalence.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None
    • Environment:

      Mac OS X (10.6.2) Java SE 6 ( 1.6.0_17)

      Description

      The field names for Struct types are currently being matched for testing type equivalence. This is readily seen by running the following example:

      hive> create table source ( foo struct < x : string > );
      OK
      Time taken: 3.094 seconds
      
      hive> load data local inpath '/path/to/sample/data.txt' overwrite into table source;
      Copying data from file:/path/to/sample/data.txt
      Loading data to table source
      OK
      Time taken: 0.593 seconds
      
      hive> create table sink ( bar struct < y : string >);
      OK
      Time taken: 0.11 seconds
      
      hive> insert overwrite table sink select foo from source;
      FAILED: Error in semantic analysis: line 1:23 Cannot insert into target table 
      because column number/types are different sink: Cannot convert column 0 
      from struct<x:string> to struct<y:string>.
      
      

      Since source.foo and sink.bar have the same definition apart from their field names, data movement between the two should be allowed.

        Activity

        Arvind Prabhakar created issue -
        Zheng Shao added a comment -

        I think we should support the following query:

        insert overwrite table sink select CAST(foo AS struct<y: string>) from source;
        

        This is better than converting them implicitly, because the conversion can be ambiguous: there are 2 ways to convert from struct<x: string, y: string> to struct<y: string, x: string>, and Hive would be silently picking one of them.
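        To make the ambiguity concrete, here is a minimal Python sketch. It is not Hive code: the struct model (an ordered list of name/type pairs) and the helper names convert_by_position and convert_by_name are assumptions made for illustration, and the comment above does not say which two conversions were meant. The sketch assumes they are matching fields by position and matching fields by name.

```python
# Hypothetical model (not Hive's API): a struct type is an ordered list of
# (field_name, type_name) pairs; a struct value is a tuple in the same order.
SOURCE = [("x", "string"), ("y", "string")]
TARGET = [("y", "string"), ("x", "string")]


def convert_by_position(value, source, target):
    """The i-th source field becomes the i-th target field; names are ignored."""
    if [t for _, t in source] != [t for _, t in target]:
        raise TypeError("field types differ position by position")
    return tuple(value)


def convert_by_name(value, source, target):
    """Each target field is filled from the source field with the same name."""
    by_name = {name: v for (name, _), v in zip(source, value)}
    return tuple(by_name[name] for name, _ in target)


row = ("left", "right")  # x = "left", y = "right"
positional = convert_by_position(row, SOURCE, TARGET)
named = convert_by_name(row, SOURCE, TARGET)
print(positional, named)  # the two strategies disagree on this row
```

        Both conversions are type-correct, yet they store different data in the target, which is why an implicit choice between them is risky.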

        Arvind Prabhakar added a comment -

        Thanks for your comment Zheng.

        I can see how the CAST would work, but I believe that we need stronger type-checking semantics. Traditionally, a CAST is used to bypass compile-time checks. While this is a very powerful concept, it can lead to data corruption if not used with caution.

        An alternative to the CAST approach would be to use compile-time type checking without regard to the field names. This is similar to function signatures in, say, Java, where it does not matter what the parameter names are, as long as they are specified in the correct order. This can be achieved by thinking of field names as aliases for the datatypes of those fields.

        For example - the columns defined as struct < a : string > and struct < b : string > are type-equivalent because they are both of the type struct < ? : string >.
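        The name-insensitive equivalence described above could be sketched as follows. This is an illustrative Python model, not Hive's type system: the representation (primitives as strings, structs as ("struct", [(name, type), ...])) and the function names erase_names and equivalent_ignoring_names are assumptions.

```python
def erase_names(t):
    """Return a type description with struct field names removed.

    Assumed model: a primitive type is a string such as "string"; a struct is
    ("struct", [(field_name, field_type), ...]) and may be nested.
    """
    if isinstance(t, str):
        return t
    kind, fields = t
    if kind != "struct":
        raise ValueError("unsupported type kind: " + str(kind))
    # Keep field order and field types, drop the names.
    return ("struct", tuple(erase_names(field_type) for _, field_type in fields))


def equivalent_ignoring_names(a, b):
    """Two types are equivalent if they match once field names are erased."""
    return erase_names(a) == erase_names(b)


a = ("struct", [("a", "string")])
b = ("struct", [("b", "string")])
c = ("struct", [("a", "int")])
print(equivalent_ignoring_names(a, b))  # same shape: struct < ? : string >
print(equivalent_ignoring_names(a, c))  # field types differ
```

        Field order still matters in this sketch, just as parameter order matters in a function signature.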

        Zheng Shao added a comment -

        > Traditionally, a CAST is used to bypass compile time checks. While this is very powerful concept, it can lead to data corruption if not used with caution.

        The type equivalence semantics you describe are weaker than the current ones, and can also lead to data corruption if not used with caution.
        Asking users to use "CAST" is safer than implicitly treating struct<a:string> and struct<b:string> as the same type.

        Does that make sense?

        Arvind Prabhakar added a comment -

        I think I understand your point of view. Let me explain mine:

        Right now there is no consistent type checking. What we have is implicit type conversion where possible, such as converting a struct to a string but not the other way around. In other places this implicit type conversion leads to an internal error. In the case of struct-to-struct conversion, however, the check is strict about field names. This is not consistent.

        My suggestion is to provide type equivalence semantics within the query language framework. Doing this will help in the following ways:

        • Implicit type conversion would not be allowed and would require explicit CAST to convert to another type.
        • The query compiler would ensure that the data types are equivalent and therefore allow data to flow without having to invoke any UDF for every row. This should help us gain performance relative to the current approach.
        • Providing type equivalence checks will also be fundamental to building higher-level UD*Fs which would otherwise have to deal with cast semantics.
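        The three points above can be read as a compile-time decision rule. The Python sketch below is one possible reading of the proposal, not existing Hive behavior: the function plan_insert, its explicit_cast flag, and the returned labels "direct" and "cast" are all hypothetical.

```python
def shape(t):
    """Type description with field names erased.

    Assumed model: a primitive is a string; a struct is a list of
    (field_name, field_type) pairs.
    """
    if isinstance(t, str):
        return t
    return tuple(shape(field_type) for _, field_type in t)


def plan_insert(source_type, target_type, explicit_cast=False):
    """Decide at compile time how a column may flow into the target."""
    if shape(source_type) == shape(target_type):
        # Equivalent types: rows pass through with no per-row conversion.
        return "direct"
    if explicit_cast:
        # Different types are accepted only when the user asked for a CAST.
        return "cast"
    raise TypeError("types are not equivalent; an explicit CAST is required")


foo = [("x", "string")]
bar = [("y", "string")]
print(plan_insert(foo, bar))                           # direct
print(plan_insert(foo, "string", explicit_cast=True))  # cast
```

        Under this rule the insert from source.foo into sink.bar would compile without a cast, while a struct-to-string conversion would be rejected unless the query requests it explicitly.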

          People

          • Assignee:
            Unassigned
            Reporter:
            Arvind Prabhakar
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
