Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      When the union type was introduced, full support for it wasn't provided. For instance, passing a union to LazyBinarySerDe fails with:

      Caused by: java.lang.RuntimeException: Unrecognized type: UNION
      	at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:468)
      	at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:230)
      	at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:184)
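
The failure mode in the trace can be illustrated with a simplified model. This is a sketch, not Hive's actual source: the `Category` values mirror those of `ObjectInspector.Category` as far as I know, while `SerializeSketch`, its method names, and the returned strings are hypothetical placeholders. It shows how a dispatcher written before UNION existed falls through to the default branch, and what adding the missing branch looks like.

```java
// Simplified model, not Hive's real code. Category is assumed to mirror
// ObjectInspector.Category; everything else here is hypothetical.
enum Category { PRIMITIVE, LIST, MAP, STRUCT, UNION }

final class SerializeSketch {
    // A dispatcher with no UNION branch: a union falls through to default.
    static String serializeWithoutUnion(Category category) {
        switch (category) {
            case PRIMITIVE: return "primitive bytes";
            case LIST:      return "list bytes";
            case MAP:       return "map bytes";
            case STRUCT:    return "struct bytes";
            default:
                throw new RuntimeException("Unrecognized type: " + category);
        }
    }

    // The same dispatcher with the missing branch handled first.
    // The wire layout described in the string is an assumption for illustration.
    static String serializeWithUnion(Category category) {
        if (category == Category.UNION) {
            return "tag byte + bytes of the active member";
        }
        return serializeWithoutUnion(category);
    }
}
```

Every serde that switches on the category would need an equivalent branch, which is why the gap shows up per serde rather than in one place.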
      

          Activity

          Jakob Homan added a comment -

          Part of the problem is that the term union has been overloaded. In SQL it means the actual set union of two compatible data types, whereas in Avro and programming languages it means one value that can be, at any one time, an instance of one of two or more different types. Union was added as a full-on, first-class type by its inclusion in ObjectInspector's Category enum. Is there any reason not to expand this use to be more along the lines of programming languages' take on unions? If so, it should be marked as not really being a first-class type. If not, support for unions in all the serdes, in the grammar and in the documentation should be provided.

          I would lobby for expanding its support as it's an important type in Avro and we're quite hobbled by the inability to manipulate unioned values. (Avro handles nullable values by unioning them with their type T and null, but Haivvreo transparently converts these just to the type and returns null where appropriate. The problem lies in actual unions of non-null types, which are less frequent but still valid.)
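
To make the programming-language sense of union concrete, here is a minimal sketch of a tagged value plus the nullable-unwrapping behaviour described above. All names (`TaggedUnion`, `NullableUnwrapSketch`, `unwrapNullable`) are hypothetical and chosen for illustration; this is not Haivvreo's or Hive's code, and it assumes the position of the null branch is known from the schema.

```java
// A value that holds exactly one of several member types at a time.
// The tag records which member is currently active.
final class TaggedUnion {
    private final byte tag;      // index of the active member type
    private final Object value;  // the value of that member

    TaggedUnion(byte tag, Object value) {
        this.tag = tag;
        this.value = value;
    }

    byte getTag() { return tag; }
    Object getObject() { return value; }
}

final class NullableUnwrapSketch {
    // For a two-branch union of null and T: hand back the bare T value,
    // or null when the null branch is the active one.
    static Object unwrapNullable(TaggedUnion u, byte nullTag) {
        return u.getTag() == nullTag ? null : u.getObject();
    }
}
```

This shortcut only works for the null-plus-one-type case. A union of two non-null types has no single target type to unwrap to, so the tag has to stay visible to the query layer.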

          Jakob Homan added a comment -

          Changing name of JIRA to be more representative of what needs to be done. If reaction is positive, will open subtasks for individual items.

          Amareshwari Sriramadasu added a comment -

          +1. I agree that when the union type was added, complete support for it was not added. We should extend its usage to all the serdes.

          "Part of the problem is that the term union has been overloaded."

          The type is called 'uniontype' in Hive to resolve ambiguities.

          Navis added a comment -

          HIVE-4765 included LazyBinaryUnion type. Could you check that?

          chewie added a comment -

          I wanted to check on the current status. Is there an ETA for resolution? I can assure you there are quite a few efforts that need to qualify on data within uniontypes in Hive (Impala, etc.) as soon as possible. I've been informed that my effort will not accept uniontype usage (with more than one non-null type) unless there is built-in Hive support (which is very unfortunate, but not without point), meaning the types have to be split into separate fields. That is obviously less semantically correct, more clunky (in the Avro model and Java), and provides no benefit other than a workaround for clean queryability.

          Something else that needs to be addressed is how to reference nested fields, structs, etc. in the query. Currently '.' (period) is used; can this be kept for unions? Ambiguity can arise if more than one member type has the same field; in all other cases the reference is implicitly unambiguous. This could actually be validated before query execution. When more than one type could have the same field, what would the syntax be? Possibly:

          unionobject.object.[2]unionobject.unionobject.[1]unionobject.object.....
          

          In the above example, any ambiguous object type being referenced can be qualified by the int value of the type in square brackets [].
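
As a rough illustration of how such a path could be read, the sketch below splits a dotted path and picks up an optional `[n]` tag qualifier in front of each segment. The syntax is only the proposal above, not anything Hive accepts, and `UnionPathSketch`, `Step`, and `parse` are hypothetical names.

```java
final class UnionPathSketch {
    // One step of a dotted path: a field name, optionally qualified by a union tag.
    static final class Step {
        final int tag;       // -1 when the segment carries no [n] qualifier
        final String field;

        Step(int tag, String field) {
            this.tag = tag;
            this.field = field;
        }
    }

    // Splits on '.', then reads a leading "[n]" on each segment if present.
    static java.util.List<Step> parse(String path) {
        java.util.List<Step> steps = new java.util.ArrayList<>();
        for (String segment : path.split("\\.")) {
            int tag = -1;
            String field = segment;
            if (segment.startsWith("[")) {
                int close = segment.indexOf(']');
                if (close < 0) {
                    throw new IllegalArgumentException("Unclosed '[' in segment: " + segment);
                }
                tag = Integer.parseInt(segment.substring(1, close));
                field = segment.substring(close + 1);
            }
            if (field.isEmpty()) {
                throw new IllegalArgumentException("Missing field name in segment: " + segment);
            }
            steps.add(new Step(tag, field));
        }
        return steps;
    }
}
```

With the steps parsed up front, a validator could check before execution whether each unqualified step resolves to exactly one member type, and require a tag only where it does not.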


            People

            • Assignee: Mohammad Kamrul Islam
            • Reporter: Jakob Homan
            • Votes: 3
            • Watchers: 9
