Pig / PIG-1536

use same logic for merging inner schemas in "default union" and "union onschema"

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      The behavior of union and union onschema converges after this patch:
      1. A union of relations with different numbers of fields results in a null schema (default union only):
      A: (a1:long, a2:long)
      B: (b1:long, b2:long, b3:long)
      A union B: null

      2. A union of columns of incompatible types results in a bytearray column:
      A: (a1:long, a2:long)
      B: (b1:(b11:long, b12:long), b2:long)
      A union B: (a1:bytearray, a2:long)

      3. A union of columns of compatible types escalates to the higher-priority type. The priority, from highest to lowest, is chararray -> double -> float -> long -> int -> bytearray:
      A: (a1:int, a2:double, a3:int)
      B: (b1:float, b2:chararray, b3:bytearray)
      A union B: (a1:float, a2:chararray, a3:int)

      4. A union of complex columns with different inner types results in an empty complex type:
      A: (a1:(a11:long, a12:int), a2:{(a21:chararray, a22:int)})
      B: (b1:(b11:int, b12:int), b2:{(b21:int, b22:int)})
      A union B: (a1:(), a2:{()})

      5. The unioned relation always takes its field aliases from the first relation.
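
The five rules above can be modelled in a few lines of Python. This is only an illustrative sketch of the release note, not Pig's implementation: the names `SCALAR_PRIORITY`, `merge_type` and `merge_schema` are hypothetical, a schema is represented as a list of `(alias, type)` pairs, and complex types are written as `("tuple", fields)` or `("bag", fields)`. Two points are assumptions that the release note does not spell out: a tuple merged with a bag is treated as incompatible, and inner schemas count as "the same" when their field types match position by position.

```python
# Scalar types from lowest to highest priority, as listed in rule 3.
SCALAR_PRIORITY = ["bytearray", "int", "long", "float", "double", "chararray"]


def strip_aliases(fields):
    """Return only the types of a field list, recursing into complex types."""
    return [t if isinstance(t, str) else (t[0], strip_aliases(t[1]))
            for _, t in fields]


def merge_type(t1, t2):
    """Merge two field types following rules 2 to 4."""
    scalar1, scalar2 = isinstance(t1, str), isinstance(t2, str)
    if scalar1 and scalar2:
        # Rule 3: escalate to the higher-priority scalar type.
        return max(t1, t2, key=SCALAR_PRIORITY.index)
    if scalar1 or scalar2:
        # Rule 2: scalar vs. complex is incompatible.
        return "bytearray"
    (kind1, inner1), (kind2, inner2) = t1, t2
    if kind1 != kind2:
        # Assumption: tuple vs. bag is also treated as incompatible.
        return "bytearray"
    if strip_aliases(inner1) == strip_aliases(inner2):
        # Same inner types: keep the first relation's inner schema (rule 5).
        return t1
    # Rule 4: different inner types give an empty complex type.
    return (kind1, [])


def merge_schema(a, b):
    """Merge two relation schemas by position, as in a default union."""
    if len(a) != len(b):
        # Rule 1: different sizes give a null schema.
        return None
    # Rule 5: aliases always come from the first relation.
    return [(alias, merge_type(ta, tb)) for (alias, ta), (_, tb) in zip(a, b)]
```

Applied to example 3, `merge_schema([("a1", "int"), ("a2", "double"), ("a3", "int")], [("b1", "float"), ("b2", "chararray"), ("b3", "bytearray")])` yields the aliases `a1`, `a2`, `a3` with types `float`, `chararray` and `int`, matching the release note.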

      Description

      We should consider using the same logic for merging inner schemas in the two types of union.

      'Default union' merges the inner schemas of bags/tuples by position if the number of fields is the same and the corresponding types are compatible.

      'Union onschema' treats tuples/bags with different inner schemas as incompatible types.
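
The difference between the two behaviors described above can be sketched as two predicates. This is a hedged illustration only: `default_union_can_merge`, `onschema_can_merge` and `compatible` are hypothetical names, inner schemas are reduced to lists of type names, and the `compatible` function is a stand-in that treats numeric types as mutually compatible (an assumption, since the description does not define compatibility).

```python
NUMERIC = {"int", "long", "float", "double"}


def compatible(t1, t2):
    """Stand-in compatibility check (assumption): equal or both numeric."""
    return t1 == t2 or (t1 in NUMERIC and t2 in NUMERIC)


def default_union_can_merge(inner_a, inner_b):
    """'Default union': merge by position when sizes match and types are compatible."""
    return len(inner_a) == len(inner_b) and all(
        compatible(ta, tb) for ta, tb in zip(inner_a, inner_b)
    )


def onschema_can_merge(inner_a, inner_b):
    """'Union onschema': any difference in the inner schema counts as incompatible."""
    return inner_a == inner_b


# Inner schemas (int, long) and (int, int): the two unions disagree.
print(default_union_can_merge(["int", "long"], ["int", "int"]))
print(onschema_can_merge(["int", "long"], ["int", "int"]))
```

Under these assumptions the first call reports the inner schemas as mergeable and the second reports them as incompatible, which is the inconsistency this issue asks to remove.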

      1. PIG-1536-4.patch
        31 kB
        Daniel Dai
      2. PIG-1536-3.patch
        31 kB
        Daniel Dai
      3. PIG-1536-2.patch
        32 kB
        Daniel Dai
      4. PIG-1536-1.patch
        31 kB
        Daniel Dai

        Activity

        Thejas M Nair added a comment -

        The way 'default union' deals with columns of different but compatible types in the same position is not right. It creates a merged schema with a merged type, but no cast is applied to convert the rows to this type.
        For example:

        grunt> l1 = load '/tmp/f1' as (a : chararray, t (a : int, c : long) );
        grunt> l2 = load '/tmp/f1' as (a : chararray, t (a : int, b : int) ); 
        grunt> u = union l1, l2;                                              
        grunt> describe u;                                                    
        u: {a: chararray,t: (a: int,c: long)}
        
        -- In the result of u, only the rows originating from l1 correspond to the schema shown by describe.
        
        MapReduce node 1-206
        Map Plan
        u: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-203
        |
        |---u: Union[bag] - 1-202
            |
            |---l1: New For Each(false,false)[bag] - 1-195
            |   |   |
            |   |   Cast[chararray] - 1-192
            |   |   |
            |   |   |---Project[bytearray][0] - 1-191
            |   |   |
            |   |   Cast[tuple:(int,long)] - 1-194
            |   |   |
            |   |   |---Project[bytearray][1] - 1-193
            |   |
            |   |---l1: Load(/tmp/f1:org.apache.pig.builtin.PigStorage) - 1-190
            |
            |---l2: New For Each(false,false)[bag] - 1-201
                |   |
                |   Cast[chararray] - 1-198
                |   |
                |   |---Project[bytearray][0] - 1-197
                |   |
                |   Cast[tuple:(int,int)] - 1-200
                |   |
                |   |---Project[bytearray][1] - 1-199
                |
                |---l2: Load(/tmp/f1:org.apache.pig.builtin.PigStorage) - 1-196--------
        Global sort: false
        ----------------
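
The effect of the missing cast can be illustrated with a small Python model of the plan above. This is a sketch under stated assumptions, not Pig code: the helpers `load`, `row_types` and `cast` are hypothetical, each value carries an explicit type tag to stand in for Pig's runtime types, and the last step shows one possible remedy (casting every branch to the merged type before the union) without claiming that the patch works this way.

```python
def load(rows, types):
    """Mimic the per-branch Cast in the plan: tag each value with its declared type."""
    return [tuple((t, v) for t, v in zip(types, row)) for row in rows]


def row_types(row):
    """Return the type tags of a row."""
    return tuple(t for t, _ in row)


def cast(rows, types):
    """Re-tag every value with the target type (stand-in for a real cast)."""
    return [tuple((t, v) for t, (_, v) in zip(types, row)) for row in rows]


declared = ("int", "long")             # inner schema of t reported by `describe u`
l1 = load([(1, 2)], ("int", "long"))   # branch cast to tuple:(int,long)
l2 = load([(1, 2)], ("int", "int"))    # branch cast to tuple:(int,int)

# Union as in the plan: rows are concatenated with no cast to the merged type.
u = l1 + l2
mismatched = [row for row in u if row_types(row) != declared]

# One possible remedy: cast both branches to the merged type first.
u_fixed = cast(l1, declared) + cast(l2, declared)
```

In this model `mismatched` holds exactly the row that came from `l2`, while every row of `u_fixed` matches the declared schema.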
        
        
        Daniel Dai added a comment -

        PIG-1536-2.patch addresses Thejas's review comment.

        Thejas M Nair added a comment -

        +1

        Daniel Dai added a comment -

        PIG-1536-3.patch resyncs with trunk.

        Daniel Dai added a comment -

        Patch committed to trunk.

        Daniel Dai added a comment -

        Review notes: https://reviews.apache.org/r/387/

          People

          • Assignee:
            Daniel Dai
            Reporter:
            Thejas M Nair
          • Votes:
            0
            Watchers:
            0
