  1. Pig
  2. PIG-1461

support union operation that merges based on column names

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: impl
    • Labels: None
    • Hadoop Flags: Reviewed
      Documentation for UNION ONSCHEMA:

      Use the keyword ONSCHEMA with union so that the union is based on the column names of the input relations rather than on column position.
      If any of the following requirements is not met, the statement throws an error:

          * All inputs to the union must have a non-null schema.
          * Columns that share a name across input schemas must have compatible data types. Numeric types are compatible with one another; if same-named columns have different numeric types, an implicit conversion is applied. The bytearray type is compatible with every other type; a cast is added to convert it to the other type. Bags or tuples with different inner schemas are incompatible.
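
The compatibility rules above can be modeled with a small Python sketch. This is an illustration only, not the code from the patch: the names `merge_types`, `merge_schemas` and `IncompatibleSchemaError` are hypothetical, and the widening order int, long, float, double is an assumption based on Pig's usual numeric promotion.

```python
# Widening order for numeric types. This is an assumption modeled on
# Pig's usual int -> long -> float -> double promotion, not taken
# from the patch itself.
NUMERIC_ORDER = ["int", "long", "float", "double"]


class IncompatibleSchemaError(Exception):
    """Raised when the inputs cannot be merged by column name."""


def merge_types(left, right):
    """Return the merged type for two columns that share a name."""
    if left == right:
        return left
    # bytearray is compatible with every type: the other type wins,
    # and a cast would be added on the bytearray side.
    if left == "bytearray":
        return right
    if right == "bytearray":
        return left
    # Two different numeric types: convert to the wider one.
    if left in NUMERIC_ORDER and right in NUMERIC_ORDER:
        return max(left, right, key=NUMERIC_ORDER.index)
    # Anything else is rejected, including bags or tuples whose
    # inner schemas differ (their type strings would not be equal).
    raise IncompatibleSchemaError(f"cannot merge {left} with {right}")


def merge_schemas(*schemas):
    """Merge schemas, each a list of (name, type) pairs, by column name."""
    merged = {}
    for schema in schemas:
        if schema is None:
            raise IncompatibleSchemaError("every input needs a non-null schema")
        for name, col_type in schema:
            if name in merged:
                merged[name] = merge_types(merged[name], col_type)
            else:
                merged[name] = col_type
    # dicts keep insertion order, so columns appear in order of first use.
    return list(merged.items())


print(merge_schemas([("a", "int"), ("b", "float")],
                    [("a", "long"), ("c", "chararray")]))
```

With the two schemas shown, the sketch yields a: long, b: float, c: chararray. A null schema or a chararray/int clash raises the hypothetical `IncompatibleSchemaError`.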

      Example -

      grunt> L1 = load 'f1' as (a : int, b : float);
      grunt> dump L1;
      (11,12.0)
      (21,22.0)

      grunt> L2 = load 'f2' as (a : long, c : chararray);
      grunt> dump L2;
      (11,a)
      (12,b)
      (13,c)

      grunt> U = union onschema L1, L2;
      grunt> describe U ;
      U : {a : long, b : float, c : chararray}

      grunt> dump U;
      (11,12.0,)
      (21,22.0,)
      (11,,a)
      (12,,b)
      (13,,c)
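
The row-level effect of the example can be reproduced with a minimal Python model. It is a sketch under assumptions, not Pig's implementation: `union_onschema` is a hypothetical helper, relations are plain (columns, rows) pairs, nulls are represented as `None`, and the type conversion step is left out. Pig does not guarantee the order of rows in a union; the sketch simply keeps input order.

```python
def union_onschema(*relations):
    """Union relations by column name.

    Each relation is a (columns, rows) pair: a list of column names
    and a list of tuples laid out in that column order.
    """
    # Output columns: every distinct name, in order of first appearance.
    out_columns = []
    for columns, _ in relations:
        for name in columns:
            if name not in out_columns:
                out_columns.append(name)

    out_rows = []
    for columns, rows in relations:
        for row in rows:
            by_name = dict(zip(columns, row))
            # Columns that this input lacks become None (Pig's null).
            out_rows.append(tuple(by_name.get(name) for name in out_columns))
    return out_columns, out_rows


L1 = (["a", "b"], [(11, 12.0), (21, 22.0)])
L2 = (["a", "c"], [(11, "a"), (12, "b"), (13, "c")])

columns, rows = union_onschema(L1, L2)
print(columns)
for row in rows:
    print(row)
```

Because the helper takes any number of relations, the same sketch covers unions of more than two inputs.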

      Note:
      - Aliases such as 'nm::c1' and 'c1' in two separate relations passed to 'union onschema' are considered mergeable; in the schema of the union, the merged column alias is 'c1'.
      - Aliases such as 'nm1::c1' and 'nm2::c1' in two separate relations passed to 'union onschema' are not merged; the schema of the union contains two columns with these names.

      Example -

      > describe f;
      f: {l1::a: int, l1::b: int, l1::c: int}
      > describe l1;
      l1: {a: int, b: int}

      > u = union onschema f,l1;
      > describe u;
      u: {a: int, b: int, l1::c: int}
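
The alias rules in the note can be sketched in Python as follows. The helper name `merged_aliases` is hypothetical, and the sketch assumes that a qualified alias is shortened only when some input carries the same name unqualified. Cases the note does not cover, such as 'nm1::c1', 'nm2::c1' and 'c1' all appearing together, are not modeled faithfully.

```python
def merged_aliases(*schemas):
    """Return the output column aliases for a union by column name.

    Each schema is a list of aliases, optionally qualified as 'ns::name'.
    """
    # Unqualified aliases seen in any input, e.g. 'a' but not 'l1::a'.
    unqualified = {
        alias for schema in schemas for alias in schema if "::" not in alias
    }

    result = []
    for schema in schemas:
        for alias in schema:
            short = alias.rsplit("::", 1)[-1]
            # 'nm::c1' merges into a plain 'c1' found in the inputs;
            # otherwise the qualified alias is kept unchanged.
            name = short if short in unqualified else alias
            if name not in result:
                result.append(name)
    return result


f = ["l1::a", "l1::b", "l1::c"]
l1 = ["a", "b"]
print(merged_aliases(f, l1))
```

For the relations f and l1 from the example this gives a, b and l1::c, matching the described schema of u. Two inputs holding only 'nm1::c1' and 'nm2::c1' keep both columns.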

      Like the default union, 'union onschema' also supports 2 or more inputs.

    Description

      When the data has a schema, it often makes sense to union on the column names in the schema rather than on the position of the columns.
      The behavior of the existing union operator should remain backward compatible.

      This feature can be supported either with a new operator or by extending union to support a 'using' clause. I am thinking of a new operator called either unionschema or merge. Does anybody have other suggestions for the syntax?

      example -

      L1 = load 'x' as (a,b);
      L2 = load 'y' as (b,c);
      U = unionschema L1, L2;

      describe U;
      U: {a: bytearray, b: bytearray, c: bytearray}

      Attachments

        1. PIG-1461.1.patch
          31 kB
          Thejas Nair
        2. PIG-1461.2.patch
          32 kB
          Thejas Nair
        3. PIG-1461.patch
          29 kB
          Thejas Nair

          People

            Assignee: Thejas Nair
            Reporter: Thejas Nair