Pig / PIG-2537

Output from flatten with a null tuple input generating data inconsistent with the schema

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0, 0.9.0
    • Fix Version/s: 0.13.0
    • Component/s: impl
    • Labels:
      None

      Description

      For the following Pig script:

      grunt> A = load 'file' as ( a : tuple( x, y, z ), b, c );
      grunt> B = foreach A generate flatten( $0 ), b, c;
      grunt> describe B;
      B: {a::x: bytearray,a::y: bytearray,a::z: bytearray,b: bytearray,c: bytearray}

      Alias B has a clear schema.

      However, on the backend, if $0 happens to be null for a row, the output tuple becomes something like
      (null, b_value, c_value), which is inconsistent with the schema. This behaviour was confirmed by inspecting the Pig code.

      This inconsistency corrupts data because the fields after the flattened tuple shift position. The expected output row is something like
      (null, null, null, b_value, c_value).
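The position shift described above can be illustrated with a small plain-Python model. This is only a sketch of the reported behaviour, not Pig's actual implementation; the function name `flatten_current` and the sample rows are hypothetical.

```python
def flatten_current(row):
    """Model of the reported behaviour: flatten the first field of a row.

    A null (None) tuple is emitted as a single null instead of one null
    per declared field, so the remaining fields shift left.
    """
    first, *rest = row
    if first is None:
        return (None, *rest)
    return (*first, *rest)


# Hypothetical sample rows matching the schema (a: tuple(x, y, z), b, c).
rows = [
    (("x1", "y1", "z1"), "b1", "c1"),
    (None, "b2", "c2"),
]

for row in rows:
    out = flatten_current(row)
    # The schema promises 5 fields; the null-tuple row yields only 3.
    print(len(out), out)
```

In this model the second row comes out with three fields, so `b2` lands in the position the schema assigns to `a::y`.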

      1. PIG-2537-1.patch
        17 kB
        Daniel Dai
      2. PIG-2537-2.patch
        27 kB
        Daniel Dai
      3. PIG-2537-3.patch
        44 kB
        Daniel Dai

        Activity

        Daniel Dai added a comment -

        When a schema is given, we should certainly read the data according to that schema. Here, what we should read is: ((null, null, null), b_value, c_value).

        Daniel Dai added a comment -

        PIG-2537-2.patch fixes unit test failures.

        Daniel Dai added a comment -

        PIG-2537-3.patch fixes another unit test failure.

        Daniel Dai added a comment -

        Discussed with Thejas and Alan; we might need to research more to find the best way to solve the problem. Unlinking from 0.10.

        Thejas M Nair added a comment -

        Thoughts on the solution: Pig should continue to allow and expect null values for objects such as tuples. I think the problem needs to be solved in flatten, as it is the operator that promises a certain schema and fails to generate data of that schema if the value is null. But this means that flatten needs to be aware of the expected schema for the tuples/bags at run time, i.e. the schema would need to be serialized and sent to the backend. That change would also not be backward compatible.
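As a rough illustration of the idea in this comment, the plain-Python sketch below models a flatten that knows the expected number of fields at run time. It is not taken from any of the attached patches; `flatten_with_arity` and its `arity` parameter are hypothetical stand-ins for whatever schema information would be serialized to the backend.

```python
def flatten_with_arity(row, arity):
    """Flatten the first field of a row, padding a null tuple.

    `arity` is a hypothetical stand-in for the serialized schema: the
    number of fields the flattened tuple is declared to have.
    """
    first, *rest = row
    if first is None:
        # Emit one null per declared field so later fields keep their position.
        first = (None,) * arity
    return (*first, *rest)


# Schema (a: tuple(x, y, z), b, c) gives arity 3 for the flattened tuple.
print(flatten_with_arity((None, "b_value", "c_value"), 3))
print(flatten_with_arity((("x", "y", "z"), "b_value", "c_value"), 3))
```

With the arity available, both rows come out with five fields, which matches the expected output given in the description.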

        Julien Le Dem added a comment -

        It does not seem that this will make it into 0.11 either.

        Peter Connolly added a comment -

        As a workaround, I'm able to move the FLATTEN operator to the rightmost column and then run a second generate on all of the fields to fix this problem. I'm only dealing with two columns in the tuple, so I'm not sure it will work with more columns.

        Using the example above, it might look something like this:
        grunt> A = load 'file' as ( a : tuple( x, y, z ), b, c );
        --B will have a variable number of null columns on the right side, but columns b and c will be correct
        grunt> B = foreach A generate b, c, flatten( $0 ) AS (x,y,z);
        --Running another foreach inserts null values for the extra columns
        grunt> C = foreach B generate b,c,x,y,z;
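Why this workaround can help is easier to see in a plain-Python model. The sketch assumes, as the comment describes, that a second foreach yields null for any field missing from a short row; that assumption is not verified against Pig here, and the names `flatten_rightmost` and `project` are hypothetical.

```python
def flatten_rightmost(row):
    """Model of `generate b, c, flatten(a)` with the reported null behaviour."""
    a, b, c = row
    if a is None:
        # A null tuple becomes a single null, so the row is short on the right.
        return (b, c, None)
    return (b, c, *a)


def project(row, width):
    """Model of the second foreach: read `width` fields, null where missing.

    Assumption: fields past the end of a short row come back as null.
    """
    return tuple(row[i] if i < len(row) else None for i in range(width))


short = flatten_rightmost((None, "b_value", "c_value"))
print(short)              # b and c stay in positions 0 and 1
print(project(short, 5))  # padded out to the five declared fields
```

Because the flattened fields come last, a short row only loses trailing positions, so `b` and `c` are never displaced.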


          People

          • Assignee:
            Daniel Dai
            Reporter:
            Xuefu Zhang
          • Votes:
            3
            Watchers:
            5
