Pig / PIG-5404

FLATTEN infers wrong datatype

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.18.0
    • Component/s: impl
    • Hadoop Flags: Reviewed
    • Flags: Important

    Description

      In version 0.12 (checked out from branch-0.12), the script below works as expected, given the following input file test.csv:

      John_5,18,4.0F
      Mary_6,19,3.8F
      Bill_7,20,3.9F
      Joe_8,18,3.8F

      The script:

      A = LOAD 'test.csv' USING PigStorage (',') AS (name:chararray,age:int,gpr:float);
      B = FOREACH A GENERATE FLATTEN(STRSPLIT(name,'_')) as (name1:chararray,name2:chararray),age,gpr;
      DESCRIBE B;

      It produces the following output:

      B: {name1: chararray,name2: chararray,age: int,gpr: float}
      

      This is the expected output, because the AS clause declares the FLATTEN results as chararray.

      With version 0.17 (checked out from branch-0.17), the same script produces:

      B: {name1: bytearray,name2: bytearray,age: int,gpr: float}
      

      This shows that FLATTEN somehow inferred the wrong data types (bytearray instead of chararray).

      Using explicit casting as a workaround on 0.17:

      B1 = FOREACH B GENERATE (chararray)name1,(chararray)name2,age,gpr;
      DESCRIBE B1;

      produces:

      B1: {name1: chararray,name2: chararray,age: int,gpr: float}
      

      This time the data types are as expected.

      The EXPLAIN plan shows some strange Cast operators that are not actually applied (or at least the resulting data types are wrong):

      #-----------------------------------------------
      # New Logical Plan:
      #-----------------------------------------------
      B: (Name: LOStore Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
      |
      |---B: (Name: LOForEach Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
          |   |
          |   (Name: LOGenerate[false,false,false,false] Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)ColumnPrune:OutputUids=[121, 105, 122, 106]ColumnPrune:InputUids=[121, 105, 122, 106]
          |   |   |
          |   |   (Name: Cast Type: chararray Uid: 121)
          |   |   |
          |   |   |---name1:(Name: Project Type: bytearray Uid: 121 Input: 0 Column: 0)
          |   |   |
          |   |   (Name: Cast Type: chararray Uid: 122)
          |   |   |
          |   |   |---name2:(Name: Project Type: bytearray Uid: 122 Input: 1 Column: 0)
          |   |   |
          |   |   age:(Name: Project Type: int Uid: 105 Input: 2 Column: 0)
          |   |   |
          |   |   gpr:(Name: Project Type: float Uid: 106 Input: 3 Column: 0)
          |   |
          |   |---(Name: LOInnerLoad[0] Schema: name1#121:bytearray)
          |   |
          |   |---(Name: LOInnerLoad[1] Schema: name2#122:bytearray)
          |   |
          |   |---(Name: LOInnerLoad[2] Schema: age#105:int)
          |   |
          |   |---(Name: LOInnerLoad[3] Schema: gpr#106:float)
          |
          |---B: (Name: LOForEach Schema: name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
              |   |
              |   (Name: LOGenerate[true,false,false] Schema: name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
              |   |   |
              |   |   (Name: UserFunc(org.apache.pig.builtin.STRSPLIT) Type: tuple Uid: 132)
              |   |   |
              |   |   |---(Name: Cast Type: chararray Uid: 104)
              |   |   |   |
              |   |   |   |---name:(Name: Project Type: bytearray Uid: 104 Input: 0 Column: (*))
              |   |   |
              |   |   |---(Name: Constant Type: chararray Uid: 131)
              |   |   |
              |   |   (Name: Cast Type: int Uid: 105)
              |   |   |
              |   |   |---age:(Name: Project Type: bytearray Uid: 105 Input: 1 Column: (*))
              |   |   |
              |   |   (Name: Cast Type: float Uid: 106)
              |   |   |
              |   |   |---gpr:(Name: Project Type: bytearray Uid: 106 Input: 2 Column: (*))
              |   |
              |   |---(Name: LOInnerLoad[0] Schema: name#104:bytearray)
              |   |
              |   |---(Name: LOInnerLoad[1] Schema: age#105:bytearray)
              |   |
              |   |---(Name: LOInnerLoad[2] Schema: gpr#106:bytearray)
              |
              |---A: (Name: LOLoad Schema: name#104:bytearray,age#105:bytearray,gpr#106:bytearray)RequiredFields:null
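
For convenience, the snippets in this report can be collected into a single script. This is only a sketch assembled from the statements quoted in the description: the file name repro.pig is hypothetical, test.csv is assumed to be in the working directory, and EXPLAIN B is assumed to be how a plan like the one above was obtained.

```pig
-- repro.pig (hypothetical file name): consolidated reproduction of this report.
-- Assumes test.csv, with the four rows listed in the description, is in the working directory.

A = LOAD 'test.csv' USING PigStorage(',') AS (name:chararray, age:int, gpr:float);

-- The AS clause declares both FLATTEN outputs as chararray.
B = FOREACH A GENERATE
        FLATTEN(STRSPLIT(name, '_')) AS (name1:chararray, name2:chararray),
        age,
        gpr;

-- Reported: chararray on branch-0.12, bytearray on branch-0.17.
DESCRIBE B;

-- Print the plans for B.
EXPLAIN B;

-- Workaround from the description: cast the flattened fields explicitly.
B1 = FOREACH B GENERATE (chararray)name1, (chararray)name2, age, gpr;
DESCRIBE B1;
```

Saved as repro.pig, the script could be run in local mode on each branch (for example with pig -x local repro.pig) to compare the DESCRIBE output.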
      

       

      Attachments

        1. pig-5404-v01.patch (3 kB, Koji Noguchi)

          People

            Assignee: Koji Noguchi (knoguchi)
            Reporter: Bruno Pusztahazi (bpusztahazi)
            Votes: 0
            Watchers: 4
