Hive
  1. Hive
  2. HIVE-857

Transform should produce objects in the same type as specified at runtime

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.1
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The following code fails at runtime, because the "clicked" field produced by "transform" is actually of type String at runtime, instead of boolean.

      INSERT OVERWRITE TABLE feature_index
      SELECT
       temp.feature,
       temp.ad_id,
       sum(if(temp.clicked, 1, 0)) / cast(count(temp.clicked) as DOUBLE) as clicked_percent
      FROM (
       SELECT concat('ua:', trim(lower(ua.feature))) as feature, ua.ad_id, ua.clicked
       FROM (
        MAP raw_logs.user_agent, raw_logs.ad_id, raw_logs.clicked
        USING 'my.py' as (feature STRING, ad_id STRING, clicked BOOLEAN)
        FROM raw_logs
       ) ua
      ) temp
      GROUP BY temp.feature, temp.ad_id;
      

        Activity

        Hide
        Zheng Shao added a comment -

        Explain and stack traces. Thanks for Andrew Hitchcock for reporting this:

        EXPLAIN INSERT OVERWRITE TABLE feature_index
        SELECT
         temp.feature,
         temp.ad_id,
         sum(if(temp.clicked, 1, 0)) / cast(count(temp.clicked) as DOUBLE) as clicked_percent
        FROM (
         SELECT concat('ua:', trim(lower(ua.feature))) as feature, ua.ad_id, ua.clicked
         FROM (
          MAP raw_logs.user_agent, raw_logs.ad_id, raw_logs.clicked
          USING 's3://anhi-test-programs/hive/split_user_agent.py' as (feature STRING, ad_id STRING, clicked BOOLEAN)
          FROM raw_logs
         ) ua
        ) temp
        GROUP BY temp.feature, temp.ad_id;
        
        OK
        ABSTRACT SYNTAX TREE:
          (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF raw_logs)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TRANSFORM (TOK_EXPLIST (. (TOK_TABLE_OR_COL raw_logs) user_agent) (. (TOK_TABLE_OR_COL raw_logs) ad_id) (. (TOK_TABLE_OR_COL raw_logs) clicked)) 's3://anhi-test-programs/hive/split_user_agent.py' (TOK_TABCOLLIST (TOK_TABCOL feature TOK_STRING) (TOK_TABCOL ad_id TOK_STRING) (TOK_TABCOL clicked TOK_BOOLEAN))))))) ua)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTION concat 'ua:' (TOK_FUNCTION trim (TOK_FUNCTION lower (. (TOK_TABLE_OR_COL ua) feature)))) feature) (TOK_SELEXPR (. (TOK_TABLE_OR_COL ua) ad_id)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL ua) clicked))))) temp)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB feature_index)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL temp) feature)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL temp) ad_id)) (TOK_SELEXPR (/ (TOK_FUNCTION sum (TOK_FUNCTION if (. (TOK_TABLE_OR_COL temp) clicked) 1 0)) (TOK_FUNCTION TOK_DOUBLE (TOK_FUNCTION count (. (TOK_TABLE_OR_COL temp) clicked)))) clicked_percent)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL temp) feature) (. (TOK_TABLE_OR_COL temp) ad_id))))
        
        STAGE DEPENDENCIES:
          Stage-1 is a root stage
          Stage-0 depends on stages: Stage-1
        
        STAGE PLANS:
          Stage: Stage-1
            Map Reduce
              Alias -> Map Operator Tree:
                temp:ua:raw_logs 
                  TableScan
                    alias: raw_logs
                    Select Operator
                      expressions:
                            expr: user_agent
                            type: string
                            expr: ad_id
                            type: string
                            expr: clicked
                            type: boolean
                      outputColumnNames: _col0, _col1, _col2
                      Transform Operator
                        command: s3://anhi-test-programs/hive/split_user_agent.py
                        output info:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        Select Operator
                          expressions:
                                expr: concat('ua:', trim(lower(feature)))
                                type: string
                                expr: ad_id
                                type: string
                                expr: clicked
                                type: boolean
                          outputColumnNames: _col0, _col1, _col2
                          Select Operator
                            expressions:
                                  expr: _col0
                                  type: string
                                  expr: _col1
                                  type: string
                                  expr: _col2
                                  type: boolean
                            outputColumnNames: _col0, _col1, _col2
                            Group By Operator
                              aggregations:
                                    expr: sum(if(_col2, 1, 0))
                                    expr: count(_col2)
                              keys:
                                    expr: _col0
                                    type: string
                                    expr: _col1
                                    type: string
                              mode: hash
                              outputColumnNames: _col0, _col1, _col2, _col3
                              Reduce Output Operator
                                key expressions:
                                      expr: _col0
                                      type: string
                                      expr: _col1
                                      type: string
                                sort order: ++
                                Map-reduce partition columns:
                                      expr: _col0
                                      type: string
                                      expr: _col1
                                      type: string
                                tag: -1
                                value expressions:
                                      expr: _col2
                                      type: bigint
                                      expr: _col3
                                      type: bigint
              Reduce Operator Tree:
                Group By Operator
                  aggregations:
                        expr: sum(VALUE._col0)
                        expr: count(VALUE._col1)
                  keys:
                        expr: KEY._col0
                        type: string
                        expr: KEY._col1
                        type: string
                  mode: mergepartial
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Select Operator
                    expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                          expr: (UDFToDouble(_col2) / UDFToDouble(_col3))
                          type: double
                    outputColumnNames: _col0, _col1, _col2
                    File Output Operator
                      compressed: false
                      GlobalTableId: 1
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: feature_index
        
          Stage: Stage-0
            Move Operator
              tables:
                  replace: true
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      name: feature_index
        
        
        And the Hadoop stack trace:
        
        java.lang.RuntimeException: Map operator initialization failed                                                                                                           
                at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:110)                                                                                      
                at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)                                                                                       
                at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)                                                                                   
                at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)                                                                                               
                at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)                                                                                       
                at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)                                                                                   
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:224)                                                                                                        
                at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2204)                                                                                        
        Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot initialize ScriptOperator                                                                            
                at org.apache.hadoop.hive.ql.exec.ScriptOperator.initializeOp(ScriptOperator.java:266)                                                                           
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317)                                                                                 
                at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58)                                                                            
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317)                                                                                 
                at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:303)                                                                                       
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295)                                                                                         
                at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:301)                                                                                 
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295)                                                                                         
                at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:82)                                                                                       
                ... 7 more                                                                                                                                                       
        Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The first argument of function IF should be "boolean", but "string"                                  
         is found                                                                                                                                                                
                at org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.initialize(GenericUDFIf.java:58)                                                                           
                at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)                                                  
                at org.apache.hadoop.hive.ql.exec.GroupByOperator.initializeOp(GroupByOperator.java:179)                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317)                                                                                 
                at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58)                                                                            
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317)                                                                                 
                at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58)                                                                            
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332)                                                                                         
                at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317)                                                                                 
                at org.apache.hadoop.hive.ql.exec.ScriptOperator.initializeOp(ScriptOperator.java:262)                                                                           
                ... 19 more   
        
        Show
        Zheng Shao added a comment - Explain and stack traces. Thanks for Andrew Hitchcock for reporting this: EXPLAIN INSERT OVERWRITE TABLE feature_index SELECT temp.feature, temp.ad_id, sum( if (temp.clicked, 1, 0)) / cast (count(temp.clicked) as DOUBLE) as clicked_percent FROM ( SELECT concat('ua:', trim(lower(ua.feature))) as feature, ua.ad_id, ua.clicked FROM ( MAP raw_logs.user_agent, raw_logs.ad_id, raw_logs.clicked USING 's3: //anhi-test-programs/hive/split_user_agent.py' as (feature STRING, ad_id STRING, clicked BOOLEAN) FROM raw_logs ) ua ) temp GROUP BY temp.feature, temp.ad_id; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF raw_logs)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TRANSFORM (TOK_EXPLIST (. (TOK_TABLE_OR_COL raw_logs) user_agent) (. (TOK_TABLE_OR_COL raw_logs) ad_id) (. (TOK_TABLE_OR_COL raw_logs) clicked)) 's3: //anhi-test-programs/hive/split_user_agent.py' (TOK_TABCOLLIST (TOK_TABCOL feature TOK_STRING) (TOK_TABCOL ad_id TOK_STRING) (TOK_TABCOL clicked TOK_BOOLEAN))))))) ua)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTION concat 'ua:' (TOK_FUNCTION trim (TOK_FUNCTION lower (. (TOK_TABLE_OR_COL ua) feature)))) feature) (TOK_SELEXPR (. (TOK_TABLE_OR_COL ua) ad_id)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL ua) clicked))))) temp)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB feature_index)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL temp) feature)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL temp) ad_id)) (TOK_SELEXPR (/ (TOK_FUNCTION sum (TOK_FUNCTION if (. (TOK_TABLE_OR_COL temp) clicked) 1 0)) (TOK_FUNCTION TOK_DOUBLE (TOK_FUNCTION count (. (TOK_TABLE_OR_COL temp) clicked)))) clicked_percent)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL temp) feature) (. (TOK_TABLE_OR_COL temp) ad_id)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: temp:ua:raw_logs TableScan alias: raw_logs Select Operator expressions: expr: user_agent type: string expr: ad_id type: string expr: clicked type: boolean outputColumnNames: _col0, _col1, _col2 Transform Operator command: s3: //anhi-test-programs/hive/split_user_agent.py output info: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Select Operator expressions: expr: concat('ua:', trim(lower(feature))) type: string expr: ad_id type: string expr: clicked type: boolean outputColumnNames: _col0, _col1, _col2 Select Operator expressions: expr: _col0 type: string expr: _col1 type: string expr: _col2 type: boolean outputColumnNames: _col0, _col1, _col2 Group By Operator aggregations: expr: sum( if (_col2, 1, 0)) expr: count(_col2) keys: expr: _col0 type: string expr: _col1 type: string mode: hash outputColumnNames: _col0, _col1, _col2, _col3 Reduce Output Operator key expressions: expr: _col0 type: string expr: _col1 type: string sort order: ++ Map-reduce partition columns: expr: _col0 type: string expr: _col1 type: string tag: -1 value expressions: expr: _col2 type: bigint expr: _col3 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: sum(VALUE._col0) expr: count(VALUE._col1) keys: expr: KEY._col0 type: string expr: KEY._col1 type: string mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3 Select Operator expressions: expr: _col0 type: string expr: _col1 type: string expr: (UDFToDouble(_col2) / UDFToDouble(_col3)) type: double outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 1 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: feature_index Stage: Stage-0 Move Operator tables: replace: true table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: feature_index And the Hadoop stack trace: java.lang.RuntimeException: Map operator initialization failed at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:110) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:224) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2204) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot initialize ScriptOperator at org.apache.hadoop.hive.ql.exec.ScriptOperator.initializeOp(ScriptOperator.java:266) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317) at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317) at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:303) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295) at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:301) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295) at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:82) ... 7 more Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The first argument of function IF should be " boolean " , but "string" is found at org.apache.hadoop.hive.ql.udf. generic .GenericUDFIf.initialize(GenericUDFIf.java:58) at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77) at org.apache.hadoop.hive.ql.exec.GroupByOperator.initializeOp(GroupByOperator.java:179) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317) at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317) at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:295) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:332) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:317) at org.apache.hadoop.hive.ql.exec.ScriptOperator.initializeOp(ScriptOperator.java:262) ... 19 more
        Hide
        Zheng Shao added a comment -

        This only affects 0.4.

        The transaction that added the new syntax is r803371, HIVE-743. Let user specify serde for custom sctipts. (Namit Jain via rmurthy)
        The transaction that fixed the issue is r807330, HIVE-708. Add TypedBytes SerDe for transform. (Namit Jain via zshao).

        We may either pull in the whole of HIVE-708 to branch 0.4, or split HIVE-708's transaction.

        Show
        Zheng Shao added a comment - This only affects 0.4. The transaction that added the new syntax is r803371, HIVE-743 . Let user specify serde for custom sctipts. (Namit Jain via rmurthy) The transaction that fixed the issue is r807330, HIVE-708 . Add TypedBytes SerDe for transform. (Namit Jain via zshao). We may either pull in the whole of HIVE-708 to branch 0.4, or split HIVE-708 's transaction.
        Hide
        Zheng Shao added a comment -

        This patch is mainly a subset of HIVE-708, which does NOT include TypedBytes. TypedBytes is a new feature - we do not need to propagate it back to branch-0.4.

        Show
        Zheng Shao added a comment - This patch is mainly a subset of HIVE-708 , which does NOT include TypedBytes. TypedBytes is a new feature - we do not need to propagate it back to branch-0.4.
        Hide
        Namit Jain added a comment -

        Committed, Thanks Zheng

        Show
        Namit Jain added a comment - Committed, Thanks Zheng

          People

          • Assignee:
            Zheng Shao
            Reporter:
            Zheng Shao
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development