SQOOP-3068

Enhance error (tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema) to suggest workaround (--map-column-java)


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.6
    • Fix Version/s: 1.4.7
    • Component/s: hive-integration
    • Labels:
      None

      Description

      Please consider enhancing the error message to include more detail and to suggest the workaround (--map-column-java).

      Sqoop (import + --hive-import + --as-parquetfile) can fail due to a mismatch between the JSON (Avro) schema derived from the existing Hive table and the one that Sqoop generates from the source table. The test case below demonstrates how to reproduce the issue as well as how to work around it.

      SETUP (create parquet table with Sqoop import and Beeline CTAS)

      STEP 01 : Create MySQL Table
      
      sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1"
      sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1 (c_int int, c_date date, c_timestamp timestamp)"
      sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "describe t1"
      ---------------------------------------------------------------------------------------------------------
      | Field                | Type                 | Null | Key | Default              | Extra                | 
      ---------------------------------------------------------------------------------------------------------
      | c_int                | int(11)              | YES |     | (null)               |                      | 
      | c_date               | date                 | YES |     | (null)               |                      | 
      | c_timestamp          | timestamp            | NO  |     | CURRENT_TIMESTAMP    | on update CURRENT_TIMESTAMP | 
      ---------------------------------------------------------------------------------------------------------
      
      STEP 02 : Insert and Select Row
      
      sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "insert into t1 values (1, current_date(), current_timestamp())"
      sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select * from t1"
      --------------------------------------------------
      | c_int       | c_date     | c_timestamp         | 
      --------------------------------------------------
      | 1           | 2016-10-26 | 2016-10-26 14:30:33.0 | 
      --------------------------------------------------
      
      STEP 03 : Create Hive Tables
      
      beeline -u jdbc:hive2:// -e "use default; drop table t1"
      sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hcatalog-database default --hcatalog-table t1_text --create-hcatalog-table --hcatalog-storage-stanza 'stored as parquet' --num-mappers 1
      beeline -u jdbc:hive2:// -e "use default; create table t1 stored as parquet as select * from t1_text;show create table t1;"
      
      +----------------------------------------------------+--+
      |                   createtab_stmt                   |
      +----------------------------------------------------+--+
      | CREATE TABLE `t1`(                                 |
      |   `c_int` int,                                     |
      |   `c_date` string,                                 |
      |   `c_timestamp` string)                            |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
      | LOCATION                                           |
      |   'hdfs://<namenode>:8020/user/hive/warehouse/t1' |
      | TBLPROPERTIES (                                    |
      |   'COLUMN_STATS_ACCURATE'='true',                  |
      |   'numFiles'='1',                                  |
      |   'numRows'='2',                                   |
      |   'rawDataSize'='6',                               |
      |   'totalSize'='605',                               |
      |   'transient_lastDdlTime'='1478298298')            |
      +----------------------------------------------------+--+
      

      REPRODUCE ISSUE (import --hive-import append and overwrite)

      STEP 01: Attempt --hive-import --append
      	
      sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --append
      
      16/11/02 20:54:23 WARN mapreduce.DataDrivenImportJob: Target Hive table 't1' exists! Sqoop will append data into the existing Hive table. Consider using --hive-overwrite, if you do NOT intend to do appending.
      16/11/02 20:54:24 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema: {"type":"record","name":"t1","fields":[{"name":"c_int","type":["null","int"],"doc":"Converted from 'int'","default":null},{"name":"c_date","type":["null","string"],"doc":"Converted from 'string'","default":null},{"name":"c_timestamp","type":["null","string"],"doc":"Converted from 'string'","default":null}]}
      Actual schema: {"type":"record","name":"t1","doc":"Sqoop import of t1","fields":[{"name":"c_int","type":["null","int"],"default":null,"columnName":"c_int","sqlType":"4"},{"name":"c_date","type":["null","long"],"default":null,"columnName":"c_date","sqlType":"91"},{"name":"c_timestamp","type":["null","long"],"default":null,"columnName":"c_timestamp","sqlType":"93"}],"tableName":"t1"}
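The root cause is visible by comparing the field types in the two schemas above: the Hive Parquet table declares c_date and c_timestamp as string, while Sqoop's generated schema maps DATE (sqlType 91) and TIMESTAMP (93) to long. A quick way to see exactly where the schemas disagree (types copied from the log above):

```shell
# Field types copied from the "Expected schema" (Hive table) and the
# "Actual schema" (Sqoop-generated) in the error above.
printf 'c_int:int\nc_date:string\nc_timestamp:string\n' > /tmp/expected_types.txt
printf 'c_int:int\nc_date:long\nc_timestamp:long\n'     > /tmp/actual_types.txt

# Only the DATE and TIMESTAMP columns disagree (string vs. long).
diff /tmp/expected_types.txt /tmp/actual_types.txt || true
```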
      
      STEP 02: Attempt --hive-import --hive-overwrite
      
      sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --hive-overwrite
      
      16/11/02 20:56:55 INFO hive.metastore: Connected to metastore.
      16/11/02 20:56:56 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema: {"type":"record","name":"t1","fields":[{"name":"c_int","type":["null","int"],"doc":"Converted from 'int'","default":null},{"name":"c_date","type":["null","string"],"doc":"Converted from 'string'","default":null},{"name":"c_timestamp","type":["null","string"],"doc":"Converted from 'string'","default":null}]}
      Actual schema: {"type":"record","name":"t1","doc":"Sqoop import of t1","fields":[{"name":"c_int","type":["null","int"],"default":null,"columnName":"c_int","sqlType":"4"},{"name":"c_date","type":["null","long"],"default":null,"columnName":"c_date","sqlType":"91"},{"name":"c_timestamp","type":["null","long"],"default":null,"columnName":"c_timestamp","sqlType":"93"}],"tableName":"t1"}
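A sketch of the workaround the summary refers to: override the Java type of the mismatched columns with --map-column-java so that Sqoop's generated schema carries string for c_date and c_timestamp, matching the Hive table. The column names are taken from the example table above; adjust them for your own schema.

```shell
# Hypothetical re-run of the failing import, mapping the DATE/TIMESTAMP
# columns to String so the generated schema agrees with the Hive table.
sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD \
  --table t1 --hive-import --hive-database default --hive-table t1 \
  --as-parquetfile --num-mappers 1 --append \
  --map-column-java c_date=String,c_timestamp=String
```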
      

        Attachments

        1. SQOOP-3068.patch
          17 kB
          Szabolcs Vasas
        2. SQOOP-3068.patch
          17 kB
          Szabolcs Vasas
        3. SQOOP-3068.patch
          18 kB
          Szabolcs Vasas
        4. SQOOP-3068.patch
          20 kB
          Szabolcs Vasas

              People

               • Assignee: Szabolcs Vasas (vasas)
               • Reporter: Markus Kemper (markuskemper@me.com)
               • Votes: 0
               • Watchers: 6
