Sqoop / SQOOP-489

Cannot define partition keys for Hive tables created through Sqoop

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1-incubating
    • Fix Version/s: 1.4.2
    • Component/s: None
    • Labels: None

      Description

      By enabling the table option, Sqoop includes every column of the table in the CREATE TABLE query, and by enabling the hive-partition-key option, Sqoop blindly appends the "partitioned by" clause. Now, if you specify one of the table's columns as the hive-partition-key, this causes a syntax error in Hive.

      For example, if we have a table 'FOO' that has columns 'I' and 'J':

      sqoop create-hive-table --table FOO ...

      will generate the following Hive query:

      CREATE TABLE IF NOT EXISTS `FOO` ( `I` STRING, `J` STRING)

      Now if we add "--hive-partition-key I" to the command, Sqoop generates the following query:

      CREATE TABLE IF NOT EXISTS `FOO` ( `I` STRING, `J` STRING) PARTITIONED BY (I STRING)

      The problem is that since 'I' is defined twice (once in CREATE TABLE and once in PARTITIONED BY), this is a syntax error in Hive.

      The correct query would be something like:

      CREATE TABLE IF NOT EXISTS `FOO` (`J` STRING) PARTITIONED BY (I STRING)
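The statement-generation logic under discussion can be sketched as follows. This is a hypothetical illustration of excluding the partition key from the column list, not Sqoop's actual TableDefWriter code; the class and method names are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionedDdlSketch {
    // Builds a Hive CREATE TABLE statement, omitting the partition key
    // from the regular column list so the column is not defined twice.
    static String createTableStatement(String table,
                                       Map<String, String> columns,
                                       String partitionKey) {
        StringBuilder sb = new StringBuilder("CREATE TABLE IF NOT EXISTS `")
                .append(table).append("` (");
        boolean first = true;
        for (Map.Entry<String, String> col : columns.entrySet()) {
            // Skip the partition key; Hive declares it in PARTITIONED BY.
            if (col.getKey().equalsIgnoreCase(partitionKey)) {
                continue;
            }
            if (!first) {
                sb.append(", ");
            }
            sb.append('`').append(col.getKey()).append("` ")
              .append(col.getValue());
            first = false;
        }
        sb.append(')');
        if (partitionKey != null) {
            sb.append(" PARTITIONED BY (").append(partitionKey)
              .append(" STRING)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("I", "STRING");
        cols.put("J", "STRING");
        // Prints: CREATE TABLE IF NOT EXISTS `FOO` (`J` STRING) PARTITIONED BY (I STRING)
        System.out.println(createTableStatement("FOO", cols, "I"));
    }
}
```

As the rest of the discussion shows, fixing the DDL alone is not enough; the imported data must also be kept in sync.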

      1. SQOOP-489.patch
        4 kB
        Cheolsoo Park
      2. SQOOP-489.patch
        4 kB
        Cheolsoo Park
      3. SQOOP-489.patch
        0.9 kB
        Cheolsoo Park

        Activity

        Cheolsoo Park added a comment -

        Thanks for filing a JIRA, Kate.

        To fix the issue, I suggest that we exclude the column specified as the partition key from the column definitions whenever a partition key is specified.

        i.e.

        CREATE TABLE IF NOT EXISTS `FOO` (`I` STRING, `J` STRING) PARTITIONED BY (I STRING)
        

        =>

        CREATE TABLE IF NOT EXISTS `FOO` (`J` STRING) PARTITIONED BY (I STRING)
        

        I also thought about adding the --columns option to create-hive-table, but this seems less preferable because it would force the user to specify all the columns except the partition key when importing a table.

        Please let me know if anyone has a better suggestion.

        Another note I'd like to add: in my experiments, the current query sometimes works in Hive and sometimes doesn't, so you might find that it works for you. Nevertheless, I confirmed with a Hive developer that this should be a syntax error, so we should not rely on this Hive behavior.

        Jarek Jarcec Cecho added a comment -

        I believe that adjusting the generated "CREATE TABLE" statement is only half of the job. We also need to stop importing those columns; otherwise we would end up with a different number of columns in the imported file than in Hive, which might lead to huge confusion on the user side.

        Jarcec

        Cheolsoo Park added a comment -

        Hello Jarcec,

        Thank you for your comment. I am sorry, but I am a bit confused about your concern. Even if we exclude the partition key from the CREATE TABLE statement, the Hive table will still have that column, won't it? So this change applies only to the statement, not to the table itself.

        Also, unless getColumnNames() is modified, it shouldn't affect the number of columns in the import job. Am I missing something here? Please correct me if I am wrong.

        Cheolsoo Park added a comment - edited

        I need to do more testing, but this was basically what I had in mind:
        https://issues.apache.org/jira/secure/attachment/12529678/SQOOP-489.patch

        Please let me know if you see any problem with my approach. Thanks.

        Jarek Jarcec Cecho added a comment -

        I wasn't checking the sources, so I might be wrong here, but let me explain what I was afraid of with an example:

        Imagine a table with three columns A, B, C and one row: 1, 2, 3. The user specifies that A should be used as the partition column. After applying the suggested fix, we would generate the following CREATE TABLE statement:

        CREATE TABLE IF NOT EXISTS `FOO` (`B` STRING, `C` STRING) PARTITIONED BY (A STRING)

        But a Sqoop import will by default import all columns, resulting in the file "1,2,3". Now, when Hive reads the file, column B will contain the value 1 (i.e. the value of A), column C will contain the value 2 (the value of column B), and the value 3 (meant for column C) will be missing entirely from Hive's perspective. Again, I did no research, nor did I try it in practice; this is based solely on my experience, and I wanted to express my worries prior to any coding.

        Is it more clear now?

        Jarcec
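The positional shift Jarcec describes can be demonstrated in a few lines. This is an illustrative sketch of how Hive maps delimited file fields to declared columns by position, not Sqoop or Hive code.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnShiftDemo {
    public static void main(String[] args) {
        // Hive assigns delimited file fields to non-partition columns
        // by position. If the file still carries all three source
        // columns but the table only declares B and C, every value
        // shifts one place to the left.
        List<String> hiveColumns = Arrays.asList("B", "C"); // A moved to PARTITIONED BY
        String fileRow = "1,2,3"; // Sqoop still exported A, B, C

        String[] fields = fileRow.split(",");
        for (int i = 0; i < hiveColumns.size(); i++) {
            System.out.println(hiveColumns.get(i) + " = " + fields[i]);
        }
        // Prints:
        //   B = 1   (actually A's value)
        //   C = 2   (actually B's value)
        // The value 3 is never read at all.
    }
}
```

This matches the corrupted Hive view Cheolsoo reproduces in his experiment below.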

        Cheolsoo Park added a comment -

        Hi Jarcec, that is a good point. The order of columns could be a problem. I will verify this and get back to you.

        Thank you for the explanation!

        Show
        Cheolsoo Park added a comment - Hi Jarcec, that is a good point. The order of columns could be a problem. I will verify this and get back to you. Thank you for the explanation!
        Hide
        Cheolsoo Park added a comment -

        Hi Jarcec, you're right. My patch screws up the Hive view if I do a Hive import.

        Here is the result of my experiment:

        1. Table foo in Oracle

        SQL> select * from foo;
        
        	 I	    J	       K
        ---------- ---------- ----------
        	 1	    2	       3
        

        2. Sqoop command

        sqoop import --verbose ... --table SQOOPTEST.FOO -m 1 --hive-import --hive-table FOO --hive-partition-key I --hive-partition-value 1.0
        

        3. Output file in Hdfs

        1,2,3
        

        4. Hive view

        hive> select * from foo;
        OK
        1.0	2.0	1.0
        

        5. Hive table definition

        hive> SHOW TABLE EXTENDED LIKE foo;
        OK
        tableName:foo
        owner:cheolsoo
        location:hdfs://localhost/user/hive/warehouse/foo
        inputformat:org.apache.hadoop.mapred.TextInputFormat
        outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        columns:struct columns { double j, double k}
        partitioned:true
        partitionColumns:struct partition_columns { string i}
        totalNumberFiles:2
        totalFileSize:6
        maxFileSize:6
        minFileSize:0
        lastAccessTime:1337993171274
        lastUpdateTime:1337993176838
        
        Jarek Jarcec Cecho added a comment -

        I also have one concern with the automatic fix. This time, let me start with an example.

        Let's consider a table with three columns A, B, and C and two rows: "1, 2, 3" and "4, 5, 6". Also, let's say that we applied the suggested fix and also solved the issue with importing all columns. The Sqoop command would contain --hive-partition A --hive-value 666. Then the file on HDFS would probably contain two rows, "2, 3" and "5, 6", and the view in Hive would be:

        B C A
        2 3 666
        5 6 666

        I do not see an issue with changing the column ordering; however, I'm concerned that we silently ignored one column of the table and replaced it with a constant value from the command line – both values "1" and "4" from the original table were lost.

        I do not have an easy solution to this problem, and therefore I would suggest fixing this JIRA by exiting gracefully in case the user specifies the partition column as one of the exported columns.

        Jarcec

        Cheolsoo Park added a comment -

        Hi Jarcec, thank you for your kind explanation. Examples always help.

        > I'm concerned that we silently ignored one column in the table and replaced it with constant value from command line – both values "1" and "4" from original table were lost.

        I have a quick question. The case you're describing can be reproduced even now with '--columns B,C --hive-partition A --hive-partition-value 666', right? In that case, we're also replacing column A with a constant value. Are you assuming that this is OK because the user explicitly specified what he/she is doing?

        After all, what I am debating is whether or not the Hive table always has to mirror the source table that it is importing from.

        Jarek Jarcec Cecho added a comment -

        Yeah, I figured that it would be better to explain it with an example, given my level of English.

        To answer your question: yes, exactly. If the user does '--columns B,C --hive-partition A --hive-partition-value 666', then it's his personal responsibility to know what he is doing. I'm afraid that doing something similar behind his back might create confusion for some users.

        Jarcec

        Cheolsoo Park added a comment -

        OK, thanks for clarification, Jarcec. I agree with you.

        I think that there are two cases that we have to deal with:

        1) sqoop create-hive-table --table FOO --hive-partition-key X
        2) sqoop import --hive-import --table FOO --hive-partition-key X

        Following Jarcec's suggestion, the fix would be summarized as follows:

        For 1),
        --hive-partition-key X is invalid if X is a column of table FOO. (Note that as of now, --columns is not available for create-hive-table.)

        For 2),
        i) if --columns is not specified, --hive-partition-key X is invalid if X is a column of table FOO.
        ii) if --columns is specified, --hive-partition-key X is invalid if X is specified in --columns.

        Currently, Sqoop generates ill-formed Hive queries for these cases; instead, Sqoop should print a proper warning message.

        Please let me know if anyone has other opinions.
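The agreed-upon validation could be sketched as follows. This is a hypothetical illustration of rejecting a partition key that collides with an imported column, covering both cases above; it is not the code actually committed to TableDefWriter, and the class and method names are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

public class PartitionKeyValidation {
    // Rejects the job when the partition key collides with one of the
    // columns to import, instead of emitting an ill-formed CREATE TABLE
    // statement. Hive identifiers are compared case-insensitively here.
    static void validatePartitionKey(String partitionKey, List<String> columns) {
        for (String col : columns) {
            if (col.equalsIgnoreCase(partitionKey)) {
                throw new IllegalArgumentException(
                    "Partition key " + partitionKey
                    + " cannot be a column to import.");
            }
        }
    }

    public static void main(String[] args) {
        // Case 2.ii: --columns B,C with --hive-partition-key A is valid...
        validatePartitionKey("A", Arrays.asList("B", "C"));
        // ...but --hive-partition-key B must be rejected.
        try {
            validatePartitionKey("B", Arrays.asList("B", "C"));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

For case 1 (create-hive-table) the same check would run against all of the table's columns, since --columns is not available for that tool.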

        Jarek Jarcec Cecho added a comment -

        Cheolsoo: +1

        Jarcec

        Cheolsoo Park added a comment -

        Review board: https://reviews.apache.org/r/5291/
        Jarek Jarcec Cecho added a comment -

        Committed revision 1344429.

        Thank you very much for your patch Cheolsoo!

        Jarcec

        Hudson added a comment -

        Integrated in Sqoop-ant-jdk-1.6 #116 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6/116/)
        SQOOP-489. Cannot define partition keys for Hive tables created through Sqoop.

        (Cheolsoo Park via Jarek Jarcec Cecho) (Revision 1344429)

        Result = SUCCESS
        jarcec :
        Files :

        • /sqoop/trunk/src/java/org/apache/sqoop/hive/TableDefWriter.java
        • /sqoop/trunk/src/test/com/cloudera/sqoop/hive/TestHiveImport.java

          People

          • Assignee: Cheolsoo Park
          • Reporter: Kathleen Ting
          • Votes: 0
          • Watchers: 5

            Dates

            • Created:
            • Updated:
            • Resolved: