Sqoop
  1. Sqoop
  2. SQOOP-342

Allow user to override sqoop type mapping

    Details

      Description

      It would be nice if user would be able to override any type implicit sqoop mapping (from SQL to Java or Hive) on command line.

      1. SQOOP-342.patch
        25 kB
        Jarek Jarcec Cecho

        Activity

        Hide
        Hudson added a comment -

        Integrated in Sqoop-jdk-1.6 #38 (See https://builds.apache.org/job/Sqoop-jdk-1.6/38/)
        SQOOP-342. Allow user to override Sqoop type mapping.

        (Jarek Jarcec Checho via Arvind Prabhakar)

        arvind : http://svn.apache.org/viewvc/?view=rev&rev=1182523
        Files :

        • /incubator/sqoop/trunk/src/docs/man/codegen-args.txt
        • /incubator/sqoop/trunk/src/docs/man/hive-args.txt
        • /incubator/sqoop/trunk/src/docs/man/import-args.txt
        • /incubator/sqoop/trunk/src/docs/user/codegen-args.txt
        • /incubator/sqoop/trunk/src/docs/user/codegen.txt
        • /incubator/sqoop/trunk/src/docs/user/hive-args.txt
        • /incubator/sqoop/trunk/src/docs/user/import.txt
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/SqoopOptions.java
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/hive/TableDefWriter.java
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/orm/ClassWriter.java
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/tool/ImportTool.java
        • /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/TestSqoopOptions.java
        • /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java
        • /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/orm/TestClassWriter.java
        Show
        Hudson added a comment - Integrated in Sqoop-jdk-1.6 #38 (See https://builds.apache.org/job/Sqoop-jdk-1.6/38/ ) SQOOP-342 . Allow user to override Sqoop type mapping. (Jarek Jarcec Checho via Arvind Prabhakar) arvind : http://svn.apache.org/viewvc/?view=rev&rev=1182523 Files : /incubator/sqoop/trunk/src/docs/man/codegen-args.txt /incubator/sqoop/trunk/src/docs/man/hive-args.txt /incubator/sqoop/trunk/src/docs/man/import-args.txt /incubator/sqoop/trunk/src/docs/user/codegen-args.txt /incubator/sqoop/trunk/src/docs/user/codegen.txt /incubator/sqoop/trunk/src/docs/user/hive-args.txt /incubator/sqoop/trunk/src/docs/user/import.txt /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/SqoopOptions.java /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/hive/TableDefWriter.java /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/orm/ClassWriter.java /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/tool/ImportTool.java /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/TestSqoopOptions.java /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/orm/TestClassWriter.java
        Hide
        Arvind Prabhakar added a comment -

        Patch committed. Thanks Jarek!

        Show
        Arvind Prabhakar added a comment - Patch committed. Thanks Jarek!
        Hide
        Jarek Jarcec Cecho added a comment -

        Final version of the patch without AVRO support.

        Show
        Jarek Jarcec Cecho added a comment - Final version of the patch without AVRO support.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/#review2535
        -----------------------------------------------------------

        Ship it!

        +1 Lets get this patch in and file a follow-up JIRA to fix Avro support for mapping override. Please attach the patch to the JIRA issue inorder for it to be committed.

        • Arvind

        On 2011-10-08 07:14:15, Jarek Jarcec wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/1975/

        -----------------------------------------------------------

        (Updated 2011-10-08 07:14:15)

        Review request for Sqoop and Arvind Prabhakar.

        Summary

        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        * Tests

        * Documentation

        * Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.

        https://issues.apache.org/jira/browse/sqoop-342

        Diffs

        -----

        /src/docs/man/codegen-args.txt 1180125

        /src/docs/man/hive-args.txt 1180125

        /src/docs/man/import-args.txt 1180125

        /src/docs/user/codegen-args.txt 1180125

        /src/docs/user/codegen.txt 1180125

        /src/docs/user/hive-args.txt 1180125

        /src/docs/user/import.txt 1180125

        /src/java/com/cloudera/sqoop/SqoopOptions.java 1180125

        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1180125

        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1180125

        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1180125

        /src/java/com/cloudera/sqoop/tool/ImportTool.java 1180125

        /src/test/com/cloudera/sqoop/TestSqoopOptions.java 1180125

        /src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 1180125

        /src/test/com/cloudera/sqoop/orm/TestClassWriter.java 1180125

        Diff: https://reviews.apache.org/r/1975/diff

        Testing

        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/#review2535 ----------------------------------------------------------- Ship it! +1 Lets get this patch in and file a follow-up JIRA to fix Avro support for mapping override. Please attach the patch to the JIRA issue inorder for it to be committed. Arvind On 2011-10-08 07:14:15, Jarek Jarcec wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-10-08 07:14:15) Review request for Sqoop and Arvind Prabhakar. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: * Tests * Documentation * Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs ----- /src/docs/man/codegen-args.txt 1180125 /src/docs/man/hive-args.txt 1180125 /src/docs/man/import-args.txt 1180125 /src/docs/user/codegen-args.txt 1180125 /src/docs/user/codegen.txt 1180125 /src/docs/user/hive-args.txt 1180125 /src/docs/user/import.txt 1180125 /src/java/com/cloudera/sqoop/SqoopOptions.java 1180125 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1180125 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1180125 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1180125 /src/java/com/cloudera/sqoop/tool/ImportTool.java 1180125 /src/test/com/cloudera/sqoop/TestSqoopOptions.java 1180125 /src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 1180125 /src/test/com/cloudera/sqoop/orm/TestClassWriter.java 1180125 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/
        -----------------------------------------------------------

        (Updated 2011-10-08 07:14:15.084689)

        Review request for Sqoop and Arvind Prabhakar.

        Changes
        -------

        Hi Arvind,
        I've incorporated all your suggestions with one exception - I left Avro untouched. It's not trivial to add this kind of support for Avro, because it seems to me that Avro is expecting numbers and not type names. So I'll probably introduce new parameter for overriding Avro mapping and put some code to allow user specify names that will be automatically translated into appropriate integers.

        Doing that is not that straightforward and because there is currently demand for this patch I think that it would be better to close this issue with current functionality and add Avro support later. I've already put validation for parameters that will rise an exception in case that both mapping and Avro output are requested.

        What do you think?

        Jarcec

        Summary
        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        • Tests
        • Documentation
        • Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.
        https://issues.apache.org/jira/browse/sqoop-342

        Diffs (updated)


        /src/docs/man/codegen-args.txt 1180125
        /src/docs/man/hive-args.txt 1180125
        /src/docs/man/import-args.txt 1180125
        /src/docs/user/codegen-args.txt 1180125
        /src/docs/user/codegen.txt 1180125
        /src/docs/user/hive-args.txt 1180125
        /src/docs/user/import.txt 1180125
        /src/java/com/cloudera/sqoop/SqoopOptions.java 1180125
        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1180125
        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1180125
        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1180125
        /src/java/com/cloudera/sqoop/tool/ImportTool.java 1180125
        /src/test/com/cloudera/sqoop/TestSqoopOptions.java 1180125
        /src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 1180125
        /src/test/com/cloudera/sqoop/orm/TestClassWriter.java 1180125

        Diff: https://reviews.apache.org/r/1975/diff

        Testing
        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-10-08 07:14:15.084689) Review request for Sqoop and Arvind Prabhakar. Changes ------- Hi Arvind, I've incorporated all your suggestions with one exception - I left Avro untouched. It's not trivial to add this kind of support for Avro, because it seems to me that Avro is expecting numbers and not type names. So I'll probably introduce new parameter for overriding Avro mapping and put some code to allow user specify names that will be automatically translated into appropriate integers. Doing that is not that straightforward and because there is currently demand for this patch I think that it would be better to close this issue with current functionality and add Avro support later. I've already put validation for parameters that will rise an exception in case that both mapping and Avro output are requested. What do you think? Jarcec Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: Tests Documentation Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs (updated) /src/docs/man/codegen-args.txt 1180125 /src/docs/man/hive-args.txt 1180125 /src/docs/man/import-args.txt 1180125 /src/docs/user/codegen-args.txt 1180125 /src/docs/user/codegen.txt 1180125 /src/docs/user/hive-args.txt 1180125 /src/docs/user/import.txt 1180125 /src/java/com/cloudera/sqoop/SqoopOptions.java 1180125 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1180125 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1180125 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1180125 /src/java/com/cloudera/sqoop/tool/ImportTool.java 1180125 /src/test/com/cloudera/sqoop/TestSqoopOptions.java 1180125 /src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 1180125 /src/test/com/cloudera/sqoop/orm/TestClassWriter.java 1180125 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/#review2279
        -----------------------------------------------------------

        Changes look good Jarek. Some feedback noted below:

        /src/docs/user/import.txt
        <https://reviews.apache.org/r/1975/#comment5253>

        nit:Trailing whitespace.

        /src/docs/user/import.txt
        <https://reviews.apache.org/r/1975/#comment5254>

        nit:Trailing whitespace.

        /src/docs/user/import.txt
        <https://reviews.apache.org/r/1975/#comment5255>

        nit:Trailing whitespace.

        /src/java/com/cloudera/sqoop/SqoopOptions.java
        <https://reviews.apache.org/r/1975/#comment5294>

        Nice to have: Consider storing these as properties object. Look at connectionParams as an example.

        /src/java/com/cloudera/sqoop/SqoopOptions.java
        <https://reviews.apache.org/r/1975/#comment5295>

        Nice to have: Basic validations here. Perhaps a Class.forName() test to see if the type actually corresponds to an actual object?

        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java
        <https://reviews.apache.org/r/1975/#comment5265>

        Nice to have: Better to reword this to something like "No column by the name found while importing data". The user may not be aware of what a result set actually is.

        /src/java/com/cloudera/sqoop/orm/ClassWriter.java
        <https://reviews.apache.org/r/1975/#comment5293>

        This method needs to be overwritten, otherwise the Avro Datafile format will likely not work. See AvroSchemaGenerator.generate() method for details.

        • Arvind

        On 2011-10-02 08:52:10, Jarek Jarcec wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/1975/

        -----------------------------------------------------------

        (Updated 2011-10-02 08:52:10)

        Review request for Sqoop and Arvind Prabhakar.

        Summary

        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        * Tests

        * Documentation

        * Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.

        https://issues.apache.org/jira/browse/sqoop-342

        Diffs

        -----

        /src/docs/man/codegen-args.txt 1177987

        /src/docs/man/hive-args.txt 1177987

        /src/docs/man/import-args.txt 1177987

        /src/docs/user/codegen-args.txt 1177987

        /src/docs/user/codegen.txt 1177987

        /src/docs/user/hive-args.txt 1177987

        /src/docs/user/import.txt 1177987

        /src/java/com/cloudera/sqoop/SqoopOptions.java 1177987

        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1177987

        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1177987

        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1177987

        /src/test/com/cloudera/sqoop/TestSqoopOptions.java 1177987

        /src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 1177987

        /src/test/com/cloudera/sqoop/orm/TestClassWriter.java 1177987

        Diff: https://reviews.apache.org/r/1975/diff

        Testing

        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/#review2279 ----------------------------------------------------------- Changes look good Jarek. Some feedback noted below: /src/docs/user/import.txt < https://reviews.apache.org/r/1975/#comment5253 > nit:Trailing whitespace. /src/docs/user/import.txt < https://reviews.apache.org/r/1975/#comment5254 > nit:Trailing whitespace. /src/docs/user/import.txt < https://reviews.apache.org/r/1975/#comment5255 > nit:Trailing whitespace. /src/java/com/cloudera/sqoop/SqoopOptions.java < https://reviews.apache.org/r/1975/#comment5294 > Nice to have: Consider storing these as properties object. Look at connectionParams as an example. /src/java/com/cloudera/sqoop/SqoopOptions.java < https://reviews.apache.org/r/1975/#comment5295 > Nice to have: Basic validations here. Perhaps a Class.forName() test to see if the type actually corresponds to an actual object? /src/java/com/cloudera/sqoop/hive/TableDefWriter.java < https://reviews.apache.org/r/1975/#comment5265 > Nice to have: Better to reword this to something like "No column by the name found while importing data". The user may not be aware of what a result set actually is. /src/java/com/cloudera/sqoop/orm/ClassWriter.java < https://reviews.apache.org/r/1975/#comment5293 > This method needs to be overwritten, otherwise the Avro Datafile format will likely not work. See AvroSchemaGenerator.generate() method for details. Arvind On 2011-10-02 08:52:10, Jarek Jarcec wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-10-02 08:52:10) Review request for Sqoop and Arvind Prabhakar. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: * Tests * Documentation * Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs ----- /src/docs/man/codegen-args.txt 1177987 /src/docs/man/hive-args.txt 1177987 /src/docs/man/import-args.txt 1177987 /src/docs/user/codegen-args.txt 1177987 /src/docs/user/codegen.txt 1177987 /src/docs/user/hive-args.txt 1177987 /src/docs/user/import.txt 1177987 /src/java/com/cloudera/sqoop/SqoopOptions.java 1177987 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1177987 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1177987 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1177987 /src/test/com/cloudera/sqoop/TestSqoopOptions.java 1177987 /src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 1177987 /src/test/com/cloudera/sqoop/orm/TestClassWriter.java 1177987 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/
        -----------------------------------------------------------

        (Updated 2011-10-02 08:52:10.819110)

        Review request for Sqoop and Arvind Prabhakar.

        Changes
        -------

        Here is updated version of my patch that includes functionality that we agreed on with tests and updated documentation.

        Summary
        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        • Tests
        • Documentation
        • Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.
        https://issues.apache.org/jira/browse/sqoop-342

        Diffs (updated)


        /src/docs/man/codegen-args.txt 1177987
        /src/docs/man/hive-args.txt 1177987
        /src/docs/man/import-args.txt 1177987
        /src/docs/user/codegen-args.txt 1177987
        /src/docs/user/codegen.txt 1177987
        /src/docs/user/hive-args.txt 1177987
        /src/docs/user/import.txt 1177987
        /src/java/com/cloudera/sqoop/SqoopOptions.java 1177987
        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1177987
        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1177987
        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1177987
        /src/test/com/cloudera/sqoop/TestSqoopOptions.java 1177987
        /src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 1177987
        /src/test/com/cloudera/sqoop/orm/TestClassWriter.java 1177987

        Diff: https://reviews.apache.org/r/1975/diff

        Testing
        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-10-02 08:52:10.819110) Review request for Sqoop and Arvind Prabhakar. Changes ------- Here is updated version of my patch that includes functionality that we agreed on with tests and updated documentation. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: Tests Documentation Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs (updated) /src/docs/man/codegen-args.txt 1177987 /src/docs/man/hive-args.txt 1177987 /src/docs/man/import-args.txt 1177987 /src/docs/user/codegen-args.txt 1177987 /src/docs/user/codegen.txt 1177987 /src/docs/user/hive-args.txt 1177987 /src/docs/user/import.txt 1177987 /src/java/com/cloudera/sqoop/SqoopOptions.java 1177987 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1177987 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1177987 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1177987 /src/test/com/cloudera/sqoop/TestSqoopOptions.java 1177987 /src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 1177987 /src/test/com/cloudera/sqoop/orm/TestClassWriter.java 1177987 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2011-09-23 02:04:19, Arvind Prabhakar wrote:

        > Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can:

        >

        > * Either map SQL types directly to Java/Hive types, or

        > * Map specific columns to Java/Hive types.

        >

        > Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems.

        >

        > Therefore I suggest the following:

        > * Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job.

        > * Introduce another new option that tells Sqoop to use a given mapping file for the job.

        >

        > So the typical workflow would be - if you want to run an import you would do the following:

        > * run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file

        > * manually modify the mapping file to override the default types where necessary

        > * run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file

        >

        > What do you think about this approach?

        Jarek Jarcec wrote:

        Hi Arvind,

        thank you for your review. I just noticed that I did not include basic help for my changes, so obviously you have to dig it into source code to find out what and how is working. I'm sorry for that. I'll not forgot to include basic help next time.

        I did not realized that SQL types can be mapped by extending Manager class. My original goal here was to offer user chance to change any type mapping out of the box, without any source code changes. SQL type mapping can be simulated by the column mapping (on precondition that user do know names of of all "problematic" columns), so I don't have problems to implement only the second part as you suggested.

        I think that in most cases the column names are known or user can force them to be known (for example by using "as": SELECT bla-lba-bla AS known_value), so I would say that user can pass the column names without reading the mapping in a normal situation. I don't see a problem to add a option that would generate mapping file that user can read to find out the names used by sqoop, but I would prefer to pass new mapping on command line rather than in separate file. Reason for that is than most of the times, I'm executing sqoop using oozie and in this case it's not easy to guarantee that given property file is located on all nodes.

        What do you think?

        Jarcec

        Arvind Prabhakar wrote:

        Thanks Jarek. I agree with your assessment that you could use projections to tie down the column names, although it will be come more verbose for users who simply want to import a table. I understand your preference of specifying it on the command line for Oozie integration - that makes sense.

        Overall - I think your approach to this is good and we can implement it that way. The one suggestion I have is that if a mapping cannot be applied, it should raise an exception that fails the job, as opposed to defaulting to the built-in mapping. For example if a user specifies a column name that does not exist - that should be an error.

        Jarek Jarcec wrote:

        Hi Arvind,

        let me summarize our conversation so that we're both on the same page. I'll change my patch to support only mapping based on column name which will be specified on command line. I'll throw an exception in case that some mapping will not be used.

        For the second part "creating mapping for columns and save it to output file", I would suggest to create new JIRA bug and place code for that to codegen tool. I think that it is better place for such functionality because import tool should be used for importing data. Stopping import just because user specified parameter "-generate-mapping" doesn't seem to me as good idea. On the other hand codegen tool is meant for generating code, so I would hack it to generate the mapping as well in case that "-generate-mapping" parameter is present.

        Do you have any objections?

        Jarcec

        I agree. Lets implement the type mapping on command line for specific column names that are specified. And error out if a mapping does not get used.

        Also, at this stage, the --generate-mapping does not need to be done. Unless this implementation falls short, there is really no point in doing the same thing in two different ways. So, for now, no need to create a follow-up JIRA or anything.

        Thanks for summarizing and confirming. It greatly reduces confusion!

        • Arvind

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/#review2032
        -----------------------------------------------------------

        On 2011-09-20 08:28:11, Jarek Jarcec wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/1975/

        -----------------------------------------------------------

        (Updated 2011-09-20 08:28:11)

        Review request for Sqoop and Arvind Prabhakar.

        Summary

        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        * Tests

        * Documentation

        * Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.

        https://issues.apache.org/jira/browse/sqoop-342

        Diffs

        -----

        /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203

        /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203

        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203

        /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203

        /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203

        /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203

        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203

        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203

        Diff: https://reviews.apache.org/r/1975/diff

        Testing

        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - On 2011-09-23 02:04:19, Arvind Prabhakar wrote: > Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can: > > * Either map SQL types directly to Java/Hive types, or > * Map specific columns to Java/Hive types. > > Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems. > > Therefore I suggest the following: > * Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job. > * Introduce another new option that tells Sqoop to use a given mapping file for the job. > > So the typical workflow would be - if you want to run an import you would do the following: > * run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file > * manually modify the mapping file to override the default types where necessary > * run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file > > What do you think about this approach? Jarek Jarcec wrote: Hi Arvind, thank you for your review. I just noticed that I did not include basic help for my changes, so obviously you have to dig it into source code to find out what and how is working. I'm sorry for that. I'll not forgot to include basic help next time. I did not realized that SQL types can be mapped by extending Manager class. My original goal here was to offer user chance to change any type mapping out of the box, without any source code changes. SQL type mapping can be simulated by the column mapping (on precondition that user do know names of of all "problematic" columns), so I don't have problems to implement only the second part as you suggested. I think that in most cases the column names are known or user can force them to be known (for example by using "as": SELECT bla-lba-bla AS known_value), so I would say that user can pass the column names without reading the mapping in a normal situation. I don't see a problem to add a option that would generate mapping file that user can read to find out the names used by sqoop, but I would prefer to pass new mapping on command line rather than in separate file. Reason for that is than most of the times, I'm executing sqoop using oozie and in this case it's not easy to guarantee that given property file is located on all nodes. What do you think? Jarcec Arvind Prabhakar wrote: Thanks Jarek. I agree with your assessment that you could use projections to tie down the column names, although it will be come more verbose for users who simply want to import a table. I understand your preference of specifying it on the command line for Oozie integration - that makes sense. Overall - I think your approach to this is good and we can implement it that way. The one suggestion I have is that if a mapping cannot be applied, it should raise an exception that fails the job, as opposed to defaulting to the built-in mapping. For example if a user specifies a column name that does not exist - that should be an error. Jarek Jarcec wrote: Hi Arvind, let me summarize our conversation so that we're both on the same page. I'll change my patch to support only mapping based on column name which will be specified on command line. I'll throw an exception in case that some mapping will not be used. For the second part "creating mapping for columns and save it to output file", I would suggest to create new JIRA bug and place code for that to codegen tool. I think that it is better place for such functionality because import tool should be used for importing data. Stopping import just because user specified parameter "- generate-mapping" doesn't seem to me as good idea. On the other hand codegen tool is meant for generating code, so I would hack it to generate the mapping as well in case that " -generate-mapping" parameter is present. Do you have any objections? Jarcec I agree. Lets implement the type mapping on command line for specific column names that are specified. And error out if a mapping does not get used. Also, at this stage, the --generate-mapping does not need to be done. Unless this implementation falls short, there is really no point in doing the same thing in two different ways. So, for now, no need to create a follow-up JIRA or anything. Thanks for summarizing and confirming. It greatly reduces confusion! Arvind ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/#review2032 ----------------------------------------------------------- On 2011-09-20 08:28:11, Jarek Jarcec wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-09-20 08:28:11) Review request for Sqoop and Arvind Prabhakar. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: * Tests * Documentation * Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs ----- /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203 /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203 /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203 /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203 /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2011-09-23 02:04:19, Arvind Prabhakar wrote:

        > Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can:

        >

        > * Either map SQL types directly to Java/Hive types, or

        > * Map specific columns to Java/Hive types.

        >

        > Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems.

        >

        > Therefore I suggest the following:

        > * Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job.

        > * Introduce another new option that tells Sqoop to use a given mapping file for the job.

        >

        > So the typical workflow would be - if you want to run an import you would do the following:

        > * run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file

        > * manually modify the mapping file to override the default types where necessary

        > * run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file

        >

        > What do you think about this approach?

        Jarek Jarcec wrote:

        Hi Arvind,

        thank you for your review. I just noticed that I did not include basic help for my changes, so obviously you have to dig it into source code to find out what and how is working. I'm sorry for that. I'll not forgot to include basic help next time.

        I did not realized that SQL types can be mapped by extending Manager class. My original goal here was to offer user chance to change any type mapping out of the box, without any source code changes. SQL type mapping can be simulated by the column mapping (on precondition that user do know names of of all "problematic" columns), so I don't have problems to implement only the second part as you suggested.

        I think that in most cases the column names are known or user can force them to be known (for example by using "as": SELECT bla-lba-bla AS known_value), so I would say that user can pass the column names without reading the mapping in a normal situation. I don't see a problem to add a option that would generate mapping file that user can read to find out the names used by sqoop, but I would prefer to pass new mapping on command line rather than in separate file. Reason for that is than most of the times, I'm executing sqoop using oozie and in this case it's not easy to guarantee that given property file is located on all nodes.

        What do you think?

        Jarcec

        Arvind Prabhakar wrote:

        Thanks Jarek. I agree with your assessment that you could use projections to tie down the column names, although it will be come more verbose for users who simply want to import a table. I understand your preference of specifying it on the command line for Oozie integration - that makes sense.

        Overall - I think your approach to this is good and we can implement it that way. The one suggestion I have is that if a mapping cannot be applied, it should raise an exception that fails the job, as opposed to defaulting to the built-in mapping. For example if a user specifies a column name that does not exist - that should be an error.

        Hi Arvind,
        let me summarize our conversation so that we're both on the same page. I'll change my patch to support only mapping based on column name which will be specified on command line. I'll throw an exception in case that some mapping will not be used.

        For the second part "creating mapping for columns and save it to output file", I would suggest to create new JIRA bug and place code for that to codegen tool. I think that it is better place for such functionality because import tool should be used for importing data. Stopping import just because user specified parameter "-generate-mapping" doesn't seem to me as good idea. On the other hand codegen tool is meant for generating code, so I would hack it to generate the mapping as well in case that "-generate-mapping" parameter is present.

        Do you have any objections?

        Jarcec

        • Jarek

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/#review2032
        -----------------------------------------------------------

        On 2011-09-20 08:28:11, Jarek Jarcec wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/1975/

        -----------------------------------------------------------

        (Updated 2011-09-20 08:28:11)

        Review request for Sqoop and Arvind Prabhakar.

        Summary

        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        * Tests

        * Documentation

        * Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.

        https://issues.apache.org/jira/browse/sqoop-342

        Diffs

        -----

        /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203

        /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203

        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203

        /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203

        /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203

        /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203

        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203

        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203

        Diff: https://reviews.apache.org/r/1975/diff

        Testing

        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - On 2011-09-23 02:04:19, Arvind Prabhakar wrote: > Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can: > > * Either map SQL types directly to Java/Hive types, or > * Map specific columns to Java/Hive types. > > Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems. > > Therefore I suggest the following: > * Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job. > * Introduce another new option that tells Sqoop to use a given mapping file for the job. > > So the typical workflow would be - if you want to run an import you would do the following: > * run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file > * manually modify the mapping file to override the default types where necessary > * run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file > > What do you think about this approach? Jarek Jarcec wrote: Hi Arvind, thank you for your review. I just noticed that I did not include basic help for my changes, so obviously you have to dig it into source code to find out what and how is working. I'm sorry for that. I'll not forgot to include basic help next time. I did not realized that SQL types can be mapped by extending Manager class. My original goal here was to offer user chance to change any type mapping out of the box, without any source code changes. SQL type mapping can be simulated by the column mapping (on precondition that user do know names of of all "problematic" columns), so I don't have problems to implement only the second part as you suggested. I think that in most cases the column names are known or user can force them to be known (for example by using "as": SELECT bla-lba-bla AS known_value), so I would say that user can pass the column names without reading the mapping in a normal situation. I don't see a problem to add a option that would generate mapping file that user can read to find out the names used by sqoop, but I would prefer to pass new mapping on command line rather than in separate file. Reason for that is than most of the times, I'm executing sqoop using oozie and in this case it's not easy to guarantee that given property file is located on all nodes. What do you think? Jarcec Arvind Prabhakar wrote: Thanks Jarek. I agree with your assessment that you could use projections to tie down the column names, although it will be come more verbose for users who simply want to import a table. I understand your preference of specifying it on the command line for Oozie integration - that makes sense. Overall - I think your approach to this is good and we can implement it that way. The one suggestion I have is that if a mapping cannot be applied, it should raise an exception that fails the job, as opposed to defaulting to the built-in mapping. For example if a user specifies a column name that does not exist - that should be an error. Hi Arvind, let me summarize our conversation so that we're both on the same page. I'll change my patch to support only mapping based on column name which will be specified on command line. I'll throw an exception in case that some mapping will not be used. For the second part "creating mapping for columns and save it to output file", I would suggest to create new JIRA bug and place code for that to codegen tool. I think that it is better place for such functionality because import tool should be used for importing data. Stopping import just because user specified parameter "- generate-mapping" doesn't seem to me as good idea. On the other hand codegen tool is meant for generating code, so I would hack it to generate the mapping as well in case that " -generate-mapping" parameter is present. Do you have any objections? Jarcec Jarek ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/#review2032 ----------------------------------------------------------- On 2011-09-20 08:28:11, Jarek Jarcec wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-09-20 08:28:11) Review request for Sqoop and Arvind Prabhakar. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: * Tests * Documentation * Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs ----- /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203 /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203 /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203 /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203 /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2011-09-23 02:04:19, Arvind Prabhakar wrote:

        > Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can:

        >

        > * Either map SQL types directly to Java/Hive types, or

        > * Map specific columns to Java/Hive types.

        >

        > Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems.

        >

        > Therefore I suggest the following:

        > * Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job.

        > * Introduce another new option that tells Sqoop to use a given mapping file for the job.

        >

        > So the typical workflow would be - if you want to run an import you would do the following:

        > * run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file

        > * manually modify the mapping file to override the default types where necessary

        > * run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file

        >

        > What do you think about this approach?

        Jarek Jarcec wrote:

        Hi Arvind,

        thank you for your review. I just noticed that I did not include basic help for my changes, so obviously you have to dig it into source code to find out what and how is working. I'm sorry for that. I'll not forgot to include basic help next time.

        I did not realized that SQL types can be mapped by extending Manager class. My original goal here was to offer user chance to change any type mapping out of the box, without any source code changes. SQL type mapping can be simulated by the column mapping (on precondition that user do know names of of all "problematic" columns), so I don't have problems to implement only the second part as you suggested.

        I think that in most cases the column names are known or user can force them to be known (for example by using "as": SELECT bla-lba-bla AS known_value), so I would say that user can pass the column names without reading the mapping in a normal situation. I don't see a problem to add a option that would generate mapping file that user can read to find out the names used by sqoop, but I would prefer to pass new mapping on command line rather than in separate file. Reason for that is than most of the times, I'm executing sqoop using oozie and in this case it's not easy to guarantee that given property file is located on all nodes.

        What do you think?

        Jarcec

        Thanks Jarek. I agree with your assessment that you could use projections to tie down the column names, although it will be come more verbose for users who simply want to import a table. I understand your preference of specifying it on the command line for Oozie integration - that makes sense.

        Overall - I think your approach to this is good and we can implement it that way. The one suggestion I have is that if a mapping cannot be applied, it should raise an exception that fails the job, as opposed to defaulting to the built-in mapping. For example if a user specifies a column name that does not exist - that should be an error.

        • Arvind

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/#review2032
        -----------------------------------------------------------

        On 2011-09-20 08:28:11, Jarek Jarcec wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/1975/

        -----------------------------------------------------------

        (Updated 2011-09-20 08:28:11)

        Review request for Sqoop and Arvind Prabhakar.

        Summary

        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        * Tests

        * Documentation

        * Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.

        https://issues.apache.org/jira/browse/sqoop-342

        Diffs

        -----

        /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203

        /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203

        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203

        /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203

        /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203

        /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203

        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203

        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203

        Diff: https://reviews.apache.org/r/1975/diff

        Testing

        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - On 2011-09-23 02:04:19, Arvind Prabhakar wrote: > Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can: > > * Either map SQL types directly to Java/Hive types, or > * Map specific columns to Java/Hive types. > > Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems. > > Therefore I suggest the following: > * Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job. > * Introduce another new option that tells Sqoop to use a given mapping file for the job. > > So the typical workflow would be - if you want to run an import you would do the following: > * run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file > * manually modify the mapping file to override the default types where necessary > * run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file > > What do you think about this approach? Jarek Jarcec wrote: Hi Arvind, thank you for your review. I just noticed that I did not include basic help for my changes, so obviously you have to dig it into source code to find out what and how is working. I'm sorry for that. I'll not forgot to include basic help next time. I did not realized that SQL types can be mapped by extending Manager class. My original goal here was to offer user chance to change any type mapping out of the box, without any source code changes. SQL type mapping can be simulated by the column mapping (on precondition that user do know names of of all "problematic" columns), so I don't have problems to implement only the second part as you suggested. I think that in most cases the column names are known or user can force them to be known (for example by using "as": SELECT bla-lba-bla AS known_value), so I would say that user can pass the column names without reading the mapping in a normal situation. I don't see a problem to add a option that would generate mapping file that user can read to find out the names used by sqoop, but I would prefer to pass new mapping on command line rather than in separate file. Reason for that is than most of the times, I'm executing sqoop using oozie and in this case it's not easy to guarantee that given property file is located on all nodes. What do you think? Jarcec Thanks Jarek. I agree with your assessment that you could use projections to tie down the column names, although it will be come more verbose for users who simply want to import a table. I understand your preference of specifying it on the command line for Oozie integration - that makes sense. Overall - I think your approach to this is good and we can implement it that way. The one suggestion I have is that if a mapping cannot be applied, it should raise an exception that fails the job, as opposed to defaulting to the built-in mapping. For example if a user specifies a column name that does not exist - that should be an error. Arvind ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/#review2032 ----------------------------------------------------------- On 2011-09-20 08:28:11, Jarek Jarcec wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-09-20 08:28:11) Review request for Sqoop and Arvind Prabhakar. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: * Tests * Documentation * Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs ----- /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203 /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203 /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203 /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203 /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2011-09-23 02:04:19, Arvind Prabhakar wrote:

        > Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can:

        >

        > * Either map SQL types directly to Java/Hive types, or

        > * Map specific columns to Java/Hive types.

        >

        > Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems.

        >

        > Therefore I suggest the following:

        > * Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job.

        > * Introduce another new option that tells Sqoop to use a given mapping file for the job.

        >

        > So the typical workflow would be - if you want to run an import you would do the following:

        > * run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file

        > * manually modify the mapping file to override the default types where necessary

        > * run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file

        >

        > What do you think about this approach?

        Hi Arvind,
        thank you for your review. I just noticed that I did not include basic help for my changes, so obviously you have to dig it into source code to find out what and how is working. I'm sorry for that. I'll not forgot to include basic help next time.

        I did not realized that SQL types can be mapped by extending Manager class. My original goal here was to offer user chance to change any type mapping out of the box, without any source code changes. SQL type mapping can be simulated by the column mapping (on precondition that user do know names of of all "problematic" columns), so I don't have problems to implement only the second part as you suggested.

        I think that in most cases the column names are known or user can force them to be known (for example by using "as": SELECT bla-lba-bla AS known_value), so I would say that user can pass the column names without reading the mapping in a normal situation. I don't see a problem to add a option that would generate mapping file that user can read to find out the names used by sqoop, but I would prefer to pass new mapping on command line rather than in separate file. Reason for that is than most of the times, I'm executing sqoop using oozie and in this case it's not easy to guarantee that given property file is located on all nodes.

        What do you think?

        Jarcec

        • Jarek

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/#review2032
        -----------------------------------------------------------

        On 2011-09-20 08:28:11, Jarek Jarcec wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/1975/

        -----------------------------------------------------------

        (Updated 2011-09-20 08:28:11)

        Review request for Sqoop and Arvind Prabhakar.

        Summary

        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        * Tests

        * Documentation

        * Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.

        https://issues.apache.org/jira/browse/sqoop-342

        Diffs

        -----

        /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203

        /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203

        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203

        /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203

        /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203

        /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203

        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203

        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203

        Diff: https://reviews.apache.org/r/1975/diff

        Testing

        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - On 2011-09-23 02:04:19, Arvind Prabhakar wrote: > Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can: > > * Either map SQL types directly to Java/Hive types, or > * Map specific columns to Java/Hive types. > > Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems. > > Therefore I suggest the following: > * Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job. > * Introduce another new option that tells Sqoop to use a given mapping file for the job. > > So the typical workflow would be - if you want to run an import you would do the following: > * run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file > * manually modify the mapping file to override the default types where necessary > * run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file > > What do you think about this approach? Hi Arvind, thank you for your review. I just noticed that I did not include basic help for my changes, so obviously you have to dig it into source code to find out what and how is working. I'm sorry for that. I'll not forgot to include basic help next time. I did not realized that SQL types can be mapped by extending Manager class. My original goal here was to offer user chance to change any type mapping out of the box, without any source code changes. SQL type mapping can be simulated by the column mapping (on precondition that user do know names of of all "problematic" columns), so I don't have problems to implement only the second part as you suggested. I think that in most cases the column names are known or user can force them to be known (for example by using "as": SELECT bla-lba-bla AS known_value), so I would say that user can pass the column names without reading the mapping in a normal situation. I don't see a problem to add a option that would generate mapping file that user can read to find out the names used by sqoop, but I would prefer to pass new mapping on command line rather than in separate file. Reason for that is than most of the times, I'm executing sqoop using oozie and in this case it's not easy to guarantee that given property file is located on all nodes. What do you think? Jarcec Jarek ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/#review2032 ----------------------------------------------------------- On 2011-09-20 08:28:11, Jarek Jarcec wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-09-20 08:28:11) Review request for Sqoop and Arvind Prabhakar. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: * Tests * Documentation * Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs ----- /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203 /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203 /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203 /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203 /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        YoungWoo Kim added a comment -

        I would like to suggest 'precision and scale aware' type mappings. As of now, sqoop generate type mappings with rules that just map sql types to hive types. For instance, one create a table which contains 'NUMBER(5,0)' column in Oracle, then the column is mapped as 'DOUBLE' in Hive. but it should be mapped 'INT'type for Hive. We can get precision and scale of resultset from JDBC metadata. so I believe that the type mappings from DB to Hive for JDBC NUMERIC and DECIMAL should be precision and scale aware.

        Show
        YoungWoo Kim added a comment - I would like to suggest 'precision and scale aware' type mappings. As of now, sqoop generate type mappings with rules that just map sql types to hive types. For instance, one create a table which contains 'NUMBER(5,0)' column in Oracle, then the column is mapped as 'DOUBLE' in Hive. but it should be mapped 'INT'type for Hive. We can get precision and scale of resultset from JDBC metadata. so I believe that the type mappings from DB to Hive for JDBC NUMERIC and DECIMAL should be precision and scale aware.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/#review2032
        -----------------------------------------------------------

        Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can:

        • Either map SQL types directly to Java/Hive types, or
        • Map specific columns to Java/Hive types.

        Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems.

        Therefore I suggest the following:

        • Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job.
        • Introduce another new option that tells Sqoop to use a given mapping file for the job.

        So the typical workflow would be - if you want to run an import you would do the following:

        • run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file
        • manually modify the mapping file to override the default types where necessary
        • run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file

        What do you think about this approach?

        • Arvind

        On 2011-09-20 08:28:11, Jarek Jarcec wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/1975/

        -----------------------------------------------------------

        (Updated 2011-09-20 08:28:11)

        Review request for Sqoop and Arvind Prabhakar.

        Summary

        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        * Tests

        * Documentation

        * Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.

        https://issues.apache.org/jira/browse/sqoop-342

        Diffs

        -----

        /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203

        /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203

        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203

        /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203

        /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203

        /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203

        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203

        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203

        Diff: https://reviews.apache.org/r/1975/diff

        Testing

        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/#review2032 ----------------------------------------------------------- Thanks for taking this up Jarek. This is a very important functionality extension for Sqoop. Looking at the code change, I feel that you have tried to implement the super-set functionality such that the user can: Either map SQL types directly to Java/Hive types, or Map specific columns to Java/Hive types. Between these two, I feel that the later is more relevant use-case for Sqoop consumers. Mapping SQL types to Java/Hive types can be done by extending the Manager and that in itself is not as flexible for the user as the other option of mapping specific columns to a data type. Even when considering the option to map columns to specific data types, the user may not necessarily know what column names Sqoop will use. If these column names do not match, the default mapping will be used silently and that could lead to other problems. Therefore I suggest the following: Introduce a new option that tells Sqoop to generate a mapping file for the job. This file could be a java properties file that contains the names of columns as read by Sqoop and their default mappings and does not run the actual job. Introduce another new option that tells Sqoop to use a given mapping file for the job. So the typical workflow would be - if you want to run an import you would do the following: run: sqoop import --connect .... --genrate-mapping-only /path/to/mapping-file manually modify the mapping file to override the default types where necessary run: sqoop import --connect .... --use-mapping-file /path/to/mapping-file What do you think about this approach? Arvind On 2011-09-20 08:28:11, Jarek Jarcec wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- (Updated 2011-09-20 08:28:11) Review request for Sqoop and Arvind Prabhakar. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: * Tests * Documentation * Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs ----- /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203 /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203 /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203 /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203 /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1975/
        -----------------------------------------------------------

        Review request for Sqoop and Arvind Prabhakar.

        Summary
        -------

        This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch.

        Things that are missing and I'll add them if this way will be accepted:

        • Tests
        • Documentation
        • Supporting for type names (so that user don't have to type the integer constants on command line)

        Any feedback will be greatly appreciated.

        This addresses bug sqoop-342.
        https://issues.apache.org/jira/browse/sqoop-342

        Diffs


        /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203
        /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203
        /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203
        /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203
        /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203
        /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203
        /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203
        /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203

        Diff: https://reviews.apache.org/r/1975/diff

        Testing
        -------

        Thanks,

        Jarek

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1975/ ----------------------------------------------------------- Review request for Sqoop and Arvind Prabhakar. Summary ------- This is not fully featured patch yet, it's more only preview of what I have in my mind when I created the bug and how would I image to solve it. I would like to check with community whether this is acceptable solution and if so, I'll finish the patch. Things that are missing and I'll add them if this way will be accepted: Tests Documentation Supporting for type names (so that user don't have to type the integer constants on command line) Any feedback will be greatly appreciated. This addresses bug sqoop-342. https://issues.apache.org/jira/browse/sqoop-342 Diffs /src/java/com/cloudera/sqoop/SqoopOptions.java 1172203 /src/java/com/cloudera/sqoop/hive/HiveTypes.java 1172203 /src/java/com/cloudera/sqoop/hive/TableDefWriter.java 1172203 /src/java/com/cloudera/sqoop/manager/ConnManager.java 1172203 /src/java/com/cloudera/sqoop/manager/SqlManager.java 1172203 /src/java/com/cloudera/sqoop/mapreduce/JdbcExportJob.java 1172203 /src/java/com/cloudera/sqoop/orm/ClassWriter.java 1172203 /src/java/com/cloudera/sqoop/tool/BaseSqoopTool.java 1172203 Diff: https://reviews.apache.org/r/1975/diff Testing ------- Thanks, Jarek

          People

          • Assignee:
            Jarek Jarcec Cecho
            Reporter:
            Jarek Jarcec Cecho
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development