Sqoop / SQOOP-318

Add support for splittable lzo files with Hive

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.4.0-incubating
    • Component/s: hive-integration
    • Labels:
      None

      Description

      When importing LZO-compressed files into Hive, it would be useful to create the Hive table with the com.hadoop.mapred.DeprecatedLzoTextInputFormat input format. It would also be nice to automatically run the DistributedLzoIndexer so that the LZO files can be split.

      1. SQOOP-318-2.patch
        8 kB
        Joey Echeverria
      2. SQOOP-318-1.patch
        6 kB
        Joey Echeverria

        Activity

        Hudson added a comment -

        Integrated in Sqoop-jdk-1.6 #16 (See https://builds.apache.org/job/Sqoop-jdk-1.6/16/)
        SQOOP-318. Support splittable LZO files with Hive.

        (Joey Echeverria via Arvind Prabhakar)

        arvind : http://svn.apache.org/viewvc/?view=rev&rev=1160815
        Files :

        • /incubator/sqoop/trunk/src/docs/user/hive.txt
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/io/CodecMap.java
        • /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/hive/TableDefWriter.java
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/hive/HiveImport.java
        Arvind Prabhakar made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Fix Version/s: (none) -> 1.4.0 [ 12317345 ]
        Resolution: (none) -> Fixed [ 1 ]
        Arvind Prabhakar added a comment -

        Patch committed. Thanks Joey!

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1597/#review1599
        -----------------------------------------------------------

        Ship it!

        +1

        • Arvind

        On 2011-08-22 23:01:36, Joey Echeverria wrote:

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1597/
        -----------------------------------------------------------

        (Updated 2011-08-22 23:01:36)

        Review request for Sqoop.

        Summary
        -------

        I added a check when generating the create table string to see if the LzopCodec is in use. If it is, it outputs

        STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

        at the end of the create table command, otherwise it outputs the standard

        STORED AS TEXTFILE

        I also added a call to the DistributedLzoIndexer before the data is imported into Hive.

        This addresses bug SQOOP-318.
        https://issues.apache.org/jira/browse/SQOOP-318

        Diffs
        -----

        src/docs/user/hive.txt 059d7cb
        src/java/com/cloudera/sqoop/hive/HiveImport.java 36c17ba
        src/java/com/cloudera/sqoop/hive/TableDefWriter.java 7dd9135
        src/java/com/cloudera/sqoop/io/CodecMap.java 8564164
        src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 43b755e

        Diff: https://reviews.apache.org/r/1597/diff

        Testing
        -------

        It includes a test for the create table syntax. I manually tested calling the indexer. I'm not sure how to automate that without making LZO required to build.

        Thanks,

        Joey

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1597/
        -----------------------------------------------------------

        (Updated 2011-08-22 23:01:36.319406)

        Review request for Sqoop.

        Changes
        -------

        I added lzop to the CodecMap and modified the tests to reference the codec with the short name. I added a blurb at the end of the Hive documentation describing the splitting you get with the lzop codec. I also fixed the checkstyle issues.

        Summary
        -------

        I added a check when generating the create table string to see if the LzopCodec is in use. If it is, it outputs

        STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

        at the end of the create table command, otherwise it outputs the standard

        STORED AS TEXTFILE
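
The codec check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class name `StorageClauseSketch` and the method `storageClause` are hypothetical, and it is assumed that the codec is available as a fully qualified class name string.

```java
// Hypothetical sketch of the storage-clause decision; the real logic lives in
// TableDefWriter and may be structured differently.
final class StorageClauseSketch {
    // Codec class name taken from the review comments on this issue.
    static final String LZOP_CODEC = "com.hadoop.compression.lzo.LzopCodec";

    // Returns the clause appended to the end of the CREATE TABLE statement.
    static String storageClause(String codecClassName) {
        if (LZOP_CODEC.equals(codecClassName)) {
            // LZO-aware input format so that indexed .lzo files can be split.
            return "STORED AS INPUTFORMAT "
                + "\"com.hadoop.mapred.DeprecatedLzoTextInputFormat\" "
                + "OUTPUTFORMAT "
                + "\"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\"";
        }
        // Any other codec (or none) keeps the standard text table definition.
        return "STORED AS TEXTFILE";
    }

    public static void main(String[] args) {
        System.out.println(storageClause(LZOP_CODEC));
        System.out.println(storageClause(null));
    }
}
```

Comparing against the constant with `LZOP_CODEC.equals(...)` keeps the method null-safe when no compression codec is configured.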

        I also added a call to the DistributedLzoIndexer before the data is imported into Hive.
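
How the patch invokes the indexer is not shown in this thread. As a rough illustration only, the sketch below builds the equivalent `hadoop jar` command line for the indexer class from the hadoop-lzo project; the class `LzoIndexerCommandSketch`, the method `indexerCommand`, and both example paths are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: builds (but does not run) a command line that would
// index the LZO files under an import directory before the Hive load.
final class LzoIndexerCommandSketch {
    // Indexer class provided by the hadoop-lzo project.
    static final String INDEXER_CLASS =
        "com.hadoop.compression.lzo.DistributedLzoIndexer";

    // hadoopLzoJar and importDir are assumptions supplied by the caller.
    static List<String> indexerCommand(String hadoopLzoJar, String importDir) {
        return Arrays.asList("hadoop", "jar", hadoopLzoJar, INDEXER_CLASS, importDir);
    }

    public static void main(String[] args) {
        // Example paths are placeholders, not values from the patch.
        List<String> cmd = indexerCommand("/path/to/hadoop-lzo.jar", "/user/example/import_dir");
        System.out.println(String.join(" ", cmd));
    }
}
```

Building the argument list separately from running it keeps the step checkable without LZO installed, which is the difficulty the Testing section mentions.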

        This addresses bug SQOOP-318.
        https://issues.apache.org/jira/browse/SQOOP-318

        Diffs (updated)


        src/docs/user/hive.txt 059d7cb
        src/java/com/cloudera/sqoop/hive/HiveImport.java 36c17ba
        src/java/com/cloudera/sqoop/hive/TableDefWriter.java 7dd9135
        src/java/com/cloudera/sqoop/io/CodecMap.java 8564164
        src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 43b755e

        Diff: https://reviews.apache.org/r/1597/diff

        Testing
        -------

        It includes a test for the create table syntax. I manually tested calling the indexer. I'm not sure how to automate that without making LZO required to build.

        Thanks,

        Joey

        Joey Echeverria made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Joey Echeverria made changes -
        Attachment: (none) -> SQOOP-318-2.patch [ 12491286 ]
        Joey Echeverria added a comment -

        Implemented recommendations made on review board.

        Joey Echeverria made changes -
        Status: Patch Available [ 10002 ] -> Open [ 1 ]
        Joey Echeverria added a comment -

        Canceling first patch.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1597/#review1563
        -----------------------------------------------------------

        Great patch Joey! I do have a high-level suggestion of adding a mapping to alias "lzop" to the codec "com.hadoop.compression.lzo.LzopCodec" in com.cloudera.sqoop.io.CodecMap implementation. If you do that, it is likely that the tests you have added in HiveImport and TableDefWriter will have to be modified in order to accommodate the use of the alias.
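
The alias suggested here could look something like the sketch below. It is only an illustration of a short-name lookup: the class `CodecAliasSketch` and the method `resolve` are hypothetical and are not the actual `CodecMap` API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a short-name to codec-class mapping, in the spirit
// of the "lzop" alias proposed for com.cloudera.sqoop.io.CodecMap.
final class CodecAliasSketch {
    private static final Map<String, String> ALIASES = new HashMap<>();

    static {
        // Only the alias discussed in this review is registered here.
        ALIASES.put("lzop", "com.hadoop.compression.lzo.LzopCodec");
    }

    // Returns the full class name for a known alias; any other value is
    // assumed to already be a class name and is passed through unchanged.
    static String resolve(String nameOrAlias) {
        if (nameOrAlias == null) {
            return null;
        }
        String full = ALIASES.get(nameOrAlias);
        return full != null ? full : nameOrAlias;
    }

    public static void main(String[] args) {
        System.out.println(resolve("lzop"));
    }
}
```

With such a mapping, tests can refer to the codec by its short name while the table-definition check still compares full class names.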

        Also, it would be great to have a blurb about this in the user guide under src/docs/user.

        Some minor checkstyle issues noted below.

        src/java/com/cloudera/sqoop/hive/HiveImport.java
        <https://reviews.apache.org/r/1597/#comment3536>

        Indent.

        src/java/com/cloudera/sqoop/hive/HiveImport.java
        <https://reviews.apache.org/r/1597/#comment3537>

        Line longer than 80.

        src/java/com/cloudera/sqoop/hive/HiveImport.java
        <https://reviews.apache.org/r/1597/#comment3538>

        Line longer than 80.

        src/java/com/cloudera/sqoop/hive/TableDefWriter.java
        <https://reviews.apache.org/r/1597/#comment3539>

        Lines longer than 80.

        • Arvind

        On 2011-08-19 18:49:06, Joey Echeverria wrote:

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1597/
        -----------------------------------------------------------

        (Updated 2011-08-19 18:49:06)

        Review request for Sqoop.

        Summary
        -------

        I added a check when generating the create table string to see if the LzopCodec is in use. If it is, it outputs

        STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

        at the end of the create table command, otherwise it outputs the standard

        STORED AS TEXTFILE

        I also added a call to the DistributedLzoIndexer before the data is imported into Hive.

        This addresses bug SQOOP-318.
        https://issues.apache.org/jira/browse/SQOOP-318

        Diffs
        -----

        src/java/com/cloudera/sqoop/hive/HiveImport.java 36c17ba
        src/java/com/cloudera/sqoop/hive/TableDefWriter.java 7dd9135
        src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 43b755e

        Diff: https://reviews.apache.org/r/1597/diff

        Testing
        -------

        It includes a test for the create table syntax. I manually tested calling the indexer. I'm not sure how to automate that without making LZO required to build.

        Thanks,

        Joey

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1597/
        -----------------------------------------------------------

        Review request for Sqoop.

        Summary
        -------

        I added a check when generating the create table string to see if the LzopCodec is in use. If it is, it outputs

        STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

        at the end of the create table command, otherwise it outputs the standard

        STORED AS TEXTFILE

        I also added a call to the DistributedLzoIndexer before the data is imported into Hive.

        This addresses bug SQOOP-318.
        https://issues.apache.org/jira/browse/SQOOP-318

        Diffs


        src/java/com/cloudera/sqoop/hive/HiveImport.java 36c17ba
        src/java/com/cloudera/sqoop/hive/TableDefWriter.java 7dd9135
        src/test/com/cloudera/sqoop/hive/TestTableDefWriter.java 43b755e

        Diff: https://reviews.apache.org/r/1597/diff

        Testing
        -------

        It includes a test for the create table syntax. I manually tested calling the indexer. I'm not sure how to automate that without making LZO required to build.

        Thanks,

        Joey

        Arvind Prabhakar made changes -
        Assignee: (none) -> Joey Echeverria [ fwiffo ]
        Joey Echeverria made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Affects Version/s: (none) -> 1.3.0 [ 12317344 ]
        Joey Echeverria made changes -
        Attachment: (none) -> SQOOP-318-1.patch [ 12490908 ]
        Joey Echeverria added a comment -

        Here's my first cut at a patch. It includes a test for the create table syntax. I manually tested calling the indexer. I'm not sure how to automate that without making LZO required to build.

        Joey Echeverria created issue -

          People

          • Assignee: Joey Echeverria
          • Reporter: Joey Echeverria
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:
