HBase
HBASE-5741

ImportTsv does not check for table existence

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.4
    • Fix Version/s: 0.94.1, 0.95.0
    • Component/s: mapreduce
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The usage statement for the "importtsv" command in HBase claims this:

      "Note: if you do not use this option, then the target table must already exist in HBase" (in reference to the "importtsv.bulk.output" command-line option)

      The truth is, the table must exist no matter what; importtsv cannot and will not create it for you.

      This is the case because the createSubmittableJob method of ImportTsv does not even attempt to check if the table exists already, much less create it:

      (From org.apache.hadoop.hbase.mapreduce.ImportTsv.java)

      305 HTable table = new HTable(conf, tableName);

      The HTable method signature in use there assumes the table exists and runs a meta scan on it:

      (From org.apache.hadoop.hbase.client.HTable.java)

      142 * Creates an object to access a HBase table.
      ...
      151 public HTable(Configuration conf, final String tableName)

      What we should do inside of createSubmittableJob is something similar to what the "completebulkload" command does:

      (Taken from org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.java)

      690 boolean tableExists = this.doesTableExist(tableName);
      691 if (!tableExists) this.createTable(tableName,dirPath);

      Currently the docs are misleading: the table in fact must exist prior to running importtsv. We should check whether it exists rather than assume it is there and throw the exception below:

      12/03/14 17:15:42 WARN client.HConnectionManager$HConnectionImplementation: Encountered problems when prefetch META table:
      org.apache.hadoop.hbase.TableNotFoundException: Cannot find row in .META. for table: myTable2, row=myTable2,,99999999999999
      at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
      ...
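The guard the report asks for can be sketched as below. This is a minimal model, not HBase code: the Set stands in for the cluster's table catalog, and the actual fix would call HBaseAdmin.tableExists(tableName) and HBaseAdmin.createTable(...) inside createSubmittableJob before constructing the HTable.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the check-then-create guard proposed above. The Set stands in
// for the cluster's table catalog; the real implementation would use
// HBaseAdmin.tableExists(tableName) and HBaseAdmin.createTable(...).
public class EnsureTable {
    private final Set<String> catalog = new HashSet<>();

    boolean tableExists(String name) {
        return catalog.contains(name);
    }

    void createTable(String name) {
        catalog.add(name);
    }

    // Mirrors the LoadIncrementalHFiles pattern: create only when absent.
    void ensureTable(String name) {
        if (!tableExists(name)) {
            createTable(name);
        }
    }
}
```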

      1. 5741-94.txt
        5 kB
        Ted Yu
      2. 5741-v3.txt
        5 kB
        Ted Yu
      3. HBase-5741.patch
        5 kB
        Himanshu Vashishtha
      4. HBase-5741-v2.patch
        5 kB
        Himanshu Vashishtha

        Activity

        Himanshu Vashishtha added a comment -

        Yes, the javadoc is only half correct when it says that "the table must exist in HBase if you are not using the option importtsv.bulk.output".
        The behavior, in the absence of that property, is to do inserts into the provided table, so it throws an exception if the table doesn't exist.

        When we use this option, importtsv tries to configure the job by reading all the regions of the table and then using the start keys of those regions as partition markers for the TotalOrderPartitioner class. This usage drops the first start row (which is always EMPTY_BYTE_ARRAY), so even if we create a new table, it will be useless for configuring the job.
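To make the point concrete, here is a small sketch (a hypothetical helper, not HBase code) of the split-point derivation described above: region start keys become partition markers minus the leading empty key, so a freshly created single-region table contributes no split points at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the partition-marker derivation: region start keys
// become TotalOrderPartitioner split points, minus the first key, which is
// always empty (EMPTY_BYTE_ARRAY).
public class SplitPoints {
    static List<byte[]> fromStartKeys(List<byte[]> regionStartKeys) {
        // Drop the leading empty start key of the first region.
        return new ArrayList<>(regionStartKeys.subList(1, regionStartKeys.size()));
    }
}
```

A new table has exactly one region, so the derived split-point list is empty, which is why creating the table on the fly does not help configure the job.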

        For the case of LoadIncrementalHFiles, the directory containing the HFiles is assumed to follow a specific layout:
        <tableName>/<cf1>/<listOfHFiles>
        <tableName>/<cf2>/<listOfHFiles>

        where the HFiles for each column family are stored in a separate sub-directory.

        If we do create a table based on the parameters given to the importtsv command, it will not be useful for the bulk-load use case, because the importtsv job dumps HFiles based on rows, so all column families for a specific row land in one HFile.

        It would be great to know if we actually use this workflow: create HFiles with an importtsv job, then use bulkload to insert those HFiles into an HBase table.

        I think we should change the javadoc.

        Please let me know if you have any questions; hbase-mapreduce use cases are exciting.

        Himanshu Vashishtha added a comment -

        btw, by "making an attempt to read all the regions of this table and then use the start keys of these regions", I mean get the start rows for all the existing regions of the table and then use them for job configuration.

        Clint Heath added a comment -

        Thanks for the feedback. I have customers who are using the workflow you mentioned. This is how I discovered the confusing javadoc reference.

        Himanshu Vashishtha added a comment -

        You are right. I am on it now. Thanks.

        Himanshu Vashishtha added a comment -

        A patch for checking for, and creating, a non-existent table in HBase when the bulk output option is used. Added a test in TestImportTsv. Please comment.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4700/
        -----------------------------------------------------------

        Review request for hbase.

        Summary
        -------

        There is a bulk output option in the importtsv workload. It outputs the HFiles into a user-defined directory. The current code assumes that a table with a name equal to the given output directory exists, and throws an exception otherwise. Here is a patch that creates the table in case it doesn't exist.

        This addresses bug HBase-5741.
        https://issues.apache.org/jira/browse/HBase-5741

        Diffs


        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java ab22fc4
        src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java ac30a62

        Diff: https://reviews.apache.org/r/4700/diff

        Testing
        -------

        Added a new test for bulkoutput; All importtsv tests pass.

        Thanks,

        Himanshu

        Himanshu Vashishtha added a comment -

        Uploaded the patch on rb;
        https://reviews.apache.org/r/4700/

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4700/#review6857
        -----------------------------------------------------------

        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15254>

        I think the new method should be called doesTableExist() or tableExists()

        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15256>

        Insert space between for and (.

        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15257>

        Insert space between if and (.

        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15255>

        Insert one leading space before cfSet

        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15258>

        Can this method be package private ?
        Since HBaseAdmin is created inside, a better name would be createHBaseAdmin().

        src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15259>

        Insert space between if and (.

        src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15260>

        Insert space between else and {.

        • Ted


        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4700/
        -----------------------------------------------------------

        (Updated 2012-04-11 19:01:39.660787)

        Review request for hbase.

        Changes
        -------

        Changes done as per Ted.

        Summary
        -------

        There is a bulk output option in the importtsv workload. It outputs the HFiles into a user-defined directory. The current code assumes that a table with a name equal to the given output directory exists, and throws an exception otherwise. Here is a patch that creates the table in case it doesn't exist.

        This addresses bug HBase-5741.
        https://issues.apache.org/jira/browse/HBase-5741

        Diffs (updated)


        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java ab22fc4
        src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java ac30a62

        Diff: https://reviews.apache.org/r/4700/diff

        Testing
        -------

        Added a new test for bulkoutput; All importtsv tests pass.

        Thanks,

        Himanshu

        Himanshu Vashishtha added a comment -

        Patch version 2 after rb.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12522299/HBase-5741-v2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1483//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1483//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1483//console

        This message is automatically generated.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4700/#review6860
        -----------------------------------------------------------

        Ship it!

        Please attach new patch to JIRA after addressing minor comments.

        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15276>

        The parentheses around hbaseAdmin.tableExists() are not needed.

        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15277>

        Move this to line 263.

        src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        <https://reviews.apache.org/r/4700/#comment15278>

        'bothered about' -> 'concerned with'

        • Ted


        Ted Yu added a comment -

        Patch v3 addresses latest review comments.

        Ted Yu added a comment -

        Patch for 0.94 where TestImportTsv passed.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12522330/5741-v3.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1485//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1485//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1485//console

        This message is automatically generated.

        Ted Yu added a comment -

        I will integrate the patches to trunk and 0.94 tomorrow morning if there is no objection.

        Lars Hofhansl added a comment -

        Sorry for chiming in late here, but do we actually want ImportTsv to create the table for us?
        Personally I'd rather have it fail, which forces me to think about how I want to create the table (pre-split, compression, etc.).

        Ted Yu added a comment -

        How about adding an option to ImportTsv, defaulting to false, that allows the user to auto-create the table?
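Such a flag might be read as sketched below. The option name "importtsv.create.table" is hypothetical, and a plain Map stands in for the Hadoop Configuration object:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an opt-in auto-create flag that defaults to false, so the
// existing fail-if-missing behavior is preserved unless the user asks.
public class CreateTableOption {
    static boolean shouldCreateTable(Map<String, String> conf) {
        // Hypothetical option name; absent means false.
        return Boolean.parseBoolean(conf.getOrDefault("importtsv.create.table", "false"));
    }
}
```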

        Lars Hofhansl added a comment -

        Import does not do that either. At least we should be consistent between the various importing tools.
        It seems overkill to burden these tools with that.

        That said, I am not opposed, just questioning the usefulness.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12522343/5741-v3.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1487//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1487//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1487//console

        This message is automatically generated.

        Himanshu Vashishtha added a comment -

        As Clint mentioned, there are some workloads that use ImportTsv to create HFiles, which are later used in the bulk import method; that was my motivation for the patch.

        What do you suggest, Lars: failing importtsv, or adding a similar thing to the Import tool too? I think the latter makes it consistent.

        Lars Hofhansl added a comment - - edited

        Even HFileOutputFormat uses the table in HBase to find out about splits. We will rarely bulk import into an empty table that is not pre-split...? (I'm assuming here; you folks at Cloudera see many more use cases than I do.)

        If you believe strongly that we should add this, let's do it.
        We can do Import in a separate patch, or not do that, as we see fit.

        Lars Hofhansl added a comment -

        Moving to 0.94.1, since we're still discussing (again, sorry about the late chime in)

        Himanshu Vashishtha added a comment -

        I am waiting for Clint to give me some numbers on such use cases to make the case. (Apparently he is on vacation these days and will be back this Monday.)
        Thanks.

        Clint Heath added a comment -

        Lars, et al, thank you for your questions and comments. I may be missing something, so please correct me if I'm wrong, but here's how I see the situation:

        1) importtsv will auto-split the input data into region-sized HFiles, based on hbase.hregion.max.filesize (because we use the HFileOutputFormat and it checks this parameter from the client configs).

        2) completebulkload then distributes the resulting HFiles to the region servers, thereby populating your cluster with all the regions needed to host your data. This alleviates the memstore flushes, compactions, splits, and other overhead associated with running billions of individual put()s to incrementally load your data. This is why we recommend the bulk load process to initially populate HBase tables.

        3) pre-splitting your table (based on rowkeys, as is the only mechanism) is useless in this scenario, because it will not take into account the size of data associated with each key. The customer would literally have to walk through their data and determine how many Gigs of data are associated with each rowkey in order to preemptively split their tables in a bulk load scenario...and even then, HBase only takes those pre-splits as "suggestions" and if the data ends up needing to be split differently, that will happen automatically. So what purpose does pre-splitting serve in a bulk load? Think "initial load" of massive amounts of data.

        4) Lars had a good point about compression: yes, a customer should typically set up their table with compression first, but they often don't know this, and they can enable it after the fact without consequence.

        In conclusion, we provide this handy command-line tool to help our HBase customers (most of whom are complete newbies and don't know what they're doing) get their HBase tables up and running, yet we give them conflicting information about how to use the tool. If we need to tell them to pre-create the table, then we should tell them so...clearly. However, I think it is just simpler and more intuitive if the tool did that automatically, as the javadocs indicate it will. Since region splits are determined internally by the HFileOutputFormat anyway (if they use the bulkload option), what's the harm? This enables people to get their data loaded into HBase with the least amount of education, ramp-up, java development, data mining, etc., etc.
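
        The two-phase bulk load described above can be sketched as follows (the table name, column mapping, and HDFS paths are illustrative; the class names are the real 0.9x-era entry points):

        ```shell
        # Phase 1: parse the TSV input and write HFiles instead of issuing puts.
        # -Dimporttsv.bulk.output switches ImportTsv into HFile-generation mode.
        hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
          -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1 \
          -Dimporttsv.bulk.output=/tmp/myTable-hfiles \
          myTable /user/me/input.tsv

        # Phase 2: hand the generated HFiles to the region servers of the
        # target table, bypassing the normal write path entirely.
        hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
          /tmp/myTable-hfiles myTable
        ```

        Both commands require a running HBase cluster, which is exactly why the question of who creates `myTable` matters: as this issue notes, phase 1 also fails today if the table is missing.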

        Lars Hofhansl added a comment -

        Thanks Clint. No harm, I just think it will hide complexity that should not be hidden. I also don't think it's too much to ask a user to create the table first (but we should definitely fix the Javadoc in that case).

        That all said, I am +-0 on this. It's a simple change, and v3 looks good.

        Ted Yu added a comment -

        Integrated to trunk.

        Thanks for the patch Himanshu.

        Will wait for 0.94.0 to come out before integrating to 0.94

        Hudson added a comment -

        Integrated in HBase-TRUNK-security #174 (See https://builds.apache.org/job/HBase-TRUNK-security/174/)
        HBASE-5741 ImportTsv does not check for table existence (Himanshu) (Revision 1327338)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
        Ted Yu added a comment -

        Integrated to 0.94 as well.

        Hudson added a comment -

        Integrated in HBase-0.94 #133 (See https://builds.apache.org/job/HBase-0.94/133/)
        HBASE-5741 ImportTsv does not check for table existence (Himanshu) (Revision 1328032)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #17 (See https://builds.apache.org/job/HBase-0.94-security/17/)
        HBASE-5741 ImportTsv does not check for table existence (Himanshu) (Revision 1328032)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
        Lars Hofhansl added a comment -

        This was committed a while back... Marking it accordingly.


          People

          • Assignee: Himanshu Vashishtha
          • Reporter: Clint Heath
          • Votes: 1
          • Watchers: 3
