Hive > HIVE-2365

SQL support for bulk load into HBase

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: HBase Handler
    • Labels:
      None

      Description

      Support the "as simple as this" SQL for bulk load from Hive into HBase.
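      As a hedged illustration of the goal (the property name hive.hbase.bulkload appears later in this thread; the table and column names here are invented), the whole bulk load would reduce to something like:

      ```sql
      -- Illustrative sketch only; hbase_target and source_table are hypothetical.
      -- The point of this issue: no manual sampling, range partitioning,
      -- or HFile handling should be required from the user.
      SET hive.hbase.bulkload = true;

      INSERT OVERWRITE TABLE hbase_target
      SELECT row_key, col_value
      FROM source_table;
      ```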

      Attachments

      1. HIVE-2365.2.patch.txt
        47 kB
        Nick Dimiduk
      2. HIVE-2365.WIP.01.patch
        42 kB
        Nick Dimiduk
      3. HIVE-2365.WIP.01.patch
        435 kB
        Nick Dimiduk
      4. HIVE-2365.WIP.00.patch
        38 kB
        Nick Dimiduk

        Issue Links

          Activity

          Hive QA added a comment -

          Overall: -1 no tests executed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12629897/HIVE-2365.2.patch.txt

          Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1429/testReport
          Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1429/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Tests exited with: NonZeroExitCodeException
          Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n '' ]]
          + export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
          + ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
          + export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
          + M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
          + cd /data/hive-ptest/working/
          + tee /data/hive-ptest/logs/PreCommit-HIVE-Build-1429/source-prep.txt
          + [[ false == \t\r\u\e ]]
          + mkdir -p maven ivy
          + [[ svn = \s\v\n ]]
          + [[ -n '' ]]
          + [[ -d apache-svn-trunk-source ]]
          + [[ ! -d apache-svn-trunk-source/.svn ]]
          + [[ ! -d apache-svn-trunk-source ]]
          + cd apache-svn-trunk-source
          + svn revert -R .
          Reverted 'ql/src/test/results/clientnegative/authorization_set_role_neg2.q.out'
          Reverted 'ql/src/java/org/apache/hadoop/hive/ql/security/authorization/plugin/sqlstd/SQLStdHiveAccessController.java'
          ++ egrep -v '^X|^Performing status on external'
          ++ awk '{print $2}'
          ++ svn status --no-ignore
          + rm -rf target datanucleus.log ant/target shims/target shims/0.20/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/common-secure/target packaging/target hbase-handler/target testutils/target jdbc/target metastore/target itests/target itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit/target itests/custom-serde/target itests/util/target hcatalog/target hcatalog/storage-handlers/hbase/target hcatalog/server-extensions/target hcatalog/core/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target hcatalog/hcatalog-pig-adapter/target hwi/target common/target common/src/gen contrib/target service/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target ql/src/test/results/clientnegative/authorization_role_grant.q.out ql/src/test/results/clientpositive/authorization_role_grant2.q.out ql/src/test/queries/clientnegative/authorization_role_grant.q ql/src/test/queries/clientpositive/authorization_role_grant2.q
          + svn update
          D    ql/src/test/queries/clientpositive/authorization_view.q
          U    ql/src/test/queries/clientpositive/authorization_view_sqlstd.q
          D    ql/src/test/results/clientpositive/authorization_view.q.out
          U    ql/src/test/results/clientpositive/authorization_view_sqlstd.q.out
          U    ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g
          
          Fetching external item into 'hcatalog/src/test/e2e/harness'
          Updated external to revision 1570354.
          
          Updated to revision 1570354.
          + patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
          + patchFilePath=/data/hive-ptest/working/scratch/build.patch
          + [[ -f /data/hive-ptest/working/scratch/build.patch ]]
          + chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh
          + /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch
          The patch does not appear to apply with p0, p1, or p2
          + exit 1
          '
          

          This message is automatically generated.

          ATTACHMENT ID: 12629897

          Nick Dimiduk added a comment -

          I've separated the HBaseStorageHandler changes into their own issue, HIVE-6473. When that ticket gets some attention, I'll rebase this patch onto that one. I hope that will make the needed SA hooks more apparent.

          Nick Dimiduk added a comment -

          This patch separates HFile generation from the new completebulkload task. Also adds a couple more test cases.

          Nick Dimiduk added a comment -

          Regenerated to correct the patch.

          Nick Dimiduk added a comment -

          Rebased onto trunk and includes a basic positive test with explain plan.

          Nick Dimiduk added a comment -

          Here's a proof of concept patch. It allows tables defined using the HBaseStorageHandler to switch into "bulk mode" when hive.hbase.bulkload=true. It adds functionality to MoveTask, allowing it to complete the bulkload as a post-job task. This task is always appended when "bulk mode" is detected.

          I've tried to keep the modifications in hive-exec to a minimum. However, I was forced to introduce HBase as a dependency of that module. Please advise how I might add necessary hooks to move these code paths back into the hbase-handler module.

          Nick Dimiduk added a comment -

          After much fighting with input data and ordering, I have my first little improvement. I've started a WIP branch over on GitHub. I will regularly rewrite its history, but if you'd like to follow along, I'll take comments as they come. Once things take shape, I'll squash into a patch and attach here.

          The patch posted supports generating HFiles from a table defined using the HBaseStorageHandler. The next improvement is to actually rewrite the plan to introduce a step that invokes LoadIncrementalHFiles. After that, we can get rid of the need to specify hfile.family.path: just detect the column family from the mapping attribute and write the HFiles to a temporary location before loading.
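          A rough sketch of the interim workflow this comment describes (the two property names are taken from this thread; the path and table names are invented):

          ```sql
          -- Hypothetical interim usage: HFiles are written to an explicit
          -- per-column-family path, then loaded into HBase with
          -- LoadIncrementalHFiles as a separate step.
          SET hive.hbase.bulkload = true;
          SET hfile.family.path = /tmp/hbase_bulk/cf;

          INSERT OVERWRITE TABLE hbase_target
          SELECT row_key, col_value
          FROM source_table;
          ```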

          Navis added a comment -

          Nick Dimiduk I think so. But I've forgotten the whole context of HIVE-4765 (which is 6 months old), so I cannot be sure of that.

          Nick Dimiduk added a comment -

          Ah, interesting, thanks Navis. I hadn't gotten to the sampling problem yet. Does HIVE-4765 resolve the problems I saw in HIVE-4627?

          Navis added a comment -

          Nick Dimiduk You might be interested in HIVE-4765, also.

          Nick Dimiduk added a comment -

          So I've been talking to Sushanth Sowmyan and Vikram Dixit K about this. My thinking is to start with a prototype that extends the planner to detect the HBase table and a "use bulkload" flag. It'll then make the appropriate modifications by way of a new HBaseSinkOperation or similar. That'll highlight any other underlying issues (I'm suspicious of HIVE-4627). That'll give me a chance to learn the Hive code a bit and let you guys work out how exactly the hbase handler should hook into the planner.

          CC Brock Noland, Viraj Bhat as you folks may also be interested.

          Nick Dimiduk added a comment -

          Let me take a crack at this one.

          John Sichi added a comment -

          (Just realized I forgot to link the original doc where "as simple as this" is mentioned.)

          https://cwiki.apache.org/confluence/display/Hive/HBaseBulkLoad

          This issue pertains to INSERT of large amounts of data into HBase from Hive (not CREATE; I'll follow up separately in HIVE-2373).

          The major challenges here are:

          • automating the sampling needed for coming up with the range partitioning for the global sort
          • extending Hive's INSERT to express the whole thing
          • chaining together the sampling job with the actual load job and tying together the relevant bits such as temporary file locations (we've had success doing something similar via reentrant SQL for index load/query statements)
          • making the load use the HBase bulk load API which was added subsequent to the original Hive work
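          The first bullet, sampling for range partitioning, can be sketched in HiveQL roughly as follows (table and column names are invented; the real procedure on the linked wiki page goes on to derive split points from these keys for the globally sorted load):

          ```sql
          -- Hypothetical sketch of the sampling step this issue would automate:
          -- take a small sorted sample of the source keys as raw material
          -- for choosing range-partition boundaries.
          CREATE TABLE hb_range_keys (row_key STRING);

          INSERT OVERWRITE TABLE hb_range_keys
          SELECT row_key
          FROM source_table TABLESAMPLE (BUCKET 1 OUT OF 1000 ON rand()) s
          SORT BY row_key;
          ```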
          Alex Newman added a comment -

          So I would love to take this up, but I have a couple of questions.

          Is the issue that the create statement is too complicated? You do have to specify a lot of things.
          Do you want all of the steps in one SQL command?
          What's the right balance of configurability vs automagically?


            People

            • Assignee:
              Nick Dimiduk
              Reporter:
              John Sichi
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
