Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: JDBC
    • Labels: None

      Description

       With the Cassandra and HBase storage handlers, I thought it would make sense to include a generic JDBC RDBMS storage handler so that you could import a standard DB table into Hive. Many people must want to perform HiveQL joins and similar operations against tables in other systems.
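
       For illustration, a table definition backed by such a handler might look roughly like the sketch below. The handler class name and the SERDEPROPERTIES keys are hypothetical, since no implementation is attached to this issue yet; the intent is only to show how an external RDBMS table could be mapped and then joined with a native Hive table.

         -- Hypothetical sketch; the handler class and property names are illustrative only.
         CREATE EXTERNAL TABLE orders_jdbc (order_id INT, customer_id INT, total DOUBLE)
         STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
         WITH SERDEPROPERTIES (
           'jdbc.driver.class'   = 'org.apache.derby.jdbc.EmbeddedDriver',
           'jdbc.connection.url' = 'jdbc:derby:ordersdb',
           'jdbc.table.name'     = 'ORDERS'
         );

         -- A HiveQL join against a native Hive table, as described above.
         SELECT c.name, SUM(o.total)
         FROM orders_jdbc o JOIN customers c ON (o.customer_id = c.customer_id)
         GROUP BY c.name;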

        Issue Links

          Activity

          Teddy Choi added a comment -

          I'm having some trouble processing this further. I'll leave it unassigned. I'm sorry for changing my decision.

          Teddy Choi added a comment -

          Okay. I'll take it. I'll follow the existing design as closely as possible.

          I think MySQL is not a good starting point for testing. It runs with native executables and has driver licensing issues. I will use a pure Java database for testing first, then widen the coverage.

          Ashutosh Chauhan added a comment -

          Teddy Choi, I think you can take it up. I am not seeing any activity from the previous contributors on this jira.

          Teddy Choi added a comment -

          Ashutosh Chauhan, I want to take it. But I'm not sure whether I can do it. I'll take a look.

          Ashutosh Chauhan added a comment -

          Looking at the comments and the watchers list, it looks like there is a lot of interest in this, but I don't see any patch yet. Does someone want to take this up?

          Jakub Holy added a comment -

          Can I help in any way to get this into trunk?

          Luc Pezet added a comment -

          Any updates on this?

          Kasun Gunathilake added a comment -

          Hi Andrew,

          Is this finished? If you can, please share your patch; it would be very useful.

          ~Kasun

          Weihua Jiang added a comment -

          Hi Andrew,

          How is the integration progressing? Where can I find your patch? I am very interested in this feature and think I can help with your work.

          John Sichi added a comment -

          See HIVE-2468 for changes which make the build work with Hadoop 0.23.

          ravi bhatt added a comment -

          Has this progressed? @John Sichi, is there a jar which I can use to test the functionality you talked about?

          John Sichi added a comment -

          1+2) The Hadoop jar naming convention changed in 0.21; I hit this too recently when trying out a build against 0.21. I futzed around with the Hive build and got it working quick-and-dirty, but didn't save the patch. Looks like someone has submitted one on HIVE-1612 (I haven't taken a look at it yet). If you want to help push that through, it would be a good contribution by itself.

          3) Is it possible to make it work against Derby?

          Andrew Wilson added a comment -

          I'm struggling a little with getting this code integrated into the Hive trunk. I am trying to follow the pattern established by the hbase-handler.

          1) Right now the storage handler is implemented using the org.apache.hadoop.mapreduce.lib.db package that was introduced in 0.21.0. Is there a way to build against this distro? I tried running "$ ant -Dhadoop.version=0.21.0 package" but the hadoop-core.jar couldn't be resolved.

          2) Is there a way to indicate in the build.xml only to build this jar if the minimum hadoop version requirement is met?

          3) A lot of the unit tests for this storage handler currently depend on a local MySQL instance that the developers on my team all have available. I am unsure how to replicate this kind of testing resource in the Hive trunk.

          John Sichi added a comment -

          Thanks a lot, I've linked your PDF directly from the [[Hive/DesignDocs]] wiki page.

          Andrew Wilson added a comment -

          Hi all,

          I'm uploading a design doc for this storage handler. Please comment. Should I add this on the Hive wiki?

          I'm in the process of porting the code over to the Hive trunk, and am planning to follow the same pattern used by the HBase and Cassandra storage handlers.

          Tim Perkins added a comment -

          Hey... you need to get off this email address. I don't know who on your team is improperly claiming this address as their own, but they're mistaken.

          Please remove this address from your system.

          Andrew Wilson added a comment -

          Hi,

          Can I get this issue assigned to me? I have a basic implementation working, which I'd like to contribute.

          It wraps the DBInputFormat and DBOutputFormat classes.

          It expects values for the DBConfiguration properties to be provided through the SERDEPROPERTIES block in the create table statement. The configureTableJobProperties() method copies these properties out of the table description and into each job context.

          It also allows users to set SerDe properties which will cause the DBOutputFormat to generate UPSERT SQL statements or DELETE SQL statements instead of the vanilla INSERT SQL generated by default. Right now this feature has a MySQL bias. I am still trying to decide on the best way to make this more database-vendor-agnostic.

          Andrew
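
          As a hypothetical sketch of the usage described in the comment above: the keys below borrow the property names defined by Hadoop's DBConfiguration class, while the handler class name and the statement-type key are illustrative guesses, since the patch itself is not attached here. configureTableJobProperties() would then copy these values from the table description into each job's configuration, where DBInputFormat and DBOutputFormat pick them up.

            -- Sketch only: the handler class and 'jdbc.output.statement.type' are hypothetical.
            CREATE TABLE clicks_jdbc (click_id INT, url STRING, ts STRING)
            STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
            WITH SERDEPROPERTIES (
              'mapreduce.jdbc.driver.class'      = 'com.mysql.jdbc.Driver',
              'mapreduce.jdbc.url'               = 'jdbc:mysql://dbhost/analytics',
              'mapreduce.jdbc.username'          = 'hive',
              'mapreduce.jdbc.password'          = 'secret',
              'mapreduce.jdbc.input.table.name'  = 'clicks',
              'mapreduce.jdbc.output.table.name' = 'clicks',
              'jdbc.output.statement.type'       = 'UPSERT'
            );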

          Edward Capriolo added a comment -

          I wonder if this could end up being a very effective way to query shared data stores.

          I think I saw something like this in Futurama... Don't worry about querying blank, let me worry about querying blank.

          http://www.google.com/url?sa=t&source=web&cd=2&ved=0CBcQFjAB&url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DB5cAwTEEGNE&ei=Qk9sTLAThIqXB__DzDw&usg=AFQjCNH_TOUS1cl6t0gZXefRURw0a_feZg

          John Sichi added a comment -

          For an implementation possibility, see http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html
          Tim Perkins added a comment -

          This sounds great. We would love to be able to easily integrate our existing RDBMS reporting data directly into Hive. Getting everything from one frontend connected to Hive would make things much simpler.


            People

            • Assignee:
              Unassigned
              Reporter:
              Bob Robertson
            • Votes:
              13
              Watchers:
              33

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated: 24h
                Remaining: 24h
                Logged: Not Specified

                  Development