Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: Examples
    • Labels:

      Description

      http://hadoop.apache.org/hive/ is a project that runs SQL queries against Hadoop map/reduce clusters. (For analytics; it is too high-latency to run applications against Hive directly). HIVE-705 added support for backends other than HDFS, with HBase as the first. Cassandra support should be doable too now.

      The Hive storage backends are described in http://wiki.apache.org/hadoop/Hive/StorageHandlers and the HBase backend specifically in http://wiki.apache.org/hadoop/Hive/HBaseIntegration.

      I also note that John Sichi, author of the HBase backend, seems like a helpful guy and I imagine would be totally cool with answering questions about implementation details.

        Issue Links

          Activity

          Jonathan Ellis created issue -
          Jonathan Ellis made changes -
          Field Original Value New Value
          Summary Add Hive support for Cassandra Add Hive support
          Description Hive is a project that runs SQL queries against Hadoop map/reduce clusters. HIVE-705 added support for backends other than HDFS, with HBase as the first. Cassandra support should be doable too now.

          The Hive storage backends are described in http://wiki.apache.org/hadoop/Hive/StorageHandlers and the HBase backend specifically in http://wiki.apache.org/hadoop/Hive/HBaseIntegration.

          I also note that John Sichi, author of the HBase backend, seems like a helpful guy and I imagine would be totally cool with answering questions about implementation details.
          http://hadoop.apache.org/hive/ is a project that runs SQL queries against Hadoop map/reduce clusters. (For analytics; it is too high-latency to run applications against Hive directly). HIVE-705 added support for backends other than HDFS, with HBase as the first. Cassandra support should be doable too now.

          The Hive storage backends are described in http://wiki.apache.org/hadoop/Hive/StorageHandlers and the HBase backend specifically in http://wiki.apache.org/hadoop/Hive/HBaseIntegration.

          I also note that John Sichi, author of the HBase backend, seems like a helpful guy and I imagine would be totally cool with answering questions about implementation details.
          p shirish reddy made changes -
          Link This issue is cloned as CASSANDRA-931 [ CASSANDRA-931 ]
          Jonathan Ellis added a comment -

          Starting points:

          The Cassandra inputformat for Hadoop is in org.apache.cassandra.hadoop.ColumnFamilyInputFormat; the record reader and input split are in the same package. There's an example of using these in contrib/word_count, and Pig integration in contrib/pig.

          You can look at the .7 patch to HIVE-705 to see how HBase support was added. Unfortunately, it is not split into "Hive infrastructure refactoring" and "HBase support"; the two are mixed together.

          John Sichi added a comment -

          Regarding HIVE-705, all files under hbase-handler constitute the HBase support, and the rest is Hive infrastructure refactoring, so you can use that split for reviewing them separately.

          Jonathan Ellis added a comment -

          Awesome, thanks for pointing that out!

          p shirish reddy added a comment -

          I have submitted a proposal for this as part of a GSoC project. The link to the proposal is http://socghop.appspot.com/gsoc/student_proposal/private/google/gsoc2010/shirish_reddy_89/t127072582147. I'd like suggestions and comments.

          Jeremy Hanna made changes -
          Link This issue is related to HIVE-1434 [ HIVE-1434 ]
          Jonathan Ellis added a comment -

          Closing in favor of HIVE-1434.

          Jonathan Ellis made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Nicolas Lalevée added a comment -

          I cannot reopen this issue, so I'll just comment.

          As suggested by Jonathan in HIVE-1434, a Hive/Cassandra bridge may fit better here.

          I have finally found the source of Brisk's implementation (https://github.com/riptano/hive). The patch I am submitting here (CASSANDRA-913-r1199213.patch) is based on their work. So I cannot grant any license here.

          What I did on the original source:

          • Changed the package names (for some classes, package-level access was needed)
          • Added ASL2 headers for the ASF
          • Formatted the code according to the Cassandra standard
          • Changed some loggers from log4j and commons-logging to slf4j
          • Fixed the poor handling of nulls in Hive tables, at least for the few tests I did

          About the build: it needs the Hive jars in contrib/hive/lib. I don't know how to set this up better, since those jars are not available in the Maven repo.

          About runtime: I had a lot of trouble due to a conflict between the Thrift library used by Hive and the one used by Cassandra. Hive 0.7 uses Thrift 0.5, Cassandra uses 0.6. A Cassandra external table could not be declared in Hive because of a NoSuchMethodException.
          As far as I understand Hive, it needs Thrift at job runtime only for handling dynamic column serialization. My use case didn't need that, so I did a hack: I removed every org.apache.thrift class from hive-exec.jar. Then it works nicely (for my use case).
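
          For anyone who wants to reproduce that workaround, here is a minimal sketch of one way to strip a package out of a jar, assuming nothing beyond the Python standard library. The function name and the output file name are hypothetical and not part of the patch; the original comment does not say which tool was actually used. Note that a signed jar would no longer verify after this kind of rewrite.

```python
import zipfile


def strip_package(src_jar, dst_jar, prefix="org/apache/thrift/"):
    """Copy src_jar to dst_jar, skipping every entry under `prefix`.

    Returns the names of the entries that were left out.
    """
    removed = []
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename.startswith(prefix):
                # A jar is a zip archive, so classes in package
                # org.apache.thrift live under org/apache/thrift/.
                removed.append(info.filename)
                continue
            # Copy every other entry unchanged under its original name.
            dst.writestr(info.filename, src.read(info))
    return removed
```

          A call such as strip_package("hive-exec.jar", "hive-exec-nothrift.jar") writes a new jar and leaves the original untouched, so the hack is easy to undo if a job turns out to need Thrift after all.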

          There were some tests in the GitHub repo. They are Hive oriented. I'm too lazy to try to make them work in Cassandra's source tree.

          Hive 0.8 will use Thrift 0.7 (hopefully backward compatible with 0.6), and Hive artifacts will be published on the Maven repository (HIVE-1095). So it is probably best to wait until then for easier integration in Cassandra?

          Nicolas Lalevée made changes -
          Attachment CASSANDRA-913-r1199213.patch [ 12502913 ]
          T Jake Luciani added a comment -

          Hi Nicolas,

          Thanks for looking at this. As you mention, we need to figure out how to get the tests working locally. This probably requires the Hive test artifacts to be deployed in Maven.

          We are currently using the cassandra-1.0 branch on GitHub, so that should have the latest changes. Cassandra 1.1 will upgrade to Thrift 0.7 (CASSANDRA-3213), at which point we should work with Hive 0.8 without conflicts.

          Sam Tunnicliffe made changes -
          Link This issue relates CASSANDRA-4613 [ CASSANDRA-4613 ]
          Brandon Williams made changes -
          Link This issue relates to CASSANDRA-4613 [ CASSANDRA-4613 ]
          Brandon Williams made changes -
          Link This issue relates CASSANDRA-4613 [ CASSANDRA-4613 ]
          Sam Tunnicliffe made changes -
          Link This issue relates to CASSANDRA-4613 [ CASSANDRA-4613 ]
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12502652 ] patch-available, re-open possible [ 12753288 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12753288 ] reopen-resolved, no closed status, patch-avail, testing [ 12758675 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Jonathan Ellis
            • Votes:
              0
              Watchers:
              9
