Cassandra
CASSANDRA-4131

Integrate Hive support into core Cassandra

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      The standalone Hive support (at https://github.com/riptano/hive) would be great to have in-tree so that people don't have to go out to GitHub to download it and wonder if it's a left-for-dead external shim.

        Activity

        Edward Capriolo added a comment -

        Yes. We need to get this in-tree somewhere, whether Hive or Cassandra. The current split really hurts our evolution and makes it hard to manage projects such as https://github.com/edwardcapriolo/hive_cassandra_udfs. I think we should put this code in Cassandra's tree and use Hive from Maven to build. I also have a test kit, https://github.com/edwardcapriolo/hive_test, that can bring up an embedded Hive for integration testing.

        What version of Cassandra and Hive should we target? I will assign to myself for now because I am very interested in seeing this happen.

        T Jake Luciani added a comment -

        I think most of the work will be making a standalone build.xml to fetch the Hive Maven artifacts and create the cassandra-handler.jar. I think we just drop the Hive test suite and integrate our own.

        Edward Capriolo added a comment -

        Agreed. What versions of Hive and Cassandra should we target? Hive 0.8.0 and Cassandra 1.1.0? Where exactly do the latest Hive handler sources live?

        T Jake Luciani added a comment -

        The latest code is https://github.com/riptano/hive/tree/hive-0.8.1-merge

        The Cassandra version should be trunk (1.1) since it uses the same Thrift version as Hive 0.7.0.

        The only thing I want to do is put the CassandraProxyClient code into the main Cassandra tree and use that for Hadoop calls, since it's much more reliable for us. The Hive driver currently depends on its own version of that class.

        Chris Romary added a comment -

        Wondering about the status of Cassandra/Hive integration... is it a 'left-for-dead external shim' or something that's still actively being worked on? None of the GitHub repositories mentioned above have had commits in the last 10 months or so. What is the status of Hive with Cassandra 1.2?

        Christian Moen added a comment -

        I'm also curious to know what the plans are here. Thanks for any info.

        Dmitry Vasilenko added a comment -

        This can be of some interest:

        https://github.com/dvasilen/Hive-Cassandra/blob/HIVE-0.10.0-CASSANDRA-1.2.4/release/hive-0.10.0-cassandra-1.2.4.jar

        https://github.com/dvasilen/Hive-Cassandra/blob/HIVE-0.9.0-CASSANDRA-1.2.4/release/hive-0.9.0-cassandra-1.2.4.jar

        I was testing Cassandra 1.2.3/Hive 0.10.0/HCatalog 0.5.0 and had to recompile the code of the storage handler to make it work with the latest versions.

        Jonathan Ellis added a comment -

        Is that from Jake's branch? I'm kind of surprised if you didn't need more than a recompile.

        Dmitry Vasilenko added a comment -

        I had to refactor the code slightly to conform to the new APIs but other than that it was relatively straightforward.

        Oliver Zhou added a comment -

        Hi Dmitry,

        I tried your build with Cassandra 1.2.3/Hive 0.9.0, and I have an issue where I always get duplicated records in Hive.

        Cassandra column family:
        CREATE COLUMN FAMILY users
        WITH comparator = UTF8Type
        AND key_validation_class = UTF8Type
        AND column_metadata = [
          {column_name: full_name, validation_class: UTF8Type},
          {column_name: email, validation_class: UTF8Type},
          {column_name: state, validation_class: UTF8Type},
          {column_name: gender, validation_class: UTF8Type},
          {column_name: birth_year, validation_class: LongType}
        ];

        Hive Table:
        CREATE EXTERNAL TABLE IF NOT EXISTS
        users (key string, full_name string)
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
        WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,users:full_name" , "cassandra.cf.name" = "users")
        TBLPROPERTIES ("cassandra.ks.name" = "ks33");

        Hive Query:
        select * from users;
        always returns duplicated rows (each row appears twice)

        select count(1) from users;
        returns 2, but I only inserted one row.

        Do you have any idea why this happens?

        Cyril Scetbon added a comment -

        Any news about this last issue with duplicate rows?

        Rohit Rai added a comment -

        Hey, I managed to get the Hive Cassandra handler to compile against 1.2.6, and all the test cases from the DataStax Hive repo for it are passing...

        The code is here -
        https://github.com/milliondreams/hive/tree/cas-support/cassandra-handler

        I am facing the same issue that Oliver and Cyril mention, with the same row appearing twice in a mapped column.

        I will start debugging that tomorrow, but it would be great if someone can point me in the right direction.

        Cyril Scetbon added a comment - edited

        Are the duplicates just tombstones that are not being filtered out, as suggested at https://issues.apache.org/jira/browse/CASSANDRA-4421?focusedCommentId=13658450&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13658450 ? If so, we could use the Column.isLive() method to identify and skip them.

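        As a minimal sketch of that idea, assuming the record reader iterates each row's columns and that the targeted Cassandra version exposes isLive() on its column class (IColumn on 1.2; Column on 2.0), the filter could look roughly like this:

        import java.nio.ByteBuffer;
        import java.util.Map;
        import java.util.SortedMap;
        import java.util.TreeMap;

        import org.apache.cassandra.db.IColumn;

        // Sketch only, not handler code: drop tombstoned columns from a row returned
        // by the Cassandra input format before the row is handed to Hive.
        public final class TombstoneFilter
        {
            private TombstoneFilter() {}

            public static SortedMap<ByteBuffer, IColumn> liveColumns(SortedMap<ByteBuffer, IColumn> row)
            {
                SortedMap<ByteBuffer, IColumn> live = new TreeMap<ByteBuffer, IColumn>(row.comparator());
                for (Map.Entry<ByteBuffer, IColumn> entry : row.entrySet())
                {
                    // isLive() is false for deleted (tombstoned) and expired columns
                    if (entry.getValue().isLive())
                        live.put(entry.getKey(), entry.getValue());
                }
                return live;
            }
        }
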
        Rohit Rai added a comment -

        So, I did get some time today to debug this...

        They are not tombstones, as the duplicates are present even for rows that have never been edited. EVERY row is repeated. I tried using the CFIF and HiveCassandraStandardColumnInputFormat directly in a Map/Reduce program, but they didn't give the duplicates.

        So it must be something in the CassandraStorageHandler.

        Will look more into it later.

        Rohit Rai added a comment -

        So I figured it out...

        This is being caused by a difference in partitioners.

        I am using the Murmur3 partitioner on my CF, but since I didn't specify it during table creation in Hive, the Hive input format defaults to the RandomPartitioner!

        My bad, I didn't notice it while testing in Map/Reduce because I set the partitioner there.

        I will be testing it extensively over the next few weeks, including one deployment in production. Will update if there are any more issues.

        Is there any interest in getting this in Cassandra? If yes, I can make a patch and submit.

        On second thought, shouldn't it default to Murmur3, since it is the default partitioner now?

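        For readers hitting the same symptom outside the handler, a minimal sketch of pinning the partitioner on the Hadoop job configuration with the 1.2 ConfigHelper (the handler exposes the same choice through a table property; the exact property name depends on the handler version):

        import org.apache.cassandra.hadoop.ConfigHelper;
        import org.apache.hadoop.conf.Configuration;

        // Sketch only: the partitioner assumed by the input format must match the
        // cluster's actual partitioner, otherwise token-range splits are computed
        // against the wrong token space and rows can be returned more than once.
        public class PartitionerConfig
        {
            public static Configuration withMurmur3(Configuration conf)
            {
                ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.Murmur3Partitioner");
                ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.Murmur3Partitioner");
                return conf;
            }
        }
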
        Cyril Scetbon added a comment - edited

        Great! I think it should default to Murmur3, as it's built for 1.2.6.
        I've run your tests and they failed: http://pastebin.com/LqQJLjzn. What are the prerequisites? Launching Cassandra with the configuration files you provide? I was hoping that your tests would launch it themselves, or mock Cassandra's functions.

        Nicolas Lalevée added a comment -

        FYI, I have used the Hive/Cassandra storage patch for some time now, only for writes, and I had to modify some lines to make it work properly. I never took the time to figure out whether it was because of my environment or a real bug. Maybe you should look into it.

        It is about the precision of the timestamp. In the patches, the timestamp is set to System.currentTimeMillis(). As far as I understand, the Cassandra command-line client uses microsecond precision, so once a write has happened from the command-line client, every write from Hive will be ignored.

        For instance, here are lines which I patched:
        https://github.com/milliondreams/hive/blob/cas-support/cassandra-handler/src/main/java/org/apache/hadoop/hive/cassandra/serde/RegularTableMapping.java#L84
        https://github.com/milliondreams/hive/blob/cas-support/cassandra-handler/src/main/java/org/apache/hadoop/hive/cassandra/serde/RegularTableMapping.java#L94
        https://github.com/milliondreams/hive/blob/cas-support/cassandra-handler/src/main/java/org/apache/hadoop/hive/cassandra/serde/TransposedMapping.java#L45
        https://github.com/milliondreams/hive/blob/cas-support/cassandra-handler/src/main/java/org/apache/hadoop/hive/cassandra/serde/TransposedMapping.java#L63

        Instead I used: cc.setTimeStamp(FBUtilities.timestampMicros());

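        A minimal sketch of the precision gap being described, assuming Cassandra's FBUtilities is on the classpath as it is for the handler:

        import org.apache.cassandra.utils.FBUtilities;

        // Sketch only: a millisecond timestamp is about 1000x smaller than a
        // microsecond one, so a column written from the CLI (microseconds) always
        // wins last-write-wins reconciliation against a later Hive write that used
        // System.currentTimeMillis().
        public class TimestampPrecision
        {
            public static void main(String[] args)
            {
                long millis = System.currentTimeMillis();    // 13-digit value
                long micros = FBUtilities.timestampMicros(); // 16-digit value
                System.out.println("millis = " + millis);
                System.out.println("micros = " + micros);
                System.out.println("a Hive write using millis would lose: " + (millis < micros));
            }
        }
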
        Cyril Scetbon added a comment - edited

        Do all tests run without issues? I can't make them run successfully (Java 6 or 7):

        Tests in error:
        CassandraFileSystemTest.testFileSystemWithoutFlush:63->testFileSystem:74 ? IO ...
        CassandraFileSystemTest.testFileSystemWithFlush:68->testFileSystem:74 ? IO org...
        CassandraHiveMetaStoreTest.testSetConf:32 ? CassandraHiveMetaStore There was a...
        CassandraHiveMetaStoreTest.testCreateDeleteDatabaseAndTable:52 ? CassandraHiveMetaStore
        CassandraHiveMetaStoreTest.testFindEmptyPatitionList:78 ? CassandraHiveMetaStore
        CassandraHiveMetaStoreTest.testAlterTable:99 ? CassandraHiveMetaStore There wa...
        CassandraHiveMetaStoreTest.testAlterDatabaseTable:122 ? CassandraHiveMetaStore
        CassandraHiveMetaStoreTest.testAddParition:144 ? CassandraHiveMetaStore There ...
        CassandraHiveMetaStoreTest.testCreateMultipleDatabases:174 ? CassandraHiveMetaStore
        CassandraHiveMetaStoreTest.testAddDropReAddDatabase:186 ? CassandraHiveMetaStore
        CassandraHiveMetaStoreTest.testCaseInsensitiveNaming:207 ? CassandraHiveMetaStore
        CassandraHiveMetaStoreTest.testAutoCreateFromKeyspace:229 ? TTransport java.ne...
        MetaStorePersisterTest.testBasicPersistMetaStoreEntity:52->setupClient:44 ? CassandraHiveMetaStore
        MetaStorePersisterTest.testEntityNotFound ? Unexpected exception, expected<or...
        MetaStorePersisterTest.testBasicLoadMetaStoreEntity:73->setupClient:44 ? CassandraHiveMetaStore
        MetaStorePersisterTest.testFindMetaStoreEntities:89->setupClient:44 ? CassandraHiveMetaStore
        MetaStorePersisterTest.testEntityDeletion:116->setupClient:44 ? CassandraHiveMetaStore
        SchemaManagerServiceTest.setupLocal:45 ? CassandraHiveMetaStore There was a pr...
        SchemaManagerServiceTest.setupLocal:45 ? CassandraHiveMetaStore There was a pr...
        SchemaManagerServiceTest.setupLocal:45 ? CassandraHiveMetaStore There was a pr...
        SchemaManagerServiceTest.setupLocal:45 ? CassandraHiveMetaStore There was a pr...
        SchemaManagerServiceTest.setupLocal:45 ? CassandraHiveMetaStore There was a pr...
        SchemaManagerServiceTest.setupLocal:45 ? CassandraHiveMetaStore There was a pr...

        Tests run: 37, Failures: 0, Errors: 23, Skipped: 0

        Rohit Rai added a comment -

        Sorry for the mess there. I was trying to port CFS and the Hive metastore too, but those tests don't work right now, so I have put that on hold. Getting it to work with CQL3 column families is a priority for me right now, so I will come back to those later.

        Just for the Hive handler, please look at the (cas-support-simple-hive) branch -
        https://github.com/milliondreams/hive/tree/cas-support-simple-hive

        All the test cases (the few they had) pass there, and it is working perfectly with Thrift/compact storage column families.

        Cyril Scetbon added a comment -

        getting it to work with CQL3 Column Families is a priority for me right now

        Okay, that's exactly the feature I'm waiting for. You could find inspiration in CASSANDRA-5234, e.g. its paging.

        Rohit Rai added a comment -

        Actually, Hive support internally uses the Cassandra Hadoop Input format... and thankfully we now have CqlPagingInputFormat support in 1.2.6.

        So I have got basic CQL3 column family support (reading) in, and it is working. I haven't done extensive testing and need to write some test cases... but I could run it against CQL column families with simple as well as composite primary keys. The code is here if you want to give it a try: https://github.com/milliondreams/hive/tree/cas-support-cql

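        A minimal sketch of what wiring CqlPagingInputFormat into a Hadoop job looks like on 1.2.6 (the handler does this internally; the keyspace, table, host and page size below are placeholder values):

        import org.apache.cassandra.hadoop.ConfigHelper;
        import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
        import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        // Sketch only: a plain MapReduce job reading a CQL3 table through the paging
        // input format that shipped with Cassandra 1.2.6.
        public class CqlJobSetup
        {
            public static Job newJob(Configuration conf) throws Exception
            {
                Job job = new Job(conf, "cql3-read");
                job.setInputFormatClass(CqlPagingInputFormat.class);
                ConfigHelper.setInputColumnFamily(job.getConfiguration(), "ks", "cf");
                ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
                ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
                ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.Murmur3Partitioner");
                CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000"); // rows fetched per page
                return job;
            }
        }
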
        Cyril Scetbon added a comment - edited

        Did you have more time to test it? I'll give it a try as soon as I can.

        Cyril Scetbon added a comment - edited

        I've hit a performance issue when there is only a little data. In my example, I have only a few rows:

        cqlsh>select count(*) from light_column;
        
         count
        -------
             4
        

        It takes less than a second with cqlsh, whereas it takes nearly 600 seconds with Hive. Please see the logs at http://pastebin.com/ippy96GY
        There are 257 mappers (to scan data from 256 vnodes) and they took a lot of CPU, even though the process says at the end:
        Total MapReduce CPU Time Spent: 0 msec

        Another issue is that the count is wrong: it returns 5 instead of 4, caused by a deleted row being counted as live!

        Cyril Scetbon added a comment -

        I've added 2 commits, available at https://github.com/cscetbon/hive, which:

        • set the default partitioner to Murmur3
        • skip deleted rows when reading from Cassandra

        I also sent you a pull request on GitHub.
        Cyril Scetbon added a comment -

        As for the performance issue, I suppose it's simply inherent to Hadoop internals, which are designed for lots of data and not for a few rows.

        Cyril Scetbon added a comment -

        The tests I made show that CQL3 tables are not seen when I try to create the external table in Hive. This is because Thrift does not return CQL3 tables, and you use it (through describe_keyspace) to get column family definitions.

        Rohit Rai added a comment -

        What do you mean by not being able to see the external table?

        We are in fact using a CQL query to get the CF names:
        "select columnfamily_name from system.schema_columnfamilies where keyspace_name='%s';"

        Are you looking at the cql branch?

        https://github.com/milliondreams/hive/blob/cas-support-cql

        Rohit Rai added a comment -

        On another note, we now have SELECT and INSERT working. We also have support for CREATE TABLE when one doesn't exist in C*.

        I noticed this blog:
        http://www.planetcassandra.org/blog/post/support-cql3-tables-in-hadoop-pig-and-hive

        Is this only in DSE? Does DataStax plan to release it?

        Cyril Scetbon added a comment - edited

        Okay, I'm using your branch (cas-support-cql), but I was using CassandraStorageHandler instead of CqlStorageHandler in the Hive DDL command. Is there documentation somewhere that describes how to use your driver? I don't have any issue when I use CassandraStorageHandler with a CQL2 table, but when I use CqlStorageHandler with a CQL3 table I get:

        java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.io.MapWritable cannot be cast to org.apache.hadoop.io.WritableComparable
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)

        FYI, I use Hadoop 1.2.1 (as Hadoop 2 is not supported) and tried with both Hive 0.10.0 and 0.11.0. I also tried Hadoop 0.20.2, as you use version 0.20.205.0 as a dependency in the Maven configuration file (pom.xml).

        Any ideas?
        You can find the complete trace at http://pastebin.com/0wCL3132

        Cyril Scetbon added a comment -

        I dug into the code, and it comes from the fact that in the class org.apache.hadoop.hive.cassandra.input.cql.CqlHiveRecordReader, the createKey() method returns a MapWritable object, whereas in the class org.apache.hadoop.hive.ql.exec.FetchOperator, the getRecordReader() method tries to get the key with the following code:

        key = currRecReader.createKey();

        But key is declared as a WritableComparable, so it can't hold the MapWritable object returned by CqlHiveRecordReader.createKey().
        Tell me if I'm wrong, or if you have any patches to apply.

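        One possible shape for a fix, sketched purely as an illustration (not the handler's actual solution): keep the key a MapWritable but make it satisfy the WritableComparable contract that FetchOperator casts to.

        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.io.WritableComparable;

        // Sketch only: the ordering here is deliberately trivial; a real fix would
        // compare the mapped key columns, or change FetchOperator's expectations.
        public class ComparableMapWritable extends MapWritable
                implements WritableComparable<MapWritable>
        {
            @Override
            public int compareTo(MapWritable other)
            {
                return Integer.valueOf(this.hashCode()).compareTo(Integer.valueOf(other.hashCode()));
            }
        }
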
        Rohit Rai added a comment -

        Sorry, I haven't been able to give this my full attention in the past month. One of our developers is working on it... and with the latest merge, most of the functionality should be working.

        You can try with this document,
        https://github.com/milliondreams/hive/blob/cas-support-cql/cassandra-handler/README

        Cyril, in the meantime, when I look at the code I also notice what you are describing; we will try to figure it out. Give it a try with the README and let us know how that goes.

        Cyril Scetbon added a comment - edited

        I tried your README documentation (I was almost doing the same things) and I got the same error: http://pastebin.com/KTRPx2Fh. As you can see, I got no error creating the column family "messages", which didn't exist.
        I can't understand how your tests (SELECT * FROM xxx) worked successfully with this incompatible cast... Any more information about that?

        Cyril Scetbon added a comment -

        It's weird that this issue has had Major priority since April 2012, and that there is now a DataStax version that is not included in the trunk...

        Marcel added a comment -

        Will there be an update of the cassandra-handler for Cassandra 2.0.0?

        I have it working on Cassandra 2.0.0 (with a slight problem connecting to the Cassandra cluster), but I'm noticing that the number of mappers is equal to the number of vnodes (default 256 per node). I think this should be equal to the number of nodes.

        The problem with connecting to the Cassandra cluster was that the cassandra.host property passed in the CREATE TABLE statement in Hive didn't get picked up. When I hardcoded it (replaced localhost with the IP address), it worked.

        Alex McLintock added a comment -

        I can't access https://github.com/riptano/hive at all... 404 error.

        I'd love to know the status of this... Is it true that Cassandra can be used as a data source for Hadoop's Hive, but only if I use the DataStax version of Cassandra?


          People

          • Assignee: Edward Capriolo
          • Reporter: Jeremy Hanna
          • Votes: 21
          • Watchers: 30
