Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: contrib
    • Labels: None

      Description

      Need to look into adding support for Accumulo to Hive

      Attachments

      1. ACCUMULO-143.patch (188 kB) by Brian Femiano

        Issue Links

          This issue relates to HIVE-7068
          This issue relates to ACCUMULO-756

          Activity

          Transition                     Time In Source Status   Execution Times   Last Executer   Last Execution Date
          Open -> Patch Available        535d 17h 5m             1                 Brian Femiano    03/May/13 07:58
          Patch Available -> Resolved    476d 13h 44m            1                 Sean Busbey      22/Aug/14 21:42
          Sean Busbey made changes -
          Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
          Resolution: Won't Fix [ 2 ]
          Sean Busbey added a comment -

          Now that HIVE-7068 has landed for Hive 0.14, closing this as wontfix.

          Carl Austin added a comment -

          Also, I just realised I missed code in the above. Right at the top of the first snippet should be:

          List<String> columns = new ArrayList<String>();
          ParseDriver driver = new ParseDriver();
          
          Carl Austin added a comment -

          Thanks Josh Elser, great to know that there is going to be more progress on this. Coincidentally, I think I've recently got INSERT working too, also mostly untested, but I'll take a look at what you've done for comparison's sake when I get a minute.
          I noticed that this patch removes a lot of the parsing from the record reader, as well as the blank initialisation, something I also had to do to get any kind of performance at scale. That, combined with the column fetch, has significantly improved the overall performance (a count(col) on 11 million values is down from tens of minutes to a couple of minutes on my small test cluster), and I'm going to be looking at whether I can eke any more speed out of it in the coming week or so, as well as testing how it compares to plain file-based external tables. If I find anything more I'll let you know.

          Josh Elser added a comment -

          David Medinets, you should refer to the Hive Wiki for documentation on how to build the code.

          David Medinets added a comment -

          I tried to 'mvn package' the project at https://github.com/joshelser/hive/tree/HIVE-7068 but ran into the following message for the 'common' module:

          [ERROR] Failed to execute goal on project hive-common: Could not resolve dependencies for project org.apache.hive:hive-common:jar:0.14.0-SNAPSHOT: Could not find artifact org.apache.hive:hive-shims:jar:0.14.0-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots)

          Josh Elser added a comment -

          Hi Carl Austin,

          I started re-hashing some of Brian's code up on my own GitHub account. Thanks for pointing this out. We definitely do not want to be pulling back all columns and filtering on the client (especially when we might also benefit from locality groups on the server). I stubbed out a lot of new code that tries to better encapsulate the workflow and minimize repetitive parsing of things out of the table properties and job configuration. I think I have also gotten INSERT to work, but it's nowhere close to a complete feature (e.g. no testing for it whatsoever).

          I've been pulled in other directions for the next week or two, but I should be returning to this by the beginning of July. I'll make a note to revisit the open issues on Brian's project.
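
          As an aside, the "parse the table properties once and reuse the result" idea described above could look roughly like the following minimal sketch. The class name, the property format ("cf:qual,cf:qual"), and the method names are assumptions for illustration only, not code from this patch or from HIVE-7068:

              import java.util.ArrayList;
              import java.util.List;

              // Hypothetical sketch: parse an Accumulo column mapping string such as
              // "cf1:qual1,cf2:qual2" a single time, so the record reader and the
              // predicate-pushdown code can share the result instead of re-parsing
              // the table properties for every task.
              public class ColumnMapping {
                  private final List<String> families = new ArrayList<String>();
                  private final List<String> qualifiers = new ArrayList<String>();

                  public ColumnMapping(String mappingProperty) {
                      for (String entry : mappingProperty.split(",")) {
                          String[] parts = entry.split(":", 2);
                          families.add(parts[0]);
                          qualifiers.add(parts.length > 1 ? parts[1] : "");
                      }
                  }

                  public List<String> getFamilies() { return families; }
                  public List<String> getQualifiers() { return qualifiers; }
              }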

          Carl Austin added a comment -

          Also, note the bug with multiple ranges that I reported on the GitHub page: https://github.com/bfemiano/accumulo-hive-storage-manager/issues/7. I've had a quick look and I think this bug is still present in this patch version, meaning that any query with multiple rowid predicates will actually scan the entire table rather than using the ranges.
          Apologies if I'm wrong, but I've only had a quick look at the code in a text editor and haven't been able to run it to confirm yet.

          Carl Austin added a comment -

          I've got a forked version of this for my own testing, and anything that doesn't need all columns read is slower than it needs to be (by a factor of about 5 when selecting only a single column, for example, in my testing), so I've modified it to fetch only the columns needed. I can't easily create a patch because of how far my version has diverged, but the necessary bit is:

          In the configure method:

                      // Parse the original query string into an AST and walk it to
                      // collect the columns the query actually references.
                      ASTNode node = driver.parse(conf.get("hive.query.string"));
                      node = ParseUtils.findRootNonNullToken(node);
                      findColumns(node, columns);
                      // Restrict the scan to only the column family/qualifier pairs that
                      // are needed; fall back to fetching everything if none were found.
                      Collection<Pair<Text, Text>> pairs = Lists.newArrayList();
                      if (columns.size() > 0) {
                          for (String col : columns) {
                              String[] pair = AccumuloHiveUtils.hiveToAccumulo(col, conf).split("\\|");
                              pairs.add(new Pair<Text, Text>(new Text(pair[0]), new Text(pair[1])));
                          }
                      } else {
                          pairs = getPairCollection(colQualFamPairs, false);
                      }
          

          A new method:

              // Recursively walk the AST, collecting the name of every column
              // (TOK_TABLE_OR_COL node) referenced anywhere in the query.
              public void findColumns(ASTNode node, List<String> columns) {
                  // TODO: this should compare against HiveParser.TOK_TABLE_OR_COL rather than
                  // the literal 784, but that doesn't actually seem to work in my case. This is
                  // a hacky fix and may not work for other versions of Hive.
                  if (node.getToken().getType() == 784) {
                      columns.add(node.getChild(0).getText().toLowerCase());
                  } else {
                      if (node.getChildren() != null) {
                          for (Node child : node.getChildren()) {
                              findColumns((ASTNode) child, columns);
                          }
                      }
                  }
              }
          

          Obviously this isn't perfect yet; it doesn't take into account things like count(1), which returns no columns, so in that case it still fetches everything.

          I've also added something that allows you to configure additional columns as a serde property when creating the table. I've done this because columns used in iterators to calculate new columns may not be mapped in the create statement; otherwise they would not be fetched, and those "calculated" columns would never work.

          Let me know if you'd like any more info.

          Josh Elser made changes -
          Link: This issue relates to HIVE-7068 [ HIVE-7068 ]
          Sean Busbey made changes -
          Component/s: contrib [ 12316610 ]
          Christopher Tubbs made changes -
          Link: This issue relates to ACCUMULO-756 [ ACCUMULO-756 ]
          Keith Turner added a comment -

          Sadly no. It requires Accumulo 1.5+ where libthrift 0.9 is used.

          That's unfortunate. In the future, ACCUMULO-756 may help with this type of issue, as it would make it easier to switch between different versions of thrift.

          Brian Femiano added a comment -

          Sadly no. It requires Accumulo 1.5+ where libthrift 0.9 is used.

          I tested it with the latest trunk Accumulo 1.6 for convenience, and so it could live in trunk.

          Keith Turner added a comment -

          I quickly scanned through the patch and noticed it depends on 1.6-SNAPSHOT. I was curious whether this patch would work against 1.4.

          Brian Femiano added a comment -

          Apply the patch from the base trunk directory to install trunk/contrib/hive-storage-handler.

          For a full summary of current functionality in the patch, see https://github.com/bfemiano/accumulo-hive-storage-manager/blob/master/README.md and the related tutorials.

          To get started: https://github.com/bfemiano/accumulo-hive-storage-manager/wiki/Basic-Tutorial

          I unfortunately can't be at the Hackathon to offer in-person support. I'll try to get the word out.

          Brian Femiano made changes -
          Attachment: ACCUMULO-143.patch [ 12581665 ]
          Brian Femiano made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Affects Version/s: 1.6.0 [ 12322468 ]
          Brian Femiano added a comment -

          I have started this but not yet finished. I plan to make time to finish it very soon.

          Dmitry Vasilenko added a comment -

          Hi Brian,

          Did you have a chance to work on that? We are considering HCatalog->Hive->Accumulo and the work you are doing on Hive->Accumulo sounds very promising.

          Billie Rinaldi added a comment -

          Sounds great!

          Brian Femiano added a comment -

          I am actively working on this. It should take 2 to 3 weeks of my own time to implement and verify.

          Gavin made changes -
          Workflow: no-reopen-closed, patch-avail [ 12642037 ] -> patch-available, re-open possible [ 12671557 ]
          Brian Femiano added a comment -

          I have started to look at implementing this natively within the Hive API. It involves creating a custom InputFormat, SerDe, and StorageHandler similar to the HBaseStorageHandler.

          This would integrate directly into HQL scripts and avoid direct access to Hive JDBC.
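
          For orientation, the overall shape of such a handler, modeled loosely on how HBaseStorageHandler plugs into Hive, might look like the minimal sketch below. The base class and method signatures are assumed from the Hive storage handler API of that era, and the bodies are placeholders rather than code from the attached patch:

              import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
              import org.apache.hadoop.hive.serde2.SerDe;
              import org.apache.hadoop.mapred.InputFormat;

              // Hypothetical skeleton only: the real work lives in the InputFormat and
              // SerDe that translate Accumulo keys/values to and from Hive rows.
              public class AccumuloStorageHandler extends DefaultStorageHandler {

                  public Class<? extends InputFormat> getInputFormatClass() {
                      // Placeholder: return an Accumulo-backed InputFormat here.
                      return null;
                  }

                  public Class<? extends SerDe> getSerDeClass() {
                      // Placeholder: return a SerDe that maps Accumulo rows to Hive columns.
                      return null;
                  }
              }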

          jv added a comment -

          After talking to Jesse Yates and a few others, I'm withdrawing my comment about Culvert being adequate as a Hive driver. While it plans to provide adequate layers to run Hive over Accumulo, it's a bit roundabout. I think we should go about writing a direct Hive driver that does that, and only that.

          jv added a comment - edited

          Currently the https://github.com/booz-allen-hamilton/culvert project does not support Hive, but that seems to be the direction it's going in. Once that's in place, I think it would be adequate to consider this task done.

          Keith Turner created issue -

            People

            • Assignee: Unassigned
            • Reporter: Keith Turner
            • Votes: 5
            • Watchers: 11
