  1. Accumulo
  2. ACCUMULO-2883

Add API method(s) that support fetching currently assigned locations for tablets

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: client
    • Labels:
      None

      Description

      TabletLocator already exists, but isn't officially a part of the "public API" and is clunky for users to invoke. In trying to co-locate external processes with the tabletservers that are hosting some data, it would be nice to have some means that users can invoke that will return them these assignments.

      Memory concerns are an issue for tables with many splits (e.g. avoiding creating a Set of 100k tablet locations for a table), but we also want to provide the ability to ask pointed questions. Likely building something that accepts a Range (or Collection<Range>) would be best.
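
As a rough illustration of the kind of "pointed question" described above, the sketch below resolves a single row range to the servers hosting the overlapping tablets, without building a collection of every tablet location in the table. It is a self-contained model, not Accumulo code: the class `TabletLookupSketch`, its `locate` method, and the `host:port` strings are all hypothetical, and rows are compared as Java strings rather than as bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical model of a range-based location lookup; not part of Accumulo. */
final class TabletLookupSketch {
  // Maps each tablet's inclusive end row (its split point) to the hosting server.
  private final NavigableMap<String,String> serverByEndRow = new TreeMap<>();
  // Server hosting the last tablet, which has no end row.
  private final String lastTabletServer;

  TabletLookupSketch(Map<String,String> serverByEndRow, String lastTabletServer) {
    this.serverByEndRow.putAll(serverByEndRow);
    this.lastTabletServer = lastTabletServer;
  }

  /** Returns the servers of the tablets overlapping rows [startRow, endRow], in tablet order. */
  List<String> locate(String startRow, String endRow) {
    List<String> servers = new ArrayList<>();
    // Only tablets whose end row is >= startRow can overlap the range.
    for (Map.Entry<String,String> e : serverByEndRow.tailMap(startRow, true).entrySet()) {
      servers.add(e.getValue());
      if (e.getKey().compareTo(endRow) >= 0) {
        return servers; // this tablet contains endRow, so the scan stops here
      }
    }
    // The range extends past the final split point into the last tablet.
    servers.add(lastTabletServer);
    return servers;
  }

  public static void main(String[] args) {
    Map<String,String> hosted = new TreeMap<>();
    hosted.put("g", "ts1:9997");
    hosted.put("m", "ts2:9997");
    hosted.put("t", "ts3:9997");
    TabletLookupSketch sketch = new TabletLookupSketch(hosted, "ts4:9997");
    System.out.println(sketch.locate("h", "p"));
  }
}
```

The lookup only touches the tablets that overlap the requested range, which is the property that matters for tables with very many splits. A Collection<Range> variant could call the same lookup once per range.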

          Activity

          afuchs Adam Fuchs added a comment -

          We also need to consider security implications here. The tablet splits are automatically generated from row identifiers in the data, so knowledge of tablet splits carries some information about keys in the table. That information is already available to clients under the covers, but there may be room for some sort of permission model on metadata if we plan on exposing this more publicly.

          elserj Josh Elser added a comment -

          there may be room for some sort of permission model on metadata if we plan on exposing this more publicly.

          I was thinking about this problem this morning. Right now, anyone with a minor bit of cunning could read the metadata table and get the same information (because we can't restrict read access to it), right? Is that a fundamental problem we just have to accept? How could an extra permission layer on top help us increase the total security of the system? I'm not sure at present.

          kturner Keith Turner added a comment -

          Christopher Tubbs has proposed a model where a user can only read portions of the metadata table for tables to which they have read or write access.

          elserj Josh Elser added a comment -

          user can only read portions of the metadata table for tables to which they have read or write access.

          That helps to isolate users from knowing anything about tables that they don't have permission to access, but it still doesn't help with the potential leak of sensitive info in rowIDs that the user might not otherwise have access to see.

          Eric Newton has also noted that you could just as easily make calls against the tserver to get at the tablet boundaries (the response for "no data" is different from the response for "tablet not hosted here").

          It doesn't seem like there's a universal solution that doesn't involve delegation through some other gateway. The thrift proxy server is one example which would do this (again, Eric Newton pointing this out).

          afuchs Adam Fuchs added a comment -

          Josh Elser One way to look at this is that there is a certain amount of trust given to any app that is allowed to access Accumulo. Handling Accumulo's metadata is included in that trust. A permission model protecting metadata might encourage apps not to simply pass that information on but first consider the potential sensitivity. Sorry to speak philosophically, but I don't have a better solution either in the general case.

          kturner Keith Turner added a comment -

          but it still doesn't help the potential leak of sensitive info for rowIDs

          Right, it does not address that issue. It's still a nice improvement, in my opinion, to limit metadata access for tables you cannot read. Also, the tserver could give a "cannot read" exception instead of a "not serving tablet" exception in the other case you mentioned.

          elserj Josh Elser added a comment -

          Adam Fuchs, no, that's a good way to approach it. Thanks.

          Generally, I was thinking about the larger problem that has been mentioned from time to time: inadvertent "leakage". One case I was thinking about before was the logging of KeyExtents for things like assignments, compactions, merges, etc. I'm trying to work through in my head whether there's any practical change that we could make to prevent this. I also just remembered that users already have the ability to fetch splits for a table.

          It seems like it can be summarized as: if you have access to the Java API, you're going to be able to extract "information" one way or another. At the same time, there's the common "phrase" that security is never solved by a single component but the interaction of many moving parts.

          elserj Josh Elser added a comment -

          Right, it does not address that issue. It's still a nice improvement, in my opinion, to limit metadata access for tables you cannot read. Also, the tserver could give a "cannot read" exception instead of a "not serving tablet" exception in the other case you mentioned.

          Agreed. Thanks for clarifying.

          ctubbsii Christopher Tubbs added a comment -

          It should also be mentioned that we already have a public API for getting split points. This feature adds no additional security concerns about data leakage, unless we consider the location of the data to be additionally sensitive, and I can't imagine why we should, since the whole point of Accumulo is to manage and reveal that information to a client so that it can efficiently query the distributed system. If you don't want to reveal the location of a tablet, Accumulo (and the whole BigTable model) is not your solution (at least, not without an intermediary such as a proxy or web service).

          rfecher Rich Fecher added a comment -

          If it would help to understand a third-party use case, we ran into this need for a tablet locator as part of the public API. We needed an input format that behaves significantly differently from Accumulo's public API input format (i.e., inheritance or delegation would not suffice in our case), but it was still very important to us to try to co-locate processing. So, to attempt to co-locate with the tablets, we used the TabletLocator and, frankly, a snippet of code very similar to the internals of Accumulo's input format (https://github.com/ngageoint/geowave/blob/master/geowave-accumulo/src/main/java/mil/nga/giat/geowave/accumulo/mapreduce/input/GeoWaveInputFormat.java#L410).

          However, to support changes between 1.5.1 and 1.6.0 we ended up using conditional compilation provided by the maven munge plugin. This is less than ideal, and we are looking forward to hooks in the public API for this.

          elserj Josh Elser added a comment -

          Thanks, Rich Fecher. That's helpful information. Sorry for the pain between 1.5 and 1.6 – hopefully we can make things better for 1.7.

          rweeks Russ Weeks added a comment -

          Another 3rd-party use case: I'd love to see functionality similar to HBase's RegionLocator in order to support Accumulo as a storage back-end for Titan.

          Titan seems to have some optimizations to put KV pairs on a local partition of the key space, if possible. I'm not sure if this makes sense for Accumulo (or HBase for that matter)... just trying to follow precedent.

          kturner Keith Turner added a comment -

          I plan on working on a patch for this for 1.8

          elserj Josh Elser added a comment -

          Wooo!

          githubbot ASF GitHub Bot added a comment -

          GitHub user keith-turner opened a pull request:

          https://github.com/apache/accumulo/pull/53

          ACCUMULO-2883 Added locating tablets to API

          I have modified Gora to use this update in a [branch][2]. I plan to experiment with modifying [geowave][1] to use this new API before pushing to master.

          [1]: https://github.com/ngageoint/geowave
          [2]: https://github.com/keith-turner/gora/tree/GORA-414-180

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/keith-turner/accumulo ACCUMULO-2883

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/accumulo/pull/53.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #53


          commit f89e386148173dbd902765da711c0c553fd37fa6
          Author: Keith Turner <kturner@apache.org>
          Date: 2015-11-13T04:31:22Z

          ACCUMULO-2883 Added locating tablets to API


          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on the pull request:

          https://github.com/apache/accumulo/pull/53#issuecomment-162987572

          I was able to get geowave working with these changes ([my branch](https://github.com/keith-turner/geowave/tree/use_locator_api)). I was able to successfully run the geowave test BasicMapReduceIT that @rfecher told me about in the geowave chat room.

          I'm happy with this PR and will push soon.

          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner closed the pull request at:

          https://github.com/apache/accumulo/pull/53

          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on the pull request:

          https://github.com/apache/accumulo/pull/53#issuecomment-163013460

          :+1: Excellent! Great work, Keith.


            People

            • Assignee:
              kturner Keith Turner
              Reporter:
              elserj Josh Elser
            • Votes:
              0
              Watchers:
              6

            Time Tracking

            • Estimated:
              Not Specified
              Remaining:
              0h
              Logged:
              20m