Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Locality groups are a neat feature, but there is no reason to limit partitioning to column families. Data could be partitioned based on any criteria. For example, if a user is interested in querying recent data and aging off old data, partitioning locality groups based on timestamp would be useful. This could be accomplished by letting users specify a partitioner plugin that is used at compaction and scan time. Scans would need the ability to pass options to the partitioner.
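
To make the idea concrete, here is a minimal sketch of what a timestamp-based partitioner might compute. This is purely illustrative Python (Accumulo plugins would be Java, and `age_partitioner` and its parameters are hypothetical names, not a proposed API):

```python
# Hypothetical sketch: assign each key to a locality group by age.
# A real plugin would be consulted at compaction time to route
# key/value pairs into group-specific file sections, and a scan could
# pass options (e.g. "newest group only") to read a subset of groups.

def age_partitioner(key_timestamp_ms, now_ms, boundaries_ms):
    """Return the index of the locality group for a key.

    boundaries_ms: ascending age cutoffs, e.g. [one_day, one_month].
    Group 0 holds the newest data; the last group holds everything older.
    """
    age = now_ms - key_timestamp_ms
    for group, cutoff in enumerate(boundaries_ms):
        if age < cutoff:
            return group
    return len(boundaries_ms)  # oldest data falls in the final group

DAY_MS = 24 * 60 * 60 * 1000
MONTH_MS = 30 * DAY_MS

now = 1_000 * MONTH_MS
# A key written an hour ago lands in the "recent" group 0:
print(age_partitioner(now - 60 * 60 * 1000, now, [DAY_MS, MONTH_MS]))  # 0
# A key written a year ago lands in the catch-all group 2:
print(age_partitioner(now - 12 * MONTH_MS, now, [DAY_MS, MONTH_MS]))  # 2
```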

        Activity

        Keith Turner added a comment -

        I will put together an initial design document and post it for review.

        Aaron Cordova added a comment -

        I know this might sound crazy, but 'more general' doesn't always mean better. There is a cost to generality, and it is complexity. For example, several people have read the MapReduce paper and said 'oh, you can easily create a more general computation framework than that, one that includes message passing, etc.' Those people miss the point of why MapReduce is so widely adopted and why more general systems like MPI are not: its simplicity.

        In this case, the cost is that the user has to now decide at what level to group data to get the locality they desire, and pass options where they normally might not. Users already have the ability to use the column family to store whatever data they want, knowing that the data they store in column families can be used for physical partitioning.

        So I suppose I'm looking for more justification before adding more complexity to an admittedly already more general/complex implementation of BigTable.

        jv added a comment -

        This feature by itself is nice, but it doesn't really do a whole lot without supporting expressions ( https://issues.apache.org/jira/browse/ACCUMULO-164 ). But once you start allowing expressions, you then need to support orders of locality, because data can only exist in a single locality group. This can increase complexity for the user, or at the very least will make the API more kludgy.

        I'm all for making things pluggable, but we need to design it so that things are not easily borked by the user. This includes either forcing the interface to only handle one locality group at a time or redesigning rfile to allow writing to different locality groups. We also need to make sure we can optimize scans of data pre-dating the locality group. Right now, it's really nice that data can only belong to one locality group. This change would mean all data must be checked to see if it belongs in the new locality group, which would be a big performance hit. Perhaps we should encourage a majc after applying a new locality group.

        Billie Rinaldi added a comment -

        The core issue this would address is that many users want to be able to scan over timestamp ranges efficiently. Keith's suggestion is a relatively straightforward way to support this. It also sounds like it could get complex if we're not careful, but I think it has potential.

        Adam Fuchs added a comment -

        I had another thought on this: locality groups are good for features that are in a relatively constant, low-cardinality set with a fairly dense distribution across the primary partitioning dimension. Also, queries must be aligned with the locality group frequently enough to amortize the cost of that partitioning. This means that the current column family-based locality groups only really help when cells in sorted order frequently oscillate between locality groups. I want to say that this type of feature tends to be something that is explicitly modeled based on how the user wants to query their data. If the user decides to put this information in the row or the column qualifier, could they just as easily put it into the column family? By the way, expressions like John mentions in ACCUMULO-164 help to group high-cardinality features into a low-cardinality set of groups, so I think we're on the same page there.

        Partitioning based on the timestamp is an interesting consideration. In this case, you would want a small number of ranges of timestamps to be "active" (not aged off yet) at any one time. Timestamps are a bit special, though, because they tend to be inserted in increasing order. Instead of using the locality group mechanism, we might achieve better performance by modifying the major compaction selection algorithm to avoid merging files that have very different timestamp ranges. Keeping track of timestamps on a per-file or per-block basis would support bulk filtering, and would be as (or more) efficient than locality groups. Might this be another approach to consider?
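
A rough sketch of this alternative compaction selection, assuming per-file (min, max) timestamp metadata. The names here are hypothetical and this is not Accumulo's actual selection code:

```python
# Illustrative sketch: only merge files whose timestamp ranges overlap,
# so files holding very different time periods stay separate and old
# files can later be aged off or skipped wholesale.

def overlaps(a, b):
    """True if two (min_ts, max_ts) ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def select_compaction_group(files):
    """Greedily pick files whose timestamp ranges form one overlapping
    cluster, starting from the file with the newest data.

    files: list of (name, (min_ts, max_ts)) pairs of per-file metadata.
    """
    if not files:
        return []
    ordered = sorted(files, key=lambda f: f[1][1], reverse=True)
    group = [ordered[0]]
    span = ordered[0][1]
    for f in ordered[1:]:
        if overlaps(span, f[1]):
            group.append(f)
            # widen the cluster's combined timestamp range
            span = (min(span[0], f[1][0]), max(span[1], f[1][1]))
    return [name for name, _ in group]

files = [("new", (90, 100)), ("mid", (85, 95)), ("old", (0, 10))]
print(select_compaction_group(files))  # ['new', 'mid'] — "old" is left alone
```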

        Like Aaron, I think we need some more details on envisioned scenarios in which more generic locality groups would be useful before we jump too deeply into implementing them.

        Todd Lipcon added a comment -

        FWIW, in HBase, we maintain timestamp min/max per HFile, and use that to cull files at query time if the query has a timestamp range predicate. As of fairly recently we also support culling these files at compaction time without having to rewrite them, if a file completely falls out of the configured table TTL. (variously related to HBASE-5199, HBASE-5274, HBASE-5010, HBASE-2265)
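
An illustrative sketch of the two behaviors Todd describes. HBase keeps these stats in HFile metadata; the function names below are mine, not HBase's:

```python
# Sketch: cull files by per-file timestamp range. Assumes each file
# carries (min_ts, max_ts) metadata, as per-HFile stats do in HBase.

def files_for_scan(file_ranges, query_min, query_max):
    """Query-time culling: keep only files whose timestamp range
    intersects the query's range predicate; the rest are never opened."""
    return [name for name, (lo, hi) in file_ranges
            if lo <= query_max and query_min <= hi]

def fully_expired(file_max_ts, now_ms, ttl_ms):
    """Compaction-time culling: True when every cell in the file is
    older than the table TTL, so the file can be dropped without
    being rewritten."""
    return file_max_ts < now_ms - ttl_ms

ranges = [("a", (0, 10)), ("b", (20, 30))]
print(files_for_scan(ranges, 15, 25))   # ['b']
print(fully_expired(10, 100, 50))       # True — file's newest cell is past TTL
```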

        I also somewhat agree with Aaron's sentiment above - these timestamp optimizations were pretty easy to do in HBase because timestamp is a first-class feature instead of something implemented by a more general framework.

        Keith Turner added a comment -

        A stab at a design for this possible feature.

        Keith Turner added a comment - edited

        I am thinking that keeping a min/max timestamp per file may satisfy some use cases, but not all. It would certainly be helpful. The compaction algorithm may need to be modified as Adam suggested to make it more effective. The way major compaction currently works in Accumulo, older data will eventually end up in the largest file. If your goal is to avoid this file under certain circumstances, the user has no explicit control over that. Also, if you want to age off older data, you will probably still need to read this entire file to do that.

        If a user wants to scan the last 6 months of data, for example, and the largest file overlaps this time range but only 10% of the data in the file matches the range, then a lot of data needs to be filtered. Does HBase do anything special to deal with this case?

        Why limit locality groups to only column families?

        • Increases model complexity. I think this is true. The complexity of the locality group model itself is not increased: if you understand partitioning on column families, you will easily understand the concept of partitioning on any part of the key. It certainly does increase the complexity of the BigTable model as a whole, though, and it would certainly give users more rope to hang themselves. Personally I am not opposed to this.
        • Increases code complexity. I do not think this is true. This would actually simplify the code and make this functionality much easier to test in isolation. I have found this with iterators: they dramatically decreased the complexity of the scan code. When iterators were first introduced, the scan loop was starting to get fairly complex. This seems a lot cleaner than customizing the current code to meet needs. Of course, end users may not care about the complexity of the Accumulo source code. They just want it to solve their problems.
        • There are no compelling use cases. These must exist. I think the original time-based locality group is one; is there a better, simpler way to achieve this? That would remove this use case. The HBase design is simpler in terms of the model, but the code sounds more complex. Also, that model does not give the user explicit control w/o allowing them to configure the compaction process in some complex way.
        Todd Lipcon added a comment -

        If they want to scan the last 6 months of data, for example, and the largest file overlaps this time range but only 10% of the data in the file matches the range, then a lot of data needs to be filtered. Does HBase do anything special to deal with this case?

        We have a setting for "max file size" beyond which a file won't be included in compactions. Setting that to a few GB would be prudent in a case where most of your queries are time-bound. Of course, there's an associated cost against scanners which aren't time-bound, as they'll have to merge all files, but in some cases it's fine.
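
Sketched below in illustrative Python (the function and parameter names are mine, not HBase's configuration keys):

```python
# Sketch: a size cap on compaction candidates. Files above the cap are
# never selected, so a large old file stops being rewritten; scans that
# are not time-bound still have to merge all files, capped or not.

def compaction_candidates(files, max_file_size_bytes):
    """files: list of (name, size_bytes). Return names eligible to compact."""
    return [name for name, size in files if size <= max_file_size_bytes]

files = [("big-old-file", 5_000_000_000), ("recent-flush", 40_000_000)]
print(compaction_candidates(files, 3_000_000_000))  # ['recent-flush']
```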

        You can see more discussion about this in HBASE-4717.

        Aaron Cordova added a comment - edited

        Just another comment on the type of complexity I'd like to avoid.

        Specifically, it's good to have orthogonality in your features.

        Locality groups are for physical partitioning, timestamps are for data versioning. If someone wants to partition their data into time-ranges they are free to do so, using locality groups. They simply have to decide on what their column families will be, building some information about time ranges into them, and assign them to locality groups.

        Another kind of partitioning happens with row IDs, allowing accesses to a small range of rows to involve one or a small number of servers. This kind of partitioning is nice because it's automatic: one doesn't have to worry about whether the ranges are the right granularity, since Accumulo splits based on size.

        Now we're talking about adding a third way to physically split data, timestamps, and basing it on something designed for some other purpose, which is data versioning.

        Timestamps do allow users to only get data for a particular time period, but the intent is to limit the data after the rows and columns have been selected, or maybe for short scans. I'm guessing your users want to scan over a lot of rows and columns that fall within a particular time period. For this they should build time ranges into their rows or columns.

        There are already two ways to let users do this, I think adding a third will just add additional complexity and could interfere with the original versioning functionality. Not necessarily code complexity, rather, complexity in the users' minds as to how to model their data.
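
For instance, one common way to build time ranges into rows is a fixed-width time-bucket prefix, so rows sort by time period first. A hedged sketch (`bucketed_row` is a hypothetical helper, not an Accumulo API):

```python
DAY_MS = 24 * 60 * 60 * 1000

def bucketed_row(ts_ms, row):
    """Prefix the row with a zero-padded day number so that keys sort
    chronologically by day; a scan over a time period becomes a scan
    over a contiguous range of row prefixes."""
    day = ts_ms // DAY_MS
    return f"{day:08d}:{row}"

print(bucketed_row(0, "user42"))       # '00000000:user42'
print(bucketed_row(DAY_MS, "user42"))  # '00000001:user42' — sorts after day 0
```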

        Billie Rinaldi added a comment -

        I agree entirely that it would be better if timestamps were used solely for versioning. However, because the API allows users to set timestamps manually, they often do; and they expect them to be used like timestamps. If only this part of the key had been named "version" instead.

        Aaron Cordova added a comment -

        Even if users use the server-provided timestamps or their own, the timestamp still falls after the row and column, and is used the same way: to limit values after the rows and columns have been identified.

        To me it seems as if this happened, as a little play:

        BigTable Guys: look you can physically partition your data automatically using the rows!

        Users: Great! That works, but maybe I want an additional, secondary partitioning?

        BG: hmm. ok, how about you can also partition on the column family? It's the next item in the hierarchy, doesn't add too much complexity, pretty straightforward. Just specify them into groups called locality groups and I think we can keep this under control.

        Users: Yay! You guys rock!

        BG: You're welcome.

        Other users: Hey, locality groups are cool, but can I partition on column qualifiers?

        BG: why are rows and column families insufficient?

        OU: well, I don't know, I just really like to slice things every way possible.

        BG: sigh ..

        Yet other users: Wait, what about timestamps? You know what's more general than partitioning on a few elements of the data model? Partitioning on ALL the elements of the data model! So sweet. More general means more better!

        BG: I'm quitting to go work at Facebook.

        Keith Turner added a comment -

        Users do want this capability. They keep asking for it. We turn around and tell them to sort their data differently. They don't always like that answer. The intent of this is to meet a user need.

        Storing temporal information in the column family is a possibility. It would work well for some cases, like having two locality groups: one that's the current month and another that's everything else. You put the month in the column family and reconfigure the locality groups every month.

        However, if you would like something like LG1 = < day old, LG2 = < month old, LG3 = < year old, this would not be possible w/ the current locality group implementation. ACCUMULO-164 may make it possible: store time to the day in the column family. John pointed out one problem w/ this: it's hard to automatically determine that patterns match disjoint sets. I need to think through ACCUMULO-164 some more and see what the possible gotchas are.
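
The month-in-the-column-family workaround can be sketched like this (illustrative Python; `month_family` and the config shape are hypothetical, not Accumulo's locality group configuration format):

```python
from datetime import datetime, timezone

def month_family(base_family, ts_ms):
    """Prefix a column family with the key's month, e.g. '2012-03.edge',
    so one month's data can be mapped to its own locality group."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"{dt:%Y-%m}.{base_family}"

def current_group_config(base_family, now_ms):
    """Map this month's families to a 'recent' group; all other
    families fall through to the default group. This mapping must be
    regenerated and reapplied every month."""
    return {"recent": [month_family(base_family, now_ms)]}

print(month_family("edge", 0))  # '1970-01.edge'
```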

        If you have to duplicate the data in the timestamp into your column family to accomplish your goals, does this indicate a problem with the model? I do not think it's clean, but it's OK w/ me.


          People

          • Assignee:
            Keith Turner
            Reporter:
            Keith Turner
          • Votes:
            0
            Watchers:
            1

            Dates

            • Created:
              Updated:
