HBase / HBASE-3851

A Random-Access Column Object Model

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.92.0
    • Fix Version/s: None
    • Component/s: Client
    • Labels:

      Description

      By design, a value in HBase is an opaque, atomic byte array. In theory, any type can be represented in terms of such unstructured yet indivisible units. However, as the complexity of the type increases, so does the need to access it in parts rather than as a whole, so that one can update part of a value without first reading all of it. This calls for transparency in the type of data being accessed.

      To that end, we introduce here a simple object model where each part maps to an HTable column and the value thereof. Specifically, we define a ColumnObject interface that denotes an arbitrary type comprising properties, where each property is a <name, value> tuple of byte arrays. In essence, each property maps to a distinct HBase KeyValue. In particular, the property's name maps to a column, prefixed by the qualifier and the object's identifier (assumed to be unique within a column family), and the property's value maps to the KeyValue#getValue() of the corresponding column. Furthermore, the ColumnObject is marked as a RandomAccess type to underscore the fact that its properties can be accessed individually.

      For starters, we provide three concrete objects - a ColumnMap, ColumnList and ColumnSet that implement the Map, List and Set interfaces respectively. The ColumnMap treats each Map.Entry as an object property, the ColumnList stores each element against its ordinal position, and the ColumnSet considers each element as the property name (as well as its value). For the sake of convenience, we also define extensions to the Get, Put, Delete and Result classes that are aware of and know how to deal with such ColumnObject types.
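The mapping described above can be illustrated with a minimal sketch. This is not the code in the attached patch: the class name ColumnObjectSketch, its methods, and the ":" separator between the object's identifier and the property name are all assumptions made for illustration, and strings stand in for the byte arrays HBase actually stores.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names and layout are not taken from the patch.
public class ColumnObjectSketch {
    // Assumed separator between the object's identifier and the property name.
    static final String SEP = ":";

    // Builds the column name under which one property would be stored.
    static String column(String objectId, String propertyName) {
        return objectId + SEP + propertyName;
    }

    // Map: each entry becomes one property (key = name, value = value).
    static Map<String, String> fromMap(String objectId, Map<String, String> map) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : map.entrySet()) {
            columns.put(column(objectId, e.getKey()), e.getValue());
        }
        return columns;
    }

    // List: each element is stored against its ordinal position.
    static Map<String, String> fromList(String objectId, List<String> list) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < list.size(); i++) {
            columns.put(column(objectId, Integer.toString(i)), list.get(i));
        }
        return columns;
    }

    // Set: each element serves as both the property name and its value.
    static Map<String, String> fromSet(String objectId, Set<String> set) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String element : set) {
            columns.put(column(objectId, element), element);
        }
        return columns;
    }

    public static void main(String[] args) {
        Set<String> interests = new LinkedHashSet<>(Arrays.asList("hiking", "chess"));
        // prints {interests:hiking=hiking, interests:chess=chess}
        System.out.println(fromSet("interests", interests));
    }
}
```

Because every property lands in its own column, a single property can be written or deleted without touching the others.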

      1. HBASE-3851.patch
        54 kB
        Karthick Sankarachary

        Activity

        Lars Hofhansl added a comment -

        Closing, as suggested.
        @Karthik: Do you want to attach the github link you mentioned?

        stack added a comment -

        Moving out of 0.92. Move it back in if you think differently.

        Karthick Sankarachary added a comment -

        Basically, the goal here is to reduce the number of round trips between the client and region servers. By way of example, let's say we have a table of user profiles, where the profile includes the user's interests (a set of things they like) and portfolio (a map of stock symbol to price paid). If we put their interests (or portfolio) in a single column, then every time we want to add/remove an interest (stock), we'll most likely need to read that column prior to updating it. On the other hand, if we break down the interests (portfolio) into multiple columns, one for each element in the set (map), then that will allow us to add/remove elements without reading the entire collection first.

        Having said that, I took a look at the object-mappings proposed for some of the other NoSQL databases, and they all happen to live outside of the project proper. In light of that, I'll do as you suggested, and put this on github. If you'd like to revisit this down the road, please feel free to re-open this.
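The round-trip argument in the comment above can be sketched with an in-memory stand-in for a single row. Everything here is hypothetical: FakeRow is not an HBase class, its counter only simulates client/server round trips, and the comma-separated encoding and the "interests:" column prefix are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; no HBase client classes are used.
public class RoundTripSketch {
    // In-memory stand-in for one table row; each call counts as one simulated round trip.
    static class FakeRow {
        final Map<String, String> columns = new HashMap<>();
        int roundTrips = 0;

        String get(String column) { roundTrips++; return columns.get(column); }
        void put(String column, String value) { roundTrips++; columns.put(column, value); }
        void delete(String column) { roundTrips++; columns.remove(column); }
    }

    // Whole set packed into one column: must read, modify, then write back.
    static void addInterestSingleColumn(FakeRow row, String interest) {
        String current = row.get("interests");
        Set<String> set = new LinkedHashSet<>();
        if (current != null && !current.isEmpty()) {
            for (String s : current.split(",")) {
                set.add(s);
            }
        }
        set.add(interest);
        row.put("interests", String.join(",", set));
    }

    // One column per element: a blind write, no prior read needed.
    static void addInterestPerColumn(FakeRow row, String interest) {
        row.put("interests:" + interest, interest);
    }

    // One column per element: a blind delete, no prior read needed.
    static void removeInterestPerColumn(FakeRow row, String interest) {
        row.delete("interests:" + interest);
    }

    public static void main(String[] args) {
        FakeRow packed = new FakeRow();
        addInterestSingleColumn(packed, "hiking");
        addInterestSingleColumn(packed, "chess");

        FakeRow split = new FakeRow();
        addInterestPerColumn(split, "hiking");
        addInterestPerColumn(split, "chess");

        // prints "4 vs 2": two adds cost a read plus a write each when packed,
        // but only one write each when split across columns
        System.out.println(packed.roundTrips + " vs " + split.roundTrips);
    }
}
```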

        Todd Lipcon added a comment -

        Sorry, I don't quite understand the motivation for this. As entirely client-side code, this seems like it might be a useful abstraction for some folks, but it could live outside of HBase proper (e.g., as a GitHub project?)


          People

          • Assignee:
            Karthick Sankarachary
            Reporter:
            Karthick Sankarachary
          • Votes:
            0
            Watchers:
            3
