HBase / HBASE-1755

Putting 'Meta' table into ZooKeeper

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.90.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Moving to 0.22.0

        Activity

        Erik Holstad added a comment -

        Did some small testing to see how a node with a lot of children would behave and what the memory usage would be.
        These numbers were produced by running a single ZooKeeper node on my laptop and will be tested further on a bigger cluster
        at the beginning of next week, but I just wanted to get some rough numbers.

        Inserted 10000 children and it didn't seem to cause any issues. Approximate memory usage for this insert seemed to be around 6MB, so about 600B per node, which seems kinda reasonable when looking at the DataNode.java code in ZooKeeper.
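
        A minimal sketch of the kind of test described above, assuming a local ZooKeeper at localhost:2181; the /meta-test parent znode and the child-naming scheme are made up for illustration:

          import java.util.concurrent.CountDownLatch;
          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          public class ManyChildrenTest {
            public static void main(String[] args) throws Exception {
              // Connect to a local, single-node ZooKeeper (connect string is an assumption).
              CountDownLatch connected = new CountDownLatch(1);
              ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> connected.countDown());
              connected.await();

              // Parent znode that will hold all the children (path is illustrative).
              if (zk.exists("/meta-test", false) == null) {
                zk.create("/meta-test", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
              }

              // Create 10000 small children; server heap usage is observed externally (e.g. jmap/jstat).
              for (int i = 0; i < 10000; i++) {
                zk.create("/meta-test/region-" + i, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
              }
              System.out.println("children: " + zk.getChildren("/meta-test", false).size());
              zk.close();
            }
          }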

        stack added a comment -

        "ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data. This data can come in the form of configuration, status information, rendezvous, etc. A common property of the various forms of coordination data is that they are relatively small: measured in kilobytes. The ZooKeeper client and the server implementations have sanity checks to ensure that znodes have less than 1M of data, but the data should be much less than that on average. Operating on relatively large data sizes will cause some operations to take much more time than others and will affect the latencies of some operations because of the extra time needed to move more data over the network and onto storage media. If large data storage is needed, the usually pattern of dealing with such data is to store it on a bulk storage system, such as NFS or HDFS, and store pointers to the storage locations in ZooKeeper." http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperProgrammers.html

        Jonathan Gray added a comment -

        Each HRI, unoptimized, is probably about 400 bytes. If we minimally binary encode it, we're probably talking closer to 100-150 bytes for all the information of a region. Add another 32 bytes of overhead from the object itself, and call it 200 bytes per region at the high end. I don't think historian belongs in ZK at all.

        This is small metadata: several orders of magnitude smaller than 1M, and well below even 1K. These are not large objects.
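
        (As a rough, illustrative calculation combining these figures with Erik's ~600 bytes of per-znode overhead measured above: ~200 bytes of region data plus ~600 bytes of znode overhead is roughly 800 bytes per region, so a 10,000-region cluster would put on the order of 8 MB into the ZK server heap, and a 100,000-region cluster on the order of 80 MB.)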

        I do believe this could be a big win for two reasons. One, HBase has no db-level replication, so requests for a segment of META will always go to a single node (this is the reason we still use some key/value caching at Streamy on top of HBase for the most commonly read rows). ZooKeeper replicates the data across all nodes, so reads are fully distributed. Two, the code dealing with .META. is nasty and has always caused problems. Doing something like an alternate row order (ascending, for example) would be rather easy if done in ZK vs. how we do it now.

        However, META as a special table in HBase does work now (and is not really a bottleneck yet)...

        So I vote to bump further discussion of this to 0.22. I'd like to get to the next release ASAP and there is a beast of a problem to solve (with ZK help) in our current assignment/cluster task/load balancing systems before moving META to ZK, if ever. If there was actual load balancing in HBase that balanced read load, it would help with both META as well as normal tables and potentially remove almost all need for caching outside of HBase.

        stack added a comment -

        HRI should shrink considerably once we have it reference the table descriptor and column descriptors kept elsewhere in ZK. -1 on binary encoding. It needs to be human readable (JSON?) if it's up in ZK. But yeah, the data should be getting smaller.

        (Chatting w/ J-D, we're thinking of dropping historian as a feature ahead of its time that is heavy-duty to keep up in the meantime.)

        Point taken on distribution.

        Agree to moving out of 0.21.

        Jonathan Gray added a comment -

        Not sure why it needs to be human-readable, but I don't have a strong opinion about it. You should be modifying it with the API or shell, not by editing JSON by hand? KV is not "human-readable", but that doesn't mean we don't have a human-readable toString() form, etc.

        stack added a comment -

        Human readable for debugging's sake.

        jiangwen wei added a comment -

        I think ZooKeeper should be enhanced: keep the children under a znode in order, with a comparator that can be specified for each parent znode, so that the client can find which region a key is in directly from ZooKeeper.

        ZooKeeper should also accept binary paths, not only string paths.
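
        ZooKeeper itself keeps children unordered and string-named, so one client-side approximation of this idea is to name each child after a hex-encoded region start key and sort on the client; the /hbase/regions layout below is purely hypothetical:

          import java.util.Collections;
          import java.util.List;
          import org.apache.zookeeper.ZooKeeper;

          // Children of /hbase/regions are named by a hex-encoded region start key (a made-up layout;
          // hex encoding preserves byte order). ZooKeeper returns children in no particular order,
          // so the client sorts them and picks the last start key <= the row key.
          public class RegionLookup {
            public static String findRegionChild(ZooKeeper zk, String hexRowKey) throws Exception {
              List<String> children = zk.getChildren("/hbase/regions", false);
              Collections.sort(children);
              String candidate = null;
              for (String child : children) {
                if (child.compareTo(hexRowKey) <= 0) {
                  candidate = child;   // last region whose start key is <= the row key
                } else {
                  break;
                }
              }
              return candidate;
            }
          }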

        jiangwen wei added a comment -

        Although ZooKeeper was not designed to be a general database or large object store, there is no such limitation in the consensus algorithm behind ZK. ZK could be enhanced to be a metadata database, and the effort would be very small:
        we would only store a large number of nodes in ZK, and the data associated with each node would be very little.

        Ted Yu added a comment -

        In HBASE-3676, region load is reported to the master through heartbeats. HBASE-1502 removes heartbeats.
        So potentially ZK may host more information about the regions.

        ryan rawson added a comment -

        I was originally excited about this but I have recently become not a fan. I think we should not store anything but temporary data in ZK. The reason is that there are several really good properties we have now that we'd lose:

        • Easy backup: just stop HBase, copy /hbase, and you have a complete backup.
        • Snapshot abilities: right now, if you were to take an FS-level snapshot, you'd have a perfect point-in-time backup.
        • There are solid tools for managing our HBase files but none for ZK; this is a "minor" issue, but the code would still need to be written anyway.
        Jonathan Gray added a comment -

        I generally agree that we should store only temporary data in ZK, but I see META as largely temporary.

        Table/region metadata is already persisted on HDFS (we don't properly update it, but that can be fixed without much trouble). And we have plans to move schema and configuration information into ZK for online changes, so at least on a running cluster, we'll be depending on ZK for region configuration.

        Otherwise, META is largely for locations.

        I also think the possibility exists to keep a META region but maintain region locations in ZK.

        In general, the special casing and exception handling around the reading and updating of META is extraordinarily painful both in the master and in the regionservers.
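
        A minimal sketch of the client side of that possibility, assuming a hypothetical /hbase/region-locations/<encoded-region-name> znode whose data is the hosting server's host:port; no such layout exists in HBase, this is only an illustration:

          import org.apache.zookeeper.ZooKeeper;

          // Resolve a region's location from ZK while the META data itself stays in a META region.
          // The /hbase/region-locations layout is hypothetical.
          public class LocationClient {
            public static String locate(ZooKeeper zk, String encodedRegionName) throws Exception {
              String path = "/hbase/region-locations/" + encodedRegionName;
              // true: leave a watch so the client is notified if the region moves
              // and knows to look the location up again.
              byte[] hostAndPort = zk.getData(path, true, null);
              return new String(hostAndPort, "UTF-8");
            }
          }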

        stack added a comment -

        Moving out of 0.92.0. Pull it back in if you think different.

        stack added a comment -

        We ain't going to do this. The tendency now is away from ZK rather than toward deeper investment.

        Lars Hofhansl added a comment -

        Not sure I agree with that tendency, though.
        The problem is not ZK as such (IMHO), but more that we keep related state in many places. Moving all of meta to ZK - for example - would reduce that state duplication and be helpful.

        Honghua Feng added a comment -

        I agree with Lars Hofhansl in some sense. ZK is not the root of all evil; it has its own recommended use pattern, and it is (very) suitable for scenarios where:

        1. persistent (hierarchical) storage is needed, and that storage is the only holder of some truth
        2. the storage size is small
        3. access to the storage is sparse
        4. the watch/notify mechanism is a plus for coding convenience, but the code using ZK should be inherently idempotent, caring only about the final state when it is notified (state-machine code/logic cares about the whole sequence of state transitions, so ZK is not a good fit for it); see the sketch after this comment

        According to the above:

        1. region location info in the META table is not suitable for ZK: its size can be very large
        2. region assignment status info is not suitable for ZK: 1) restart of a big cluster with a large number of regions (say 10K-100K) can lead to very heavy/frequent reads and writes to ZK during the restart phase; 2) assignment code/logic is more like a state machine: it expects full knowledge of the state transitions, with no missed state-change event; 3) assignment status is duplicated in both master memory and ZK, and ZK is not the only truth holder at all times (actually it's prohibitive to consult ZK as the only truth for every such query; currently it serves more to recover assignment status when the master fails, i.e. it seems to have been introduced so the assignment process can survive a master failure, right?)
        3. replication info is quite suitable for ZK, since it matches all of the above characteristics

        Of course, if we embed a consensus lib in the master, we effectively have an inherent ZK within the master ensemble; that way we can store all the different kinds of status/info with their different access patterns in this 'inherent' ZK within the master (except region location info, which is too big to keep in memory).

        In an ideal world where the master never dies, we wouldn't use ZK to store the status/info currently stored there, right? The master's memory would be the only truth holder. But the master can die, so we need to duplicate the status/info in both the master and ZK, which introduces the info-duplication problem. The duplication can be avoided, but at the cost of efficiency: we would then always have to go to ZK rather than memory, which is prohibitive for heavily accessed data. There is no duplication problem if we always use ZK as the truth (we actually treat ZK as the only truth this way for replication info; the replication info is small and access is sparse, so we can afford to always go to ZK for it, which is why I think ZK is good enough for replication info).
        By embedding ZK (a consensus lib) within the master, ZK and the master's memory combine into one place: no duplicated info, no access-efficiency problem, and we still have persistence in case of master failure...
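
        A minimal sketch of the 'final state only' watcher pattern from point 4 above, with a hypothetical /hbase/example-state znode: the handler re-reads whatever the current data is and re-arms the watch, instead of trying to track every intermediate transition:

          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          // Idempotent watch handling: ZooKeeper watches are one-shot and notifications can be
          // coalesced, so the handler only converges to the current state and re-registers the
          // watch. The /hbase/example-state path is hypothetical.
          public class FinalStateWatcher implements Watcher {
            private final ZooKeeper zk;

            public FinalStateWatcher(ZooKeeper zk) { this.zk = zk; }

            public void process(WatchedEvent event) {
              try {
                // Re-arm the watch and read whatever the state is *now*;
                // intermediate changes between notifications are deliberately ignored.
                byte[] current = zk.getData("/hbase/example-state", this, null);
                apply(current);
              } catch (Exception e) {
                // On session/connection trouble the caller is expected to re-sync from scratch.
              }
            }

            private void apply(byte[] state) {
              // Converge to the observed state; must be safe to call repeatedly.
            }
          }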

        stack added a comment -

        The problem is not ZK as such (IMHO), but more that we keep related state in many places. Moving all of meta to ZK - for example - would reduce that state duplication and be helpful.

        IMO, the above is going in the wrong direction, if only because ZK is on the other end of a network connection, so it will never be as good a source of authoritative state as in-memory state inside the process that is actually calling the shots.

        Honghua Feng, you nailed it.


          People

          • Assignee: Unassigned
          • Reporter: Erik Holstad
          • Votes: 0
          • Watchers: 9