HBase / HBASE-10070

HBase read high-availability using timeline-consistent region replicas

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0, 1.2.0
    • Component/s: None
    • Labels: None

      Description

      In the present HBase architecture, it is hard, probably impossible, to satisfy constraints like serving the 99th percentile of reads under 10 ms. One of the major factors affecting this is the MTTR for regions. There are three phases in the MTTR process: detection, assignment, and recovery. Of these, detection is usually the longest and is presently on the order of 20-30 seconds. During this time, clients are not able to read the region's data.

      However, some clients would be better served if regions remained available for eventually consistent reads during recovery. This would help satisfy low-latency guarantees for the class of applications that can work with stale reads.

      For improving read availability, we propose a replicated read-only region serving design, also referred to as secondary regions, or region shadows. Extending the current model, in which a region is opened for reads and writes on a single region server, the region will also be opened for reading on other region servers. The region server hosting the region for reads and writes (as in the current case) will be declared PRIMARY, while zero or more region servers may host the region as SECONDARY. There may be more than one secondary (replica count > 2).

      Will attach a design doc shortly which contains most of the details and some thoughts about development approaches. Reviews are more than welcome.

      We also have a proof-of-concept patch, which includes the master- and region server-side changes. Client-side changes will be coming soon as well.

        Issue Links

        1. HRegionInfo changes for adding replicaId and MetaEditor/MetaReader changes for region replicas (Sub-task, Closed, Enis Soztutar)
        2. HTableDescriptor changes for region replicas (Sub-task, Closed, Devaraj Das)
        3. Master/AM/RegionStates changes to create and assign region replicas (Sub-task, Closed, Devaraj Das)
        4. LoadBalancer changes for supporting region replicas (Sub-task, Closed, Enis Soztutar)
        5. Region and RegionServer changes for opening region replicas, and refreshing store files (Sub-task, Closed, Enis Soztutar)
        6. Add an API for defining consistency per request (Sub-task, Closed, Enis Soztutar)
        7. Failover RPC's from client using region replicas (Sub-task, Closed, Nicolas Liochon)
        8. Failover RPC's for multi-get (Sub-task, Closed, Sergey Shelukhin)
        9. Failover RPC's for scans (Sub-task, Closed, Devaraj Das)
        10. Shell changes for setting consistency per request (Sub-task, Open, Enis Soztutar)
        11. Master/RS WebUI changes for region replicas (Sub-task, Closed, Devaraj Das)
        12. Enable/AlterTable support for region replicas (Sub-task, Closed, Devaraj Das)
        13. HBCK changes for supporting region replicas (Sub-task, Closed, Devaraj Das)
        14. Provide user documentation for region replicas (Sub-task, Closed, Enis Soztutar)
        15. NPE in MetaCache.clearCache() (Sub-task, Closed, Enis Soztutar)
        16. Create an IntegrationTest for region replicas (Sub-task, Closed, Enis Soztutar)
        17. Integration test for multi-get calls (Sub-task, Closed, Devaraj Das)
        18. LoadBalancer.needsBalance() should check for co-located region replicas as well (Sub-task, Closed, Devaraj Das)
        19. NullPointerException in ConnectionManager$HConnectionImplementation.locateRegionInMeta() due to missing region info (Sub-task, Closed, Ted Yu)
        20. StoreFileRefresherChore throws ConcurrentModificationException sometimes (Sub-task, Closed, Devaraj Das)
        21. Multiget doesn't fully work (Sub-task, Closed, Sergey Shelukhin)
        22. TestStochasticLoadBalancer.testRegionReplicationOnMidClusterWithRacks() is flaky (Sub-task, Closed, Enis Soztutar)
        23. Table snapshot should handle tables whose REGION_REPLICATION is greater than one (Sub-task, Closed, Devaraj Das)
        24. HBCK should be updated to do replica related checks (Sub-task, Closed, Devaraj Das)
        25. Cache invalidation improvements from client side (Sub-task, Closed, Enis Soztutar)
        26. BaseLoadBalancer#roundRobinAssignment() may add same region to assignment plan multiple times (Sub-task, Closed, Ted Yu)
        27. TestAsyncProcess does not pass on HBASE-10070 (Sub-task, Resolved, Nicolas Liochon)
        28. Enable table doesn't balance out replicas evenly if the replicas were unassigned earlier (Sub-task, Closed, Devaraj Das)
        29. Fix RegionStates.getRegionAssignments to not add duplicate regions (Sub-task, Closed, Devaraj Das)
        30. Replica map update is problematic in RegionStates (Sub-task, Closed, Devaraj Das)
        31. Unique keys accounting in MultiThreadedReader is incorrect (Sub-task, Closed, Ted Yu)
        32. Add integration test to demonstrate performance improvement (Sub-task, Closed, Nick Dimiduk)
        33. multi-get should handle replica location missing from cache (Sub-task, Closed, Sergey Shelukhin)
        34. LoadTestTool should share the connection and connection pool (Sub-task, Closed, Enis Soztutar)
        35. Add integration test for bulkload with replicas (Sub-task, Closed, Devaraj Das)
        36. Add some tests on a real cluster for replica: multi master, replication (Sub-task, Closed, Nicolas Liochon)
        37. TestRegionRebalancing is failing (Sub-task, Closed, Enis Soztutar)
        38. Use HFileLink in opening region files from secondaries (Sub-task, Closed, Enis Soztutar)
        39. support parallel request cancellation for multi-get (Sub-task, Closed, Devaraj Das)
        40. HBASE-10070: HMaster can abort with NPE in #rebuildUserRegions (Sub-task, Closed, Nicolas Liochon)
        41. TestFromClientSideWithCoprocessor#testGetClosestRowBefore fails due to invalid block size (Sub-task, Resolved, Ted Yu)
        42. Fixes for scans on a replicated table (Sub-task, Closed, Devaraj Das)
        43. Timeline Consistent region replicas - Phase 2 design (Sub-task, Resolved, Enis Soztutar)
        44. Handle splitting/merging of regions that have region_replication greater than one (Sub-task, Closed, Devaraj Das)
        45. Fix for metas location cache from HBASE-10785 (Sub-task, Closed, Enis Soztutar)
        46. Write flush events to WAL (Sub-task, Closed, Enis Soztutar)
        47. Write region open/close events to WAL (Sub-task, Closed, Enis Soztutar)
        48. Write bulk load COMMIT events to WAL (Sub-task, Closed, Alex Newman)
        49. Async WAL replication for region replicas (Sub-task, Closed, Enis Soztutar)
        50. Flush / Compaction handling from secondary region replicas (Sub-task, Closed, Enis Soztutar)
        51. Bulk load handling from secondary region replicas (Sub-task, Closed, Jeffrey Zhong)
        52. RegionLocations::getRegionLocation can return unexpected replica (Sub-task, Resolved, Unassigned)
        53. Add support for doing get/scans against a particular replica_id (Sub-task, Closed, Jeffrey Zhong)
        54. hbase:meta's regions can be replicated (Sub-task, Closed, Devaraj Das)
        55. Failover handling for secondary region replicas (Sub-task, Closed, Enis Soztutar)
        56. Integration test for async wal replication to secondary regions (Sub-task, Closed, Enis Soztutar)
        57. Directly invoking split & merge of replica regions should be disallowed (Sub-task, Closed, Devaraj Das)
        58. Region replicas should be added to the meta table at the time of table creation (Sub-task, Closed, Enis Soztutar)
        59. Improve cancellation for the scan RPCs (Sub-task, Closed, Devaraj Das)
        60. Maintain SeqId monotonically increasing (Sub-task, Closed, Jeffrey Zhong)
        61. Replicas of regions can be cached from different instances of the table in MetaCache (Sub-task, Closed, Enis Soztutar)
        62. Handling memory pressure for secondary region replicas (Sub-task, Closed, Enis Soztutar)
        63. RegionReplicaReplicationEndpoint should not set the RPC Codec (Sub-task, Closed, Enis Soztutar)
        64. Disallow non-atomic update operations when TIMELINE consistency is enabled (Sub-task, Resolved, Unassigned)
        65. Async wal replication for region replicas and dist log replay does not work together (Sub-task, Closed, Enis Soztutar)
        66. ModifyTable increasing the region replica count should also auto-setup RRRE (Sub-task, Closed, Enis Soztutar)
        67. Split out locality metrics among primary and secondary region (Sub-task, Closed, Ted Yu)
        68. Handle FileNotFoundException in region replica replay for flush/compaction events (Sub-task, Closed, Enis Soztutar)
        69. Update documentation for 10070 Phase 2 changes (Sub-task, Resolved, Enis Soztutar)

          Activity

          Enis Soztutar added a comment -

          Attaching a design doc for the feature. Comments welcome.

          Vladimir Rodionov added a comment -

          Of these, detection is usually the longest and is presently on the order of 20-30 seconds.

          Any particular reason why?

          Jonathan Hsieh added a comment -

          This is great. I've been giving this a lot of thought recently and doing some experiments to see how feasible this is.

          Sergey Shelukhin added a comment -

          Any particular reason why?

          To avoid false positives due to e.g. GC pauses, we wait for some time before we consider an RS dead.

          Jonathan Hsieh added a comment -

          Here's a link to my outline of the feature; looking forward to comparing the designs and concerns. https://docs.google.com/document/d/1q5kJTOA3sZ760sHkORGZNeWgNuMzP41PnAXtaCgPgEU/edit#heading=h.pyxl4wbui0l

          Vladimir Rodionov added a comment -

          How about a simple beacon process which is not affected by GC and sends "I am alive" messages to ZK every, say, 100 ms? HBase can spawn this process on startup. One can easily detect server failure (including VM swapping) in less than 1 sec in most cases. If a server swaps then it should be marked as unavailable anyway (swapping is bad). Just saying.
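
          A minimal sketch of such a beacon, as a stand-alone process using the plain ZooKeeper Java client; the class name, znode path, connect string, and intervals are all hypothetical, and this is not HBase code:

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          // Hypothetical stand-alone beacon: a tiny process, separate from the RS JVM,
          // that refreshes an ephemeral znode so GC pauses in the RS do not delay
          // failure detection.
          public class RegionServerBeacon {
            public static void main(String[] args) throws Exception {
              // Short session timeout: the ephemeral znode vanishes quickly if we die.
              ZooKeeper zk = new ZooKeeper("zkhost:2181", 1000, event -> { });
              String path = "/hbase/beacons/" + args[0]; // args[0]: region server name
              zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
              while (true) {
                // "I am alive" every 100 ms, per the suggestion above.
                zk.setData(path, Long.toString(System.currentTimeMillis()).getBytes(), -1);
                Thread.sleep(100);
              }
            }
          }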

          Enis Soztutar added a comment -

          Jonathan, great that your doc has very similar ideas to the one proposed in this issue, albeit with a different main driver use case. We are focusing on read availability as a primary goal, and leave primary promotion as a future goal (in the last section). Instead of a primary / shadow mark, we are proposing a replica_id, which inherently contains that information, plus enables having more than one secondary (shadow). In the region changes section, I detailed how we can keep up with the primary region in a three-way proposal. As detailed, we do not want to impose any restrictions on the co-location of region replicas for the primaries hosted on the same server. Thus, in a wal-tailing case, we do not want to have every RS tailing the logs of every other RS. That is why I think it makes more sense to have this only in a wal-per-region world. Otherwise, I think we can tap into the replication and log replay work to tail our own logs and replicate to secondaries.

          Any details you can share for the experiments you did?

          Jonathan Hsieh added a comment -

          High-level comparison after a first read: these are quite complementary in goals and initial focus.

          • enis focuses on stale read availability for phase 1, then reduced recovery time for phase 2. Jon focuses on reduced recovery time first, and then punts stale read availability to a phase 2.
          • enis provides 3 options for log recovery; jon focuses on what enis calls the "wal tailing" approach.

          The requirement that I'm most concerned about is recovering consistent reads as quickly as possible. I find the memstore / wal tail approach most appealing because it focuses directly on this current area of weakness.

          At the moment I've done some experiments on the wal tailing approach to see how much overhead it causes on the nn. I don't think we need to have a wal-per-region world to make it feasible – we only need to make sure that our shadow regions behave more like shadow regionservers – have only one or two of the replicas read the logs and make them the preferred places for the shadows. This basically acts as a constraint for the selection of where secondaries are assigned.

          Enis Soztutar, in your doc does "group-based assignment" mean assigning multiple regions in a single transaction? Or is this hdfs affinity groups? (If it is, then I agree, we need hdfs affinity groups for this to be efficient.) However, I don't think we get into a situation where all RS's must read all other RS's logs – we only need the shadow RS's to read the primary RS's log.

          Devaraj Das added a comment -

          Vladimir Rodionov, we'd like to have the clients see the least downtime for their queries when the primary is not reachable for any reason (including a temporary network partition). We want to be doubly sure that we are marking the server dead at the appropriate time - not too soon and not too late. That's why 20 seconds or so in a cluster of, say, 100 nodes seems like a good value for a session timeout. Also, in practice we have seen cases where a node appears to be fine but in reality it isn't (faulty disk and things like that), and that increases the latency of the responses. We are trying to address the use case where clients are willing to (knowingly) tolerate the staleness of the reads.

          But yeah we should be able to poll for existence of the RS process/node (from a separate process local or remote) and remove the ZK node when we discover that the RS process is down. Discussions around these issues are in HBASE-5843.

          Enis Soztutar added a comment -

          Enis Soztutar, in your doc does "group-based assignment" mean assigning multiple regions in a single transaction?

          I was trying to refer to not having co-location constraints for secondary replicas whose primaries are hosted by the same RS. For example, if R1(replica=0), and R2(replica=0) are hosted on RS1, R1(replica=1) and R2(replica=1) can be hosted by RS2 and RS3 respectively. This can definitely use the hdfs block affinity work though.

          However, I don't think we get into a situation where all RS's must read all other RS's logs – we only need the shadow RS's to read the primary RS's log.

          I am assuming a random distribution of secondary regions per the above. In this case, for replication=2, a region server will have half of its regions in primary mode and the other half in secondary mode. For all the regions in secondary mode, it has to tail the logs of the RS where the primary is hosted. However, since there is no co-location guarantee, the primaries are also randomly distributed. For n secondary regions and m region servers, you will have to tail the logs of most of the RSs with high probability if n > m (I do not have the smarts to calculate the exact probability).
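
          A rough estimate of that probability, assuming (beyond what is stated above) that secondaries are placed uniformly at random among the servers:

            P(\text{a given RS's log is not tailed}) = (1 - 1/m)^n \approx e^{-n/m}
            E[\text{distinct logs tailed}] = m \left( 1 - (1 - 1/m)^n \right) \approx m \left( 1 - e^{-n/m} \right)

          For n = 2m this already comes to about 0.86 m, i.e. the logs of most of the region servers.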

          Devaraj Das added a comment -

          Jonathan Hsieh, WAL per region (WALpr) would give you the locality (and hence HDFS short-circuit reads) if you were to couple it with favored nodes. The cost is of course more WAL files... In the current situation (no WALpr) it would create quite some cross-machine traffic, no?

          Jonathan Hsieh added a comment -

          This discussion might be better to have on the dev@ list than here. Shall we start a thread there?

          Jonathan Hsieh added a comment -

          I've posted responses to Enis Soztutar and Devaraj Das on dev@

          kishore gopalakrishna added a comment -

          Have we considered using Apache Helix? It provides most of the features required to achieve this. At LinkedIn, we have solved a similar problem for Espresso (http://www.slideshare.net/amywtang/espresso-20952131), where we have a primary and a secondary for each shard. The secondary is used for reads when the primary is unavailable. Helix also promotes the secondary to primary when the primary goes down.

          Devaraj Das added a comment -

          @Kishore, one of the guiding thoughts is that HBase core should support this feature natively rather than use another infrastructure.

          Enis Soztutar added a comment -

          Update: as discussed, we have put up a proof-of-concept implementation for a working end-to-end scenario, and would like to share it to get some early reviews and feedback. If you are interested in the technical side of the changes, please check the patch/branch out. Please note that the patches and the branch are far from clean and complete, but clean enough to understand the scope of changes and the areas that are touched. This also contains the end-to-end APIs on the client side (except for execution policies). We will continue to work on the patches to get them into a more mature state, and will recreate and clean up the patches for review, but at any stage, reviews / comments are welcome. We will keep pushing the changes to this repo / branch until the patches are in a more stable state, at which point we will work on cleaning up and shuffling the patches to be more consumable by reviewers.

          The code is at github repo: https://github.com/enis/hbase.git, and the branch is hbase-10070-demo. This repository is based on 0.96.0 for now. I'll also attach a patch which contains all the changes if you want to take a closer look.
          This can be built with:

          git clone git@github.com:enis/hbase.git
          cd hbase 
          git checkout hbase-10070-demo 
          MAVEN_OPTS="-Xmx2g" mvn clean install package assembly:single -Dhadoop.profile=2.0  -DskipTests  -Dmaven.javadoc.skip=true -Dhadoop-two.version=2.2.0
          

          The tar ball generated would be hbase-assembly/target/hbase-0.96.0-bin.tar.gz
          The hadoop version that should be used for real cluster testing is 2.2.0.

          What's there in the repository:
          1. Client (Shell) changes
          The shell has been modified so that tables with more than one replica per region can be created:
          create 't1', 'f1', {REGION_REPLICATION => 2}
          One can also 'describe' a table and that will have the replica configuration in the response string:
          describe 't1'
          One can do a 'get' with the eventual consistency flag set (we haven't implemented the consistency semantics for the 'scan' family in this drop):
          get 't2','r6',{"EVENTUAL_CONSISTENCY" => true}
          [Note the quotes around the EVENTUAL_CONSISTENCY string. Will fix this soon to work without the quotes.]

          Outside the shell, the API for declaring willingness to tolerate eventual consistency is Get.setConsistency, and the returned Result can be queried for staleness via Result.isStale.
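
          For concreteness, a minimal sketch of a timeline-consistent read using those APIs. This is written against the API as it later stabilized (Connection/Table and a Consistency.TIMELINE enum value); the 0.96-based demo branch differs in details such as the EVENTUAL_CONSISTENCY shell flag:

          import org.apache.hadoop.hbase.TableName;
          import org.apache.hadoop.hbase.client.*;
          import org.apache.hadoop.hbase.util.Bytes;
          import org.apache.hadoop.conf.Configuration;

          public class TimelineGetExample {
            // Returns true if the read was served by a secondary replica (data may lag).
            public static boolean readPossiblyStale(Configuration conf) throws Exception {
              try (Connection conn = ConnectionFactory.createConnection(conf);
                   Table table = conn.getTable(TableName.valueOf("t1"))) {
                Get get = new Get(Bytes.toBytes("r6"));
                get.setConsistency(Consistency.TIMELINE); // a secondary replica may answer
                Result result = table.get(get);
                return result.isStale();
              }
            }
          }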

          2. Master changes
          The one main change here is the creation and management of replica HRegionInfo objects. The other change is to make the StochasticLoadBalancer aware of the replicas. During the assignment process, the Assignment Manager consults the balancer for an assignment plan - here the StochasticLoadBalancer ensures that the plan takes the constraint into account: primary/secondary not assigned to the same server, nor the same rack (if more than one rack is configured).
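
          To illustrate the server-level constraint, a hypothetical helper (not the actual StochasticLoadBalancer code) that rejects a plan in which any server would host two replicas of the same region:

          import java.util.*;
          import org.apache.hadoop.hbase.HRegionInfo;
          import org.apache.hadoop.hbase.ServerName;
          import org.apache.hadoop.hbase.util.Bytes;

          public class ReplicaPlacementCheck {
            // Hypothetical helper: true if some server in the plan would host two
            // replicas of the same region (replicas share table and start key).
            static boolean hasColocatedReplicas(Map<ServerName, List<HRegionInfo>> plan) {
              for (List<HRegionInfo> regions : plan.values()) {
                Set<String> seen = new HashSet<>();
                for (HRegionInfo hri : regions) {
                  String key = hri.getTable() + "," + Bytes.toStringBinary(hri.getStartKey());
                  if (!seen.add(key)) {
                    return true; // two replicas of one region on the same server
                  }
                }
              }
              return false;
            }
          }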

          3. RegionServer changes
          The one main change here is to be able to open regions in readonly mode. The other change here is to do with the periodic refresh of store files. The configuration that sets this up is (this is disabled by default):
           <property>
             <name>hbase.regionserver.storefile.refresh.period</name>
             <value>2000</value>
           </property>
          

          4. UI changes
          The UIs corresponding to the tables' status and the regions' status have been modified to say whether they have replicas.

          There are unit tests - TestMasterReplicaRegions, TestRegionReplicas, TestReplicasClient, TestBaseLoadBalancer, and some others.

          There is also a manual test scenario to test out reads coming from the secondary replica:
          1. create a cluster of at least two nodes.
          2. create a table with replica 2. From HBase shell:
          create 't1', 'f1', {REGION_REPLICATION => 2}

          3. arrange the regions so that the primary region is not co-located with the meta or namespace regions.
          You can use move commands in HBase shell for that:
          move '4392870ae8ef482406c272eec0312a02', '192.168.0.106,60020,1387069812919'

          4. from the shell, do a couple of puts and then 'flush' the table:

          hbase(main):005:0> for i in 1..100
          hbase(main):006:1> put 't1', "r#{i}", 'f1:c1', i
          hbase(main):007:1> end
          hbase(main):009:0> flush 't1'
          

          5. suspend the region server which is hosting the primary region replica by sending a kill -STOP signal from bash:

          kill -STOP <pid_of_region_server> 
          

          6. get a row from the table with the eventual-consistency flag set to true.

           get 't2','r6',{"EVENTUAL_CONSISTENCY" => true}
          

          7. a put should fail
          Steps (6) and (7) should be done quickly enough, otherwise the master will recover the region!! (Default ZK session timeout is 90 seconds.)

          stack added a comment -

          Here are a few comments on the JIRA content.

          In the present HBase architecture, it is hard, probably impossible, to satisfy constraints like serving the 99th percentile of reads under 10 ms.

          Should this be an architectural objective for HBase? Just asking. Our inspiration addressed the 99th percentile in a layer above.

          Of these, detection is usually the longest and is presently on the order of 20-30 seconds.

          We should work on this for sure. Native zk client immune to JVM pause has come up in the past. Would help all around (as per the Vladimir comment above)

          However, some clients would be better served if regions remained available for eventually consistent reads during recovery.

          Radical! Our DNA up to this has been all about giving the application a consistent view.

          For improving read availability, we propose a replicated read-only region serving design, also referred to as secondary regions, or region shadows.

          Could this be built as a layer on top of HBase rather than altering HBase core with shims on clients and CPs?

          Do you envision this feature being always on? Or can it be disabled? If the former (or latter actually), what implications for current read/write paths do you see?

          This is far along. I'm late to the game. Sorry about that. Let me take a look at design and github.

          Enis Soztutar added a comment -

          Should this be an architectural objective for HBase? Just asking. Our inspiration addressed the 99th percentile in a layer above.

          I think we should still focus on individual read latencies and try to minimize the jitter. Obviously, things like hdfs quorum reads, etc. are helpful in this respect, and we also plan to incorporate that kind of work together with this.

          We should work on this for sure. Native zk client immune to JVM pause has come up in the past. Would help all around (as per the Vladimir comment above)

          Agreed. But MTTR is orthogonal I think. In a world where a region is single-homed, there is no way you can get away without some timeout. We can try to reduce it in cases, but a network partition can always happen.

          Radical! Our DNA up to this has been all about giving the application a consistent view.

          Yep, we are not proposing to change the default semantics, just giving the flexibility if the tradeoffs are justifiable on the user side.

          Could this be built as a layer on top of HBase rather than altering HBase core with shims on clients and CPs?

          I think the cleanest way is to bake this into HBase proper. These are some of the reasons we went with this instead of proposing a layer above:

          • Regardless of eventual consistency for writes, replicated read-only tables or bulk-load-only tables are one of the major design goals for this work as well. This can and should be addressed natively by HBase, I would argue. The eventual consistency work just extends this further on a use-case basis.
          • RPC failover + RPC cancellation is not possible to do from outside (or at least not easily)
          • A higher level API cannot easily tap into LB to ensure that region replicas are not co-hosted.

          Do you envision this feature being always on? Or can it be disabled? If the former (or latter actually), what implications for current read/write paths do you see?

          The branch adds REGION_REPLICATION, which is a per-table conf, and a get/scan.setConsistency() API, which is per-request. The write path is not affected at all. On the read path, we do a failover (backup) RPC similar to http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf.
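
          The backup-RPC pattern in one small sketch (illustrative only; the real client logic adds cancellation of the losing request and per-replica bookkeeping):

          import java.util.concurrent.*;

          public class BackupRequest {
            // Ask the primary first; only if it has not answered within a short window,
            // also ask a secondary. The first answer wins. (Error handling and
            // cancellation of the slower call are elided.)
            public static <T> T call(Callable<T> primary, Callable<T> secondary,
                                     long primaryTimeoutMs, ExecutorService pool)
                throws Exception {
              CompletionService<T> cs = new ExecutorCompletionService<>(pool);
              cs.submit(primary);
              Future<T> first = cs.poll(primaryTimeoutMs, TimeUnit.MILLISECONDS);
              if (first != null) {
                return first.get(); // primary answered within the window
              }
              cs.submit(secondary); // hedge: the primary is slow, try a secondary too
              return cs.take().get(); // whichever of the two finishes first
            }
          }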

          Enis Soztutar added a comment -

          Thanks for looking BTW.

          stack added a comment -

          Feedback on the design:

          "...some clients will be better served if regions will be available for reads during recovery for doing eventually consistent reads"

          Has anyone asked for this feature on the list or in issues? You have a hard user for this feature or is this speculative work? If so, is the user/customer looking explicitly to be able to do stale reads or is giving them stale data a compromise on what they are actually asking for (Let me guess, HA consistent view). If HA consistent view is what they want, should we work on that instead?

          What delivery timeline are we talking? Is this for 1.0 hbase? Or is it post 1.0? If the latter, maybe we want to go more radical than what is being proposed here.

          However, some clients would be better served if regions remained available for eventually consistent reads during recovery.

          Should such clients use another data store, one that allows eventually consistent views? Clients having to deal with sometimes stale data will be more involved (I like this deduction from the F1 paper "Designing applications to cope with concurrency anomalies in their data is very error-prone, time-consuming, and ultimately not worth the performance gains").

          Pardon all the questions. I am concerned that a prime directive, the consistent view, is being softened. As is, it's easy saying what we are. Going forward, let's not get to a spot where we have to answer "It is complicated..." when asked if we are a consistent store or not.

          This document is nicely written.

          Here are some random comments as I go through the doc:

          + Could we implement this feature with some minor changes in core, and then do stuff like the read replica dance in a subclass of the current client – a read replica client – or at a layer above the current client?
          + Why notions of secondary and tertiary? Isn't it primary and replica only?
          + "Two region replicas cannot be hosted at the same RS (hard)" If RS count is < # of replicas, this is relaxed I'm sure (hard becomes soft). Hmm... this seems complicated: "If LB cannot assign secondary replicas due to above hard constraints, the replica will be added to this map. In case of new region servers joining the cluster, LB will be invoked over this map." Suggest dropping it for the first cut.
          + Seems wrong having RPC conscious of replicas. I'd think this is managed at higher levels, up in HCM.

          This design looks viable to me and the document is of good quality.

          Enis Soztutar added a comment -

          Has anyone asked for this feature on the list or in issues? You have a hard user for this feature or is this speculative work? If so, is the user/customer looking explicitly to be able to do stale reads or is giving them stale data a compromise on what they are actually asking for (Let me guess, HA consistent view). If HA consistent view is what they want, should we work on that instead?

          Should such clients use another data store, one that allows eventually consistent views? Clients having to deal with sometimes stale data will be more involved

          We have a customer use case which is the main driver, and while in the design stage we are also seeing interest from other prospects as well. The main use case is actually not an HA consistent view. Eventual consistency is a kind of misnomer actually; the main use case is to be able to read at least some data back in case of a failover. It is more like a "timeline consistency" rather than the eventual consistency a-la dynamo. That data might be stale as long as it is acknowledged as such, but for serving the data to the web tier, there should be a better guarantee than our current MTTR story.

          I am concerned that a prime directive, the consistent view, is being softened. As is, it's easy saying what we are. Going forward, let's not get to a spot where we have to answer "It is complicated..." when asked if we are a consistent store or not.

          Again, the main semantics are not changed. As per the tradeoffs section, we are trying to add the flexibility for some tradeoffs. Our whole CAP story is not a black-or-white choice. We are strongly consistent for single-row updates, while highly available across regions (an RS going down does not affect regions not hosted there), and eventually consistent across DCs.

          Could we implement this feature with some minor changes in core, and then do stuff like the read replica dance in a subclass of the current client – a read replica client – or at a layer above the current client?

          Seems wrong having RPC conscious of replicas. I'd think this is managed at higher levels, up in HCM.

          Nicolas Liochon, do you want to chime in?

          Why notions of secondary and tertiary? Isn't it primary and replica only?

          I should revisit the wording I guess. I think we are using the terminology primary <=> replicaId = 0, secondaries <=> replicaId > 0, and secondary <=> replicaId = 1, tertiary <=> replicaId = 2, etc.

          "Two region replicas cannot be hosted at the same RS (hard)" If RS count is < # of replicas, this is relaxed I'm sure (hard becomes soft). Hmm... this seems complicated:

          Agreed. I'll update the doc to reflect that. Initially, I thought we should do an underReplicated-regions kind of thing, but it requires more intrusive changes to the AM / LB since we have to keep these around, etc. Now the LB design is simpler in that we just try not to co-host region replicas, but if we cannot (in case replication > # RS or # racks, etc.) we simply assign the regions anyway.

          This design looks viable to me and the document is of good quality.

          Thanks

          stack added a comment -

          It is more like a "timeline consistency" rather than the eventual consistency a-la dynamo.

          Let us call it 'timeline consistency' from here on out then, while it is still in the inception phase. Let us not allow 'eventually consistent' anywhere near our lexicon, else we will only confuse.

          Enis Soztutar added a comment -

          What delivery timeline are we talking? Is this for 1.0 hbase? Or is it post 1.0? If the latter, maybe we want to go more radical than what is being proposed here.

          Forgot to answer that earlier. I think it will depend on the timing of 1.0 vs the timing of this branch. I think we can get all of the patches (for phase 1) into a mature and reviewed state in 1-2 months. Then at the time of the merge proposal, we can see how far along we are relative to the 1.0 release candidate. AFAIK, current trunk does not contain anything that is not in 0.98, so that means deciding whether we should go with 1.0 = a mature version of 0.98 or 1.0 = a mature version of 0.99. I prefer the latter, but that is up for discussion on the mailing list I guess.

          stack added a comment -

          Ok on the timing. You know how I feel about 1.0 – sooner rather than later – but hopefully this feature gets done in time.

          Looking at HBASE-10347, I have a 'design level' concern so let me raise it here rather than there. Let me repeat a comment I made there:

          After thinking more on this, I 'get' why you have the replicas listed inside the row rather than as rows themselves [in hbase:meta]. The row in hbase:meta becomes a proxy or facade for the little cluster of regions, one of which is the primary with the others read replicas. If that is the case, lets recognize it as so and make proper accommodation in the code base and model.

          Problems I see are:

          + HRegionInfo is now overloaded. Before, it was the info on a specific region. Now it is trying to serve two purposes: its original intent, and now too as a descriptor for the region-serving 'cluster' made of a primary and replicas. Lets avoid overloading what up to this point has had a clear role in the hbase model.
          + The primary holds the 'pole position', being the name of the region in meta. The read replicas are differently named, with 00001 and 00002, etc., interpolated into the middle of the region name. I suppose doing it this way 'minimizes' the disturbance in the code base, but I'm worried this naming exception will only confuse even though it minimizes change. Why would the primary not be named like the replica regions?

          On the latter I can hear a reply that goes, "For those who do not need read replicas, then they will be unaffected", which I would counter ensures that this feature will forever be ghetto and no one will use it because it goes unexercised.

          Trying to ensure that we do not paint ourselves into a corner, and to avoid the ghetto, looking beyond read replicas to full-on quorum read/writes, I can imagine we'd need some means like the above where the hbase:meta row name is no longer the physical name of a region but rather a logical name. With read replicas, the primary region is region number 00000; but doing quorum read/writes, the leader will need to be able to change over the life of the quorum.

          Going forward, all regions get an index? By default the index is zero. When there are replicas or quorum members, the indices distinguish members. With read replicas, the region with index 0 is primary. With a quorum, the index has no special meaning. In the past we have had two naming conventions for regions live side by side in the one live cluster. We could do it again.

          Devaraj Das added a comment -

          Ok on the timing. You know how I feel about 1.0 – sooner rather than later – but hopefully this feature gets done in time.

          Yeah... a couple of us are on it.

          After thinking more on this, I 'get' why you have the replicas listed inside the row rather than as rows themselves [in hbase:meta]. The row in hbase:meta becomes a proxy or facade for the little cluster of regions, one of which is the primary with the others read replicas.

          That's great. A copy-paste of what I said in the RB on HBASE-10347 for others' reference.
          "I and Enis had debated this as well. The consensus between us was that we don't need to add new META rows for the replicas. After all, the HRI information is exactly the same for all the replicas except for the replicaID. In the current meta, we already have a column for the location of a region. It seemed logical to just extend that model - add newer columns for the replica locations (and similarly for the other columns like seqnum). That way everything for a particular user-visible region stays in one row (and makes it easier for readers to know about all replica locations from that one row). Regarding special casing, yes there is some special casing in the way the regions are added to the meta - create table will create all regions (if the table was created with replica > 1), but only the primary regions will be added to the meta. The regionserver - when it updates the meta with the location after it opens a region invokes the API passing the replicaID as an argument - the column names are different based on whether the replicaID is primary or not. These are pretty much the special cases for the meta updates."

          HRegionInfo now is overloaded. Before it was the info on a specific region. Now it is trying to serve two purposes; its original intent and now too as a descriptor on the region-serving 'cluster' made of a primary and replicas. Lets avoid overloading what up to this has had a clear role in the hbase model.

          By doing it the way we have in the patch on HBASE-10347, it seems to reflect what's going on - "HRI is a logical descriptor and a facade for a bunch of primary & replica regions". That's how we store things in the meta and how we reconstruct HRIs from the meta when needed.
          There are possibly other approaches to doing this. E.g. extend HRegionInfo as, say, HRegionInfoReplica and maintain the information about the replicaID there, and/or change all the relevant methods to accept HRegionInfoReplica and potentially return it as well in relevant situations. The issue is that those approaches would be very intrusive, and we would still need special cases for whether replicaID == 0 or not. I am not confident how much we would gain there. Is it too much to ask to change the view of what an HRI means (to what you say above)? Anyway, let me ponder a bit on this...

          The primary holds the 'pole position' being the name of the region in meta. The read replicas are differently named with the 00001 and 00002, etc., interpolated into the middle of the region name. I suppose doing it this way 'minimizes' the disturbance in the code base but I'm worried this naming exception will only confuse though it minimizes change. Why would the primary not be named like the replica regions?

          I don't mind naming the primary regions similar to the replicas. This might mean tools that currently depend on the name format would break even if the cluster is not deploying tables with replicas (you guessed that response). But yeah, if you go the full Paxos route, the 'primary' could be anyone in the replica-set, and there it makes sense for all members of the set to have an index.
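
          For illustration, under the extend-the-columns scheme quoted above, the hbase:meta row for a region with two replicas might look roughly like this (the qualifier names below are illustrative, not necessarily the final ones):

              row key: t1,,1390000000000.d41d8cd98f00b204e9800998ecf8427e.
                info:regioninfo             serialized HRI (shared by all replicas)
                info:server                 rs1:60020  (location of replicaId 0, the primary)
                info:seqnumDuringOpen       1700       (seqnum of replicaId 0)
                info:server_0001            rs2:60020  (location of replicaId 1)
                info:seqnumDuringOpen_0001  1650       (seqnum of replicaId 1)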

          stack added a comment -

          ....Is it too much to ask to change the view of what an HRI means

          No. As long as we are clear what we are about.

          It is not clear in HBASE-10347 that HRI's role has changed from being the description of an actual region to being the description of a region 'cluster'. That needs to be made more plain.

          Regarding...

          ...Regarding special casing, yes, there is some special casing in the way the regions are added to the meta - create table will create all regions (if the table was created with replica count > 1), but only the primary regions will be added to the meta. The regionserver, when it updates the meta with the location after it opens a region, invokes the API passing the replicaID as an argument - the column names are different based on whether the replicaID is primary or not. These are pretty much the special cases for the meta updates."...

          I went back to the design doc and it is too high level for me to 'see' the new hbase:meta schema changes and the list of special casings. I think a description of the new hbase:meta schema would help. A list of the 'special-casings' would help too, but it might be easier just looking at code. A macro-level list perhaps?

          Are 'replica' and 'replica id' the right terms, given we want to first put in place mechanisms that can be used for more than just 'read replicas'? Region 'set' and set 'member'?

          This might mean tools that currently depend on the name format would break even if the cluster is not deploying tables with replicas

          Once before, we had two naming schemes in the cluster, and via various tricks – whether there is a '.' on the end – it was possible to have the two cohabit the same cluster instance. In this case, it might not have to be so radical. We might just relax the regex that looks at the encoded name such that if we encounter a region with more than the usual 32 hex characters of md5, then the extra characters are the set member index (no need of a delimiter – a delimiter messes up your sort, which may not be an issue if rows in hbase:meta are no longer the actual names of regions).

          Enis Soztutar added a comment -

          Thanks Stack for looking into this. Valid feedback.
          I think we can capture the desired region model with something like this:

          Region { table, startKey, endKey, regionId, encodedName }
          RegionReplica { region, replicaId }
          RegionReplicaState { offline, split }
          RegionReplicaLocation { server, seqId }
          RegionLineage { [splitDaughter], [mergeParent] }

          I think the challenge is that, currently, HRI, the meta layout, and the client-level APIs would need to be redesigned from scratch if we want to fully switch to this model. Then, for example, table.getRegions() would only return Regions, while table.getRegionReplicas() would return something different. The region assignments and everything else would have to switch to being RegionReplica based rather than HRI based.

          If we keep HRI ~= Region + RegionReplicaState, then the region location APIs and assignment have to be managed via RegionReplica objects, and the APIs have to be overloaded (since there won't be an HRI -> location mapping anymore). Right now what we have is HRI ~= RegionReplica + RegionState. I guess we can spend some time to see whether this is possible without refactoring major portions of the code base, but I fear the answer might be what my intuition says.

          For the paxos / quorum case, I think we can keep the special treatment of replicaId == 0 => primary for now. If we later change the write model, then we can have a leader definition, but the leader would not necessarily mean replicaId = 0. Even in that case, we have to track which server hosts which specific replica, which still requires a static replicaId or similar. The special case where we do not add the replicaId to the string form of the HRI is there to avoid requiring a meta + hdfs regioninfo rewrite. I guess we can add it there, but add special-case handling for parsing it back. Would that work?

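          Rendered as plain Java, the sketch above might look like the following (a hypothetical rendering only, not any committed API; the state and lineage pieces are omitted):

              // Hypothetical rendering of the proposed model; fields follow the sketch above.
              final class Region {
                final String table;
                final byte[] startKey;
                final byte[] endKey;
                final long regionId;
                final String encodedName;

                Region(String table, byte[] startKey, byte[] endKey,
                       long regionId, String encodedName) {
                  this.table = table;
                  this.startKey = startKey;
                  this.endKey = endKey;
                  this.regionId = regionId;
                  this.encodedName = encodedName;
                }
              }

              final class RegionReplica {
                final Region region;
                final int replicaId; // replicaId == 0 is the primary, by convention (for now)

                RegionReplica(Region region, int replicaId) {
                  this.region = region;
                  this.replicaId = replicaId;
                }
              }

              final class RegionReplicaLocation {
                final RegionReplica replica;
                final String server; // host:port of the hosting region server
                final long seqId;    // sequence id at open time

                RegionReplicaLocation(RegionReplica replica, String server, long seqId) {
                  this.replica = replica;
                  this.server = server;
                  this.seqId = seqId;
                }
              }
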
          stack added a comment -

          You may not have seen my response above where I argue we should not talk of replicas in our model at all. We should call the region instances something else, and then in a layer above do designations such as "primary" or "leader" and "read-only replica" or "follower". If you agree with me, the 'desired region model' you propose in your comment is no longer appropriate.

          Sorry Enis, I don't get your second paragraph above. Pardon me. It is a failing on my part. I'm having trouble untangling what the 'Right now' refers to, whether it is trunk or your work on this feature so far.... and what 'whether this is possible' refers to.

          I guess we can add it there, but add special-case handling for parsing it back. Would that work?

          I haven't done the research, but am thinking that a wholesale move to a new region naming scheme for all new region creations is better than a scheme where we have 'replicas' named in one format and everything else in another.

          Lars Hofhansl added a comment -

          Let's stick with small quick releases. I think that has worked well with 0.94.
          If it's ready we pull it into 1.0; if not, we can get it into 1.1 or (depending on size) into 2.0.
          As always, just my $0.02.

          Lastly, are we pushing into a domain that is at odds with HBase? It seems the targeted use case would be the domain of Cassandra and similar systems (yes, I used the C-word). Consistent reads ... sometimes ... is weird unless the semantics of when that happens are absolutely crystal clear.

          Devaraj Das added a comment -

          Lars Hofhansl The user specifies that he can tolerate stale reads via flags in the read API. The Results are also tagged as such, so he can inspect whether a result is stale or not. In other words, the user still has full control.

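          For illustration, such a request might look roughly like this (shown against the client API as it later shipped in 1.0+; the table and row names are made up):

              import java.io.IOException;
              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.hbase.HBaseConfiguration;
              import org.apache.hadoop.hbase.TableName;
              import org.apache.hadoop.hbase.client.Connection;
              import org.apache.hadoop.hbase.client.ConnectionFactory;
              import org.apache.hadoop.hbase.client.Consistency;
              import org.apache.hadoop.hbase.client.Get;
              import org.apache.hadoop.hbase.client.Result;
              import org.apache.hadoop.hbase.client.Table;
              import org.apache.hadoop.hbase.util.Bytes;

              public class TimelineGetExample {
                public static void main(String[] args) throws IOException {
                  Configuration conf = HBaseConfiguration.create();
                  try (Connection conn = ConnectionFactory.createConnection(conf);
                       Table table = conn.getTable(TableName.valueOf("t1"))) {
                    Get get = new Get(Bytes.toBytes("row1"));
                    get.setConsistency(Consistency.TIMELINE); // opt in to possibly-stale reads
                    Result result = table.get(get);
                    if (result.isStale()) {
                      // The result was served by a secondary replica and may lag the primary.
                    }
                  }
                }
              }
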
          Devaraj Das added a comment -

          stack, I agree with you that the notion of "replicaID == 0" being the primary replica, etc. should be maintained in a layer outside HRegionInfo. HRegionInfo could come with an 'index', and the 'index' would be an inherent part of the HRI's identification (in the name, etc.). The layer outside could associate "index == 0" with the "primary" replica, etc. Will submit a patch along these lines.

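          For illustration, region names under such a scheme might look like the following (the exact format, delimiter, and width of the index are illustrative, per the "00001 interpolated into the middle of the region name" discussion above):

              t1,startkey0,1390000000000.d41d8cd98f00b204e9800998ecf8427e.        index 0 (primary; today's format)
              t1,startkey0,1390000000000_0001.d41d8cd98f00b204e9800998ecf8427e.   index 1
              t1,startkey0,1390000000000_0002.d41d8cd98f00b204e9800998ecf8427e.   index 2
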
          Enis Soztutar added a comment -

          Attaching a more detailed design proposal for the first subtask about HRI + meta changes after the discussions with Stack.

          In the doc, we list two approaches that can be pursued, depending on the pros/cons discussed there. We'll continue to modify the patch for HBASE-10347 to see which one is more feasible.

          Please feel free to review + leave comments on the doc.

          Enis Soztutar added a comment -

          I've updated the subtask HBASE-10347 with a new patch, and also updated the design doc to add an alternate third proposal, which I think we can iterate on for the final shape. Please review the patch / doc for whether it makes sense or not.

          Devaraj Das added a comment -

          Straightforward patch..

          Enis Soztutar added a comment -

          Straightforward patch..

          I guess this should go to the subtask HBASE-10348.

          Enis Soztutar added a comment -

          I have created the branch in svn since we have some patches ready for commit. The branch is https://svn.apache.org/repos/asf/hbase/branches/hbase-10070, which will be the base for the eventual merge back into trunk. The plan is to only commit reviewed patches there, with at least 2 +1's. All subtasks of this jira are scheduled to be committed to that branch.

          We might have to do a final rebase, but hopefully that would be only mechanical and would not need any reviews.

          Enis Soztutar added a comment -

          I also created a version "hbase-10070" to mark the issues that have been committed to the branch. We can set the fix versions of those jiras to that version. Not sure whether it is best practice to resolve those jiras or not. I just left them open for now.
          The first subtask HBASE-10347 is committed to the branch. HBASE-10348 and HBASE-10354 will shortly follow.

          Devaraj Das added a comment -

          Let's resolve the subtask issues after commit to branch.

          Sergey Shelukhin added a comment -

          I did just that w/HBASE-10356. Don't forget to change the fix version though.

          Enis Soztutar added a comment -

          Let's resolve the subtask issues after commit to branch.

          Ok. I resolved the issues that were committed to hbase-10070, and changed the fix versions to only include the branch name. Once we merge the branch to trunk, we can add the fix versions.

          Jonathan Hsieh added a comment -

          I also created a version "hbase-10070" to mark the issues that have been committed to the branch. We can set the fix versions of those jiras to that version. Not sure whether it is best practice to resolve those jiras or not.

          Sounds good to me. When we did snapshots, we committed and resolved with the affects and target versions set to the branch version. When we successfully merged, we changed it to the normal version number (back then 0.95). As an example, take a look at HBASE-7339's history.

          Enis Soztutar added a comment -

          Thanks Jonathan. We'll use the same approach here. 8 issues resolved already.

          Tianying Chang added a comment -

          Enis Soztutar Is the 3rd option, the async WAL replication feature, already implemented?

          Enis Soztutar added a comment -

          Enis Soztutar Is the 3rd option, the async WAL replication feature, already implemented?

          Not yet; we are finishing phase 1 as identified in the doc. Only the scan support is left for that. Afterwards, we plan to continue with implementing the async WAL approach, with a design proposal of its own. Are you interested in using async WAL replication separately, or are you asking whether the feature as a whole is available?

          Tianying Chang added a comment -

          Enis Soztutar I am interested in the feature as a whole. I can see that Get for a single row is already supported. How about get(List<Get>)? I cannot seem to find it. Will that be supported later, together with scan, in phase 1?

          Sergey Shelukhin added a comment -

          List get is also supported, although some changes need to be committed to address corner cases, like HBASE-10794.

          Tianying Chang added a comment -

          I am using Enis' git branch of 10070; it has commits until Jan 30. That is probably why I don't see it? I checked the 10070-working branch; it has commits up to Feb 13, which still seems not up to date enough. Do you know which branch has the multi-get support?

          Ted Yu added a comment -

          Tianying, here is the branch: https://svn.apache.org/repos/asf/hbase/branches/hbase-10070

          Tianying Chang added a comment -

          Got it. Thanks Ted Yu

          Enis Soztutar added a comment -

          Changed issue title, according to HBASE-10354.

          Mikhail Antonov added a comment -

          Guys, sorry for coming to this topic really late, but there are some considerations I'd like to bring up (I'm looking at whatever is currently committed in the branch).

          I am thinking about the client API changes needed for consistency (HBASE-10354) in light of the consensus-based effort, which aims at having multiple active replicas, and about how to make those two approaches work together smoothly so that multiple active replicas could be built on this solid foundation. Perhaps there is a more flexible alternative.

          Instead of defining enum Consistency {strong, timeline}, hooking into Get and Scan, and defining several possible internal strategies for how to send RPCs based on that ("primary timeout", "parallel", "parallel with delay"), maybe we can define a pluggable strategy for how to execute RPCs? Similar to HDFS FailoverProxyProvider, which can be defined in the client's config.

          This way we could plug in (rough sketch below):

          • a "no-op" provider, to get the behavior we have now in trunk
          • a timeline provider, which would work as described here
          • a provider which can work with multiple active replicas and round-robin between them if some of them fail
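
          As a very rough sketch (hypothetical names, nothing committed), such a pluggable provider could look like:

              import java.io.IOException;
              import java.util.List;
              import java.util.concurrent.Callable;

              // Hypothetical analogue of HDFS's FailoverProxyProvider for HBase reads;
              // the client would load an implementation named in its configuration.
              interface ReplicaCallPolicy {
                // Executes one logical read over the per-replica calls (index 0 = primary)
                // and returns the first usable result.
                <T> T call(List<Callable<T>> replicaCalls) throws IOException;
              }

              // "No-op" policy: always use the primary, i.e. today's trunk behavior.
              class PrimaryOnlyPolicy implements ReplicaCallPolicy {
                @Override
                public <T> T call(List<Callable<T>> replicaCalls) throws IOException {
                  try {
                    return replicaCalls.get(0).call();
                  } catch (Exception e) {
                    throw new IOException(e);
                  }
                }
              }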

          Thoughts?

          Konstantin Boudnik added a comment -

          Can I ask what definition of timeline consistency we are using here? I am a little bit uncomfortable with it, as I don't know what exactly is implied by the term.

          To explain further: in the case of consensus-based replication (described in the Design Document attached to HBASE-10909), we claim that all writable active replicas are one-copy equivalent, i.e. strongly consistent across replicas that have reached the same GSN. In the case of this JIRA, strong consistency with just a single writable replica (and no RO slaves) has the same semantics. I believe that by providing a pluggable fail-over policy implementation we can guarantee that strong consistency in the consensus-based replication case has the same semantic meaning as in HBASE-10070. In other words, we'll provide an implementation of the semantics instead of just documentation of them.

          Replying to an earlier comment of Stack's:

          Pardon all the questions. I am concerned that a prime directive, consistent view, is being softened. As is, it's easy saying what we are. Going forward, lets not get to a spot where we have to answer "It is complicated..." when asked if we are a consistent store or not.

          shall we try to provide harder consistency guarantees, while covering the weaker ones en route?

          Devaraj Das added a comment -

          Mikhail Antonov yes, we did think about a pluggable client provider. We put that as future work since we wanted to focus on implementing all the bits and pieces for the timeline consistency first.

          @cos, the timeline definition is here - HBASE-10513. We already have the timeline stuff working (and tested). We could already hide the fact that consensus-based strong consistency (when that is available) is being used by the servers behind the default consistency level, which is STRONG.

          In summary, yes, we should consider a pluggable provider but I think that would be an incremental step from what has been done.

          What has been done so far addresses timeline consistency. I also look at the work we have done as building blocks for consensus based strong consistency.

          Andrew Purtell added a comment -

          Can I ask what definition of timeline consistency we are using here?

          I believe the Yahoo PNUTS paper is where many first heard the term 'timeline consistency'. Daniel Abadi summarizes it at http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html as "where replicas may not be consistent with each other but updates are guaranteed to be applied in the same order at all replicas", which I think is very concise.

          If we claim this about HBASE-10070, is that accurate? (I think yes.)

          If so, the documentation committed on HBASE-10513 doesn't summarize the term as cleanly, in my opinion. Might be good to update the lead-in to the timeline consistency part of the doc.

          stack added a comment -

          ...hooking into Get and Scan, and defining several possible internal strategies for how to send RPCs based on that ("primary timeout", "parallel", "parallel with delay"), maybe we can define a pluggable strategy for how to execute RPCs? Similar to HDFS FailoverProxyProvider, which can be defined in the client's config.

          Mikhail Antonov So, rather than have the client ask for a level of 'consistency' in the API, the replica interaction would be set at client construction, dependent on the plugin supplied?

          In the API at the moment we have STRONG and TIMELINE (what happens if I ask for TIMELINE and the cluster is not deployed with read replicas? Ignored?). If we were to add QUORUM_STRONG, are we thinking that a client should be able to choose amongst these options? Will that fly? At the moment, as noted, we have amended Get and Scan. Will we have to amend all ops if we follow the path of HBASE-10513?

          How hard would it be to evolve from HBASE-10513 to Mikhail Antonov's suggestion?

          Enis Soztutar added a comment -

          I believe the Yahoo PNUTS paper is where many first heard the term 'timeline consistency'. Daniel Abadi summarizes it at http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html as "where replicas may not be consistent with each other but updates are guaranteed to be applied in the same order at all replicas", which I think is very concise. .... Might be good to update the lead-in to the timeline consistency part of the doc.

          Agreed. Yes, the name was inspired by the PNUTS model. I like the concise form. I'll add it to the doc.

          we can define a pluggable strategy for how to execute RPCs

          The Consistency enum is mostly concerned with semantics, while the execution layer (RPC) is concerned with latency + performance. Even within a given consistency model, you may want different execution strategies, I think (for TIMELINE consistency: parallel, parallel with delay, or go to the first replica, then the second, then the third, etc.). In the code committed to the branch, the consistency model implies a hard-coded execution model.

          What happens if I ask for TIMELINE and the cluster is not deployed with read replicas? Ignored?

          Good question. The execution strategy for TIMELINE reads is to do the primary RPC first; then, if there is no response after some delay (10ms by default), do the RPCs for the secondaries. If the region has only 1 replica, there won't be any secondary RPCs, so the client will just wait for the primary RPC to respond.

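          In rough Java, that strategy amounts to the following (a sketch only; the names are illustrative, and error handling and cancellation of the losing calls are omitted):

              import java.util.List;
              import java.util.concurrent.Callable;
              import java.util.concurrent.CompletionService;
              import java.util.concurrent.ExecutorCompletionService;
              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Future;
              import java.util.concurrent.TimeUnit;

              class TimelineRead {
                // replicaCalls.get(0) is the primary RPC; the rest target secondaries.
                static <T> T readWithBackup(List<Callable<T>> replicaCalls,
                                            ExecutorService pool,
                                            long primaryDelayMs) throws Exception {
                  CompletionService<T> cs = new ExecutorCompletionService<>(pool);
                  cs.submit(replicaCalls.get(0));                    // primary RPC first
                  Future<T> done = cs.poll(primaryDelayMs, TimeUnit.MILLISECONDS);
                  if (done == null) {                                // primary is slow: fan out
                    for (int i = 1; i < replicaCalls.size(); i++) {
                      cs.submit(replicaCalls.get(i));
                    }
                    done = cs.take();                                // first response wins
                  }
                  return done.get();
                }
              }
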
          Vladimir Rodionov added a comment -

          Can somebody explain (in a few sentences) how the approach described in this JIRA differs from the following one:

          1. Have master cluster for writes/reads
          2. Have secondary cluster for reads only
          3. Set up replication master -> secondary
          4. Keep replication lag under control.
          5. Add secondary cluster support to HBase client library.

          From the H/W usage point of view, both approaches require additional resources anyway.

          Devaraj Das added a comment - - edited

          VladRodionov, this solution is complementary to multi-DC HA. It addresses regionserver failures within a single cluster. Not sure if it was obvious or not, but this solution doesn't introduce more copies of the hfiles in the cluster; there is still a single copy of the hfiles in a given cluster. The "Tradeoffs" section in HBASE-10513 talks about the other resource overheads added if this solution is used.

          Mikhail Antonov added a comment -

          Devaraj Das, stack, Enis Soztutar

          Yeah, the reason I brought it up is that, unlike changes in for example the LB, this is a public (yet evolving) API, so I just wanted to double-check that we don't expose details to client code that would limit us later.

          Even within a given consistency model, you may want different execution strategies, I think (for TIMELINE consistency: parallel, parallel with delay, or go to the first replica, then the second, then the third, etc.). In the code committed to the branch, the consistency model implies a hard-coded execution model.

          Sure, any consistency model (except the current behavior, I guess) would benefit from being customizable.

          So, rather than have the client ask for a level of 'consistency' in the API, the replica interaction would be set at client construction, dependent on the plugin supplied?

          Either "rather" or "both", I guess. If we could say that level of consistency (strong, timeline or quorum-strong) could be defined in config files per-client (not per-operations), we would be able to avoid having this enum. But we consider that being able to define consistency level per-operation is mandatory, right?

          In that case I'm thinking of the following model:

          • deploy a pluggable policy at the client side which decides on RPC requests; this policy would be used globally for all requests as the default
          • consider the Consistency enum (and point it out in both user and dev level docs) as a "hint", used only to customize individual scans or gets, and probably add a note in the class documentation that the cluster may ignore the flag if the feature isn't available?
          • the current timeline consistency model doesn't assume quorums for writes, so I think it makes sense to add QUORUM_STRONG to the enum.

          Thoughts?

          Enis Soztutar added a comment -

          But we consider that being able to define the consistency level per operation is mandatory, right?

          Yes.

          deploy a pluggable policy at the client side which decides on RPC requests; this policy would be used globally for all requests as the default

          Right now there is no alternate implementation for normal RPCs, and only one model for TIMELINE RPCs. When we have alternate implementations, we can make it pluggable (and configurable per operation).

          the current timeline consistency model doesn't assume quorums for writes, so I think it makes sense to add QUORUM_STRONG to the enum.

          I don't like adding this now. Once we have a corresponding implementation for the proposed quorum write we can add it to the enum later.

          Vladimir Rodionov added a comment -

          In the present HBase architecture, it is hard, probably impossible, to satisfy constraints like 99th percentile of the reads will be served under 10 ms

          I did some quick math. To blame the bottom 1% on RS-down events means that, on average (with a 1 min MTTR), we should see ~14-15 RS-down events per cluster per day (1440 minutes). I think that is way above what we see in real life. I am not saying this is not worth doing, but it will not give us 10ms at the 99th percentile (of this I am pretty sure). Just saying. This is only for RS-down type failures. We all know that an HBase cluster may experience other types of temporary degradation which may affect read request latency:

          • blocked writes under heavy load (probably reads as well?) - not sure. Solution: tune configuration and throttle incoming requests
          • blocked reads due to blocked writes (no available handlers to serve incoming requests). Solution: have separate pools for writes/reads, or use priorities on RPC requests (a new feature, correct?)
          • excessive GC (sometimes) - Solution: off heap, off heap, off heap.
          • something else I forgot or am not aware of?

          but all of these should be, and must be, avoided in a properly configured and tuned cluster.

          So this is basically about mitigating serious events (RS down), not transient ones. To improve the read-latency distribution there is one classic solution that works for sure: cache, cache, cache.

          Mikhail Antonov added a comment -

          Enis Soztutar yep, sounds good to me. Adding a new value to the enum should be backward-compatible.

          Vladimir Rodionov added a comment -

          Maybe my point was not clear ... I agree that read HA (how many 9's, by the way?) is a good feature to have, but this won't give us what the developer declared in the description section. The relatively high MTTR is not the major source of bad 90%+ request latency in HBase.

          Konstantin Boudnik added a comment -

          but it will not give us 10 ms at the 99th percentile

          In reality, nobody with skin in the game is seriously considering 99% availability. Even 99.999% isn't suitable for some. Would you consider 5.7 minutes annualized to be a plausible danger?
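          (For reference on the arithmetic: a year is 365 × 24 × 60 = 525,600 minutes, so 99.999% availability allows 525,600 × 0.00001 ≈ 5.3 minutes of downtime per year.)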

          Vladimir Rodionov added a comment -

          I apologize, Enis Soztutar, for calling you "the developer".

          Vladimir Rodionov added a comment -

          In reality, nobody with skin in the game is seriously considering 99% availability. Even 99.999% isn't suitable for some. Would you consider 5.7 minutes annualized to be a plausible danger?

          How does this contradict my statement? HA is a great feature, but it has nothing to do with improving the read-latency distribution.

          Andrey Stepachev added a comment -

          If you can spread the read load of a hot region across shadows, read latencies can go down due to less contention, especially if you avoid reading from the primary RS.

          Mikhail Antonov added a comment -

          HA is a great feature, but it has nothing to do with improving the read-latency distribution.

          In this jira we're saying that we can't provide good latency in 99(.999?)% of cases for the following reason (among others - there are also GC pauses, etc.): when a region's RS fails, requests time out or just take a really long time. This feature addresses that aspect. So this jira aims (as I understand it) to provide HA (possibly stale) replicas, with the added benefit of reduced latency.

          Vladimir Rodionov added a comment -

          What is the current status of this feature? Is it on hold? Abandoned because of FB's HydraBase (https://issues.apache.org/jira/browse/HBASE-12259) being ported to HBase?

          Devaraj Das added a comment -

          As commented here - https://issues.apache.org/jira/browse/HBASE-11183?focusedCommentId=14221856&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14221856 - we did implement the second phase of the work (most of the unresolved jiras here pertain to that). We are in the process of porting the patches to the master and 1.1 branches. As it stands, this addresses different use cases than HydraBase does (and yes, we should document the differences between the two).

          Vladimir Rodionov added a comment -

          Thank you, Devaraj Das.

          Sean Busbey added a comment -

          What's the current status on this, specifically in branch-1 for the 1.2 release?

          Devaraj Das added a comment - edited

          I think all the planned work on this is complete, Sean Busbey, and all the code is in 1.1. We need to update the docs to reflect the current state, though.

          Sean Busbey added a comment -

          Great! Can we get a jira going for the docs work? We can work towards getting it done while the 1.2 RCs are in progress.

          Enis Soztutar added a comment -

          Great! Can we get a jira going for the docs work? We can work towards getting it done while the 1.2 RCs are in progress.

          See HBASE-13973. Will resolve this one once that is in. It is the last remaining item.

          Enis Soztutar added a comment -

          All subtasks have been done for some time. Time to resolve this. Thanks, everyone.


            People

            • Assignee:
              Enis Soztutar
            • Reporter:
              Enis Soztutar
            • Votes:
              11
            • Watchers:
              90
