HBase
  1. HBase
  2. HBASE-4114

Metrics for HFile HDFS block locality

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.92.0
    • Component/s: metrics, regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Normally, when we put hbase and HDFS in the same cluster ( e.g., region server runs on the datenode ), we have a reasonably good data locality, as explained by Lars. Also Work has been done by Jonathan to address the startup situation.

      There are scenarios where regions can be on a different machine from the machines that hold the underlying HFile blocks, at least for some period of time. This will have performance impact on whole table scan operation and map reduce job during that time.

      1. After load balancer moves the region and before compaction (thus generate HFile on the new region server ) on that region, HDFS block can be remote.
      2. When a new machine is added, or removed, Hbase's region assignment policy is different from HDFS's block reassignment policy.
      3. Even if there is no much hbase activity, HDFS can load balance HFile blocks as other non-hbase applications push other data to HDFS.

      Lots has been or will be done in load balancer, as summarized by Ted. I am curious if HFile HDFS block locality should be used as another factor here.

      I have done some experiments on how HDFS block locality can impact map reduce latency. First we need to define a metrics to measure HFile data locality.

      Metrics defintion:

      For a given table, or a region server, or a region, we can define the following. The higher the value, the more local HFile is from region server's point of view.

      HFile locality index = ( Total number of HDFS blocks that can be retrieved locally by the region server ) / ( Total number of HDFS blocks for all HFiles )

      Test Results:
      This is to show how HFile locality can impact the latency. It is based on a table with 1M rows, 36KB per row; regions are distributed in balance. The map job is RowCounter.

      HFile Locality Index Map job latency ( in sec )
      28% 157
      36% 150
      47% 142
      61% 133
      73% 122
      89% 103
      99% 95

      So the first suggestion is to expose HFile locality index as a new region server metrics. It will be ideal if we can somehow measure HFile locality index on a per map job level.

      Regarding if/when we should include that as another factor for load balancer, that will be a different work item. It is unclear how load balancer can take various factors into account to come up with the best load balancer strategy.

      1. HBASE-4114-trunk.patch
        43 kB
        Ming Ma
      2. HBASE-4114-trunk.patch
        43 kB
        Ming Ma
      3. HBASE-4114-trunk.patch
        43 kB
        Ming Ma
      4. HBASE-4114-trunk.patch
        43 kB
        Ming Ma
      5. HBASE-4114-trunk.patch
        44 kB
        Ming Ma
      6. HBASE-4114-trunk.patch
        44 kB
        Ming Ma

        Activity

        Hide
        stack added a comment -

        This is great stuff Ming. Let me see if I understand your index properly:

        HFile locality index = ( Total number of HDFS blocks that can be retrieved locally by the region server ) / ( Total number of HDFS blocks for all HFiles )
        

        Is that 'Total number of HDFS blocks that can be retrieved locally by the region server'.. for all of the hfiles the current rowcounter mapreduce job is reading?

        And 'Total number of HDFS blocks for all HFiles'... the current rowcounter mapreduce job is reading?

        Should this new fancy locality index, should it be done per region, rather than per hfile?

        I can imagine measuring per hfile is easier and secondly, it allows you demonstrate the importance of locality (as per above with your little latency table).

        How will you calculate the index? Will it require a query against the namenode for every file under a region? Can you get the info you need on region info or would it be background process?

        Good stuff.

        Show
        stack added a comment - This is great stuff Ming. Let me see if I understand your index properly: HFile locality index = ( Total number of HDFS blocks that can be retrieved locally by the region server ) / ( Total number of HDFS blocks for all HFiles ) Is that 'Total number of HDFS blocks that can be retrieved locally by the region server'.. for all of the hfiles the current rowcounter mapreduce job is reading? And 'Total number of HDFS blocks for all HFiles'... the current rowcounter mapreduce job is reading? Should this new fancy locality index, should it be done per region, rather than per hfile? I can imagine measuring per hfile is easier and secondly, it allows you demonstrate the importance of locality (as per above with your little latency table). How will you calculate the index? Will it require a query against the namenode for every file under a region? Can you get the info you need on region info or would it be background process? Good stuff.
        Hide
        Ted Yu added a comment -

        Maybe in phase 2, efficiently calculating locality index per region for a given region server would be useful for load balancer to determine the best region server for the underlying region.

        Show
        Ted Yu added a comment - Maybe in phase 2, efficiently calculating locality index per region for a given region server would be useful for load balancer to determine the best region server for the underlying region.
        Hide
        Ming Ma added a comment -

        Thanks, Stack, Ted.

        1. In the experiment table above, the "total number of HDFS blocks that can be retrieved locally by the region server" as well as "total number of HDFS blocks for all HFiles" are defined on the whole cluster level. The external program also calculates locality information per hfile, region as well as per region server. It uses HDFS namenode and the calculation is independent of any map reduce jobs.

        2. In terms of how we can calculate this metrics inside hbase, we can do in two steps. first one is to calcluate the metrics independent of map reduce jobs; the second one is to calcuate it on per map reduce job level.

        3. Calculate on the locality index, independent of map reduce jobs.

        a. It will first be calcuated on hfile level

        { total # of HDFS block, total # of local HDFS blocks }

        ; then the data get aggregated on region level, finally get aggregated on region server level.

        b. Impact on namenode. There are 2 RPC calls to NN to get block info for each hfile. If we assume 100 regions per RS, 10 hfiles per region, 500 RSs, we will have 1M RPC hits to NN. Most of the time, that won't be an issue if we only calculate hfile locality index when hfile is created or region is loaded by the RS the first time. Because HDFS can still move HDFS blocks around without hbase knowing it, we still need to refresh the value periodically.

        c. The computation can be done in RS or HMaster. It seems RS is better in terms of design(only store knows the HDFS path for hfile location, HMaster doesn't) and extensiblity(to calculate locality index per map reduce job). The locality index can be part of HServerLoad and RegionLoad for load balancer to use. RS will rotate through all regions periodically in its main thread. The calcuation interval defined by by "hbase.regionserver.msginterval" might be too short for this scenario to minimize the load to NN for a large cluster ( 20 NN RPC per RS per 3 sec ).

        d. The locality index can be a new RS metrics. We can also put it on table.jsp for each region.

        e. HRegionInfo is kind of static. It doesn't change over time, however, locality index changes overtime for a given region. Maybe ClusterStatus/HServerInfo/HServerLoad/RegionLoad are better?

        4. Locality index calculation for scan / map reduce job.

        a. The original scenario is for full table scan only. If we want to provide accurate locality index for any scan / map reduce, this could be tricky given i) map reduce job can have start/end keys and filters such as time range; ii) block cache can be used and thus hfile shouldn't be accounted for if there is cache hit. iii) data volume read from HDFS block is also a factor. Reading smaller buffer is different from reading bigger buffer.

        b. One useful scenario is we want to find out why map jobs run slower sometimes. So it is useful the metrics is just there as part of map reduce job status page. We can estimate by using ganglia page to get the locality index value for the RSs at the time map reduce job is run.

        c. To provide more accurate data, we can modify TableInputFormat, a) call HBaseAdmin.getClusterStatus to get the locality index info for each region. b) calculate the intersection between scan specification and ClusterStatus based on key range as well as column family. It isn't 100% accurate, but it might be good enough.

        d. To be really accurate, region server needs to provide locality index for each scan operation back to the client.

        Show
        Ming Ma added a comment - Thanks, Stack, Ted. 1. In the experiment table above, the "total number of HDFS blocks that can be retrieved locally by the region server" as well as "total number of HDFS blocks for all HFiles" are defined on the whole cluster level. The external program also calculates locality information per hfile, region as well as per region server. It uses HDFS namenode and the calculation is independent of any map reduce jobs. 2. In terms of how we can calculate this metrics inside hbase, we can do in two steps. first one is to calcluate the metrics independent of map reduce jobs; the second one is to calcuate it on per map reduce job level. 3. Calculate on the locality index, independent of map reduce jobs. a. It will first be calcuated on hfile level { total # of HDFS block, total # of local HDFS blocks } ; then the data get aggregated on region level, finally get aggregated on region server level. b. Impact on namenode. There are 2 RPC calls to NN to get block info for each hfile. If we assume 100 regions per RS, 10 hfiles per region, 500 RSs, we will have 1M RPC hits to NN. Most of the time, that won't be an issue if we only calculate hfile locality index when hfile is created or region is loaded by the RS the first time. Because HDFS can still move HDFS blocks around without hbase knowing it, we still need to refresh the value periodically. c. The computation can be done in RS or HMaster. It seems RS is better in terms of design(only store knows the HDFS path for hfile location, HMaster doesn't) and extensiblity(to calculate locality index per map reduce job). The locality index can be part of HServerLoad and RegionLoad for load balancer to use. RS will rotate through all regions periodically in its main thread. The calcuation interval defined by by "hbase.regionserver.msginterval" might be too short for this scenario to minimize the load to NN for a large cluster ( 20 NN RPC per RS per 3 sec ). d. The locality index can be a new RS metrics. We can also put it on table.jsp for each region. e. HRegionInfo is kind of static. It doesn't change over time, however, locality index changes overtime for a given region. Maybe ClusterStatus/HServerInfo/HServerLoad/RegionLoad are better? 4. Locality index calculation for scan / map reduce job. a. The original scenario is for full table scan only. If we want to provide accurate locality index for any scan / map reduce, this could be tricky given i) map reduce job can have start/end keys and filters such as time range; ii) block cache can be used and thus hfile shouldn't be accounted for if there is cache hit. iii) data volume read from HDFS block is also a factor. Reading smaller buffer is different from reading bigger buffer. b. One useful scenario is we want to find out why map jobs run slower sometimes. So it is useful the metrics is just there as part of map reduce job status page. We can estimate by using ganglia page to get the locality index value for the RSs at the time map reduce job is run. c. To provide more accurate data, we can modify TableInputFormat, a) call HBaseAdmin.getClusterStatus to get the locality index info for each region. b) calculate the intersection between scan specification and ClusterStatus based on key range as well as column family. It isn't 100% accurate, but it might be good enough. d. To be really accurate, region server needs to provide locality index for each scan operation back to the client.
        Hide
        Ming Ma added a comment -

        Thanks, Stack, Ted.

        Here is the draft.

        1. The locality index is calculated when StoreFile is opened and cached for the duration of the object.
        2. RS will provide a new metrics called hdfsBlockLocalityIndex based on StoreFile's cached value.
        3. There is some static helper functions for load balancer to compute the block distribution on demand.

        Show
        Ming Ma added a comment - Thanks, Stack, Ted. Here is the draft. 1. The locality index is calculated when StoreFile is opened and cached for the duration of the object. 2. RS will provide a new metrics called hdfsBlockLocalityIndex based on StoreFile's cached value. 3. There is some static helper functions for load balancer to compute the block distribution on demand.
        Hide
        Ted Yu added a comment - - edited

        Nice patch Ming. This would be very useful for enhancing load balancing.

        Please use two spaces for tab.

        +            HDFSBlocksDistribution blocksdistriforstorefile = sf.getHDFSBlockDistribution();
        

        Please use camel case for the variable name: storeFileBlocksDistribution.

        Show
        Ted Yu added a comment - - edited Nice patch Ming. This would be very useful for enhancing load balancing. Please use two spaces for tab. + HDFSBlocksDistribution blocksdistriforstorefile = sf.getHDFSBlockDistribution(); Please use camel case for the variable name: storeFileBlocksDistribution.
        Hide
        Ted Yu added a comment -

        getTopBlockLocations() calls HRegion.computeHDFSBlocksDistribution() which has this code:

              HTableDescriptor tableDescriptor = FSUtils.getHTableDescriptor(conf, region.getTableNameAsString());
        

        It is an expensive call. We should look for faster alternative.

        Show
        Ted Yu added a comment - getTopBlockLocations() calls HRegion.computeHDFSBlocksDistribution() which has this code: HTableDescriptor tableDescriptor = FSUtils.getHTableDescriptor(conf, region.getTableNameAsString()); It is an expensive call. We should look for faster alternative.
        Hide
        Ted Yu added a comment -

        Do you see the following test failure, Ming ?

        Tests in error: 
          Broken_testSync(org.apache.hadoop.hbase.regionserver.wal.TestHLog): hdfs://localhost.localdomain:10354/user/hadoop/TestHLog/hlogdir/hlog.1312605480262, entryStart=19950029, pos=20971520, end=20971520, edit=59
        
        Show
        Ted Yu added a comment - Do you see the following test failure, Ming ? Tests in error: Broken_testSync(org.apache.hadoop.hbase.regionserver.wal.TestHLog): hdfs: //localhost.localdomain:10354/user/hadoop/TestHLog/hlogdir/hlog.1312605480262, entryStart=19950029, pos=20971520, end=20971520, edit=59
        Hide
        Ted Yu added a comment -
        +  public long getUniqueBlocksWeights()
        +  {
        +      return UniqueBlockTotalWeights;
        +  }
        

        Please rename the long field uniqueBlocksTotalWeight and rename the method getUniqueBlocksTotalWeight.

        Show
        Ted Yu added a comment - + public long getUniqueBlocksWeights() + { + return UniqueBlockTotalWeights; + } Please rename the long field uniqueBlocksTotalWeight and rename the method getUniqueBlocksTotalWeight.
        Hide
        Ming Ma added a comment -

        Thanks, Ted.

        Here is the update.

        Show
        Ming Ma added a comment - Thanks, Ted. Here is the update.
        Hide
        Ming Ma added a comment -

        More update to fix code style.

        Show
        Ming Ma added a comment - More update to fix code style.
        Hide
        Ming Ma added a comment -

        Fix copyright year for new file HDFSBlocksDistribution.java.

        Show
        Ming Ma added a comment - Fix copyright year for new file HDFSBlocksDistribution.java.
        Hide
        Ted Yu added a comment -
        +  private HTableDescriptor getTableDescriptor(byte[] tableName)
        +    throws TableExistsException, FileNotFoundException, IOException {
        +    return this.services.getTableDescriptors().get(Bytes.toString(tableName));
        

        I think we should handle the first two exceptions declared above.

        Show
        Ted Yu added a comment - + private HTableDescriptor getTableDescriptor( byte [] tableName) + throws TableExistsException, FileNotFoundException, IOException { + return this .services.getTableDescriptors().get(Bytes.toString(tableName)); I think we should handle the first two exceptions declared above.
        Hide
        Ming Ma added a comment -

        Thanks, Ted. Here is the fix.

        Show
        Ming Ma added a comment - Thanks, Ted. Here is the fix.
        Hide
        Ted Yu added a comment -

        +1 on patch.

        Show
        Ted Yu added a comment - +1 on patch.
        Hide
        stack added a comment -

        Very nice work Ming. Patch looks good. Nice tests (I didn't now you could set hostnames in minidfscluster).

        To be clear, we're adding a lookup of a storefiles filestatus after each open of a storefile (pity open doesn't just return this). Thats extra loading on namenode but should be fine I think in the scheme of things (I see you did calc's above – so yeah, should be fine)

        One method is named with a capital: MapHostNameToServerName. This is unorthodox. Other "heresies" are the new line before a function curly-bracket open, sometimes, or on loops.

        In HostAndWeight, what if two regionservers running on same host? Should it be HostPortAndWeight?

        You should just do a return on the string in the toString:

        +    String s = "number of unique hosts in the disribution=" +
        +      this.hostAndWeights.size();
        +    return s;
        

        ... and do we need that preamble 'number of unique hosts in the ...' (is this string right?)

        Is getTopBlockLocations actually used by the balancer? I don't see it.

        You might want to protect yourself against clusterstatus and master services being null in the regionserver.

        Show
        stack added a comment - Very nice work Ming. Patch looks good. Nice tests (I didn't now you could set hostnames in minidfscluster). To be clear, we're adding a lookup of a storefiles filestatus after each open of a storefile (pity open doesn't just return this). Thats extra loading on namenode but should be fine I think in the scheme of things (I see you did calc's above – so yeah, should be fine) One method is named with a capital: MapHostNameToServerName. This is unorthodox. Other "heresies" are the new line before a function curly-bracket open, sometimes, or on loops. In HostAndWeight, what if two regionservers running on same host? Should it be HostPortAndWeight? You should just do a return on the string in the toString: + String s = "number of unique hosts in the disribution=" + + this .hostAndWeights.size(); + return s; ... and do we need that preamble 'number of unique hosts in the ...' (is this string right?) Is getTopBlockLocations actually used by the balancer? I don't see it. You might want to protect yourself against clusterstatus and master services being null in the regionserver.
        Hide
        Ted Yu added a comment -

        getTopBlockLocations() will be used in another JIRA which enhances load balancer.

        Clusterstatus and master services are passed from master to balancer. Region server isn't involved.

        Regards

        Show
        Ted Yu added a comment - getTopBlockLocations() will be used in another JIRA which enhances load balancer. Clusterstatus and master services are passed from master to balancer. Region server isn't involved. Regards
        Hide
        stack added a comment -

        ...in the regionserver.

        Should have been 'in the loadbalancer'

        Show
        stack added a comment - ...in the regionserver. Should have been 'in the loadbalancer'
        Hide
        Ming Ma added a comment -

        Thanks, Stack. Here is the fix for everything except "what if two regionservers running on same host"

        HostAndWeight is used to capture general HDFS block distribution, these are datanode hosts; hbase isn't involved.. RS at runtime will query HostAndWeight with its own host name. If there are two RS instances on the same host, each will has its own hfiles instances' HostAndWeight and aggregate them independently.

        Show
        Ming Ma added a comment - Thanks, Stack. Here is the fix for everything except "what if two regionservers running on same host" HostAndWeight is used to capture general HDFS block distribution, these are datanode hosts; hbase isn't involved.. RS at runtime will query HostAndWeight with its own host name. If there are two RS instances on the same host, each will has its own hfiles instances' HostAndWeight and aggregate them independently.
        Hide
        stack added a comment -

        Committed to TRUNK. Thanks for the patch Ming (And the reviewing Ted)

        Show
        stack added a comment - Committed to TRUNK. Thanks for the patch Ming (And the reviewing Ted)
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #2108 (See https://builds.apache.org/job/HBase-TRUNK/2108/)
        HBASE-4114 Metrics for HFile HDFS block locality

        stack :
        Files :

        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HDFSBlocksDistribution.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/LoadBalancer.java
        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/TestFSUtils.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/RegionServerMetrics.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/FSUtils.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #2108 (See https://builds.apache.org/job/HBase-TRUNK/2108/ ) HBASE-4114 Metrics for HFile HDFS block locality stack : Files : /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HDFSBlocksDistribution.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/LoadBalancer.java /hbase/trunk/CHANGES.txt /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/TestFSUtils.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/RegionServerMetrics.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/FSUtils.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
        Hide
        Jonathan Hsieh added a comment -

        In LoadBalancer.java, there are two private functions that seem to be dead code — gettopblocklocations (Later changed to getTopBlockLocations in trunk) and getTableDescriptor.

        Are they going to be used in some subsequent patch or is this just code that snuck in?

        Show
        Jonathan Hsieh added a comment - In LoadBalancer.java, there are two private functions that seem to be dead code — gettopblocklocations (Later changed to getTopBlockLocations in trunk) and getTableDescriptor. Are they going to be used in some subsequent patch or is this just code that snuck in?
        Hide
        Ted Yu added a comment -

        LoadBalancer.getTableDescriptor is called by getTopBlockLocations in TRUNK.

        I have a patch which utilizes getTopBlockLocations in balanceCluster(). I haven't found time to test my patch yet.

        Show
        Ted Yu added a comment - LoadBalancer.getTableDescriptor is called by getTopBlockLocations in TRUNK. I have a patch which utilizes getTopBlockLocations in balanceCluster(). I haven't found time to test my patch yet.

          People

          • Assignee:
            Ming Ma
            Reporter:
            Ming Ma
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development