HBase / HBASE-2165

Improve fragmentation display and implementation

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.20.4, 0.90.0
    • Fix Version/s: 0.92.0
    • Component/s: None
    • Labels:
      None

      Description

      Improve by

      • moving the "blocking" FS scan into a background thread so that the UI loads fast, initially shows "n/a", and displays the proper numbers once the scan completes (see the sketch after this list)
      • explaining to the user what fragmentation means (better hints or help text in the UI)
      • switching ROOT (and maybe even .META.?) to simply show "Yes" or a tick mark when it is fragmented, since with only a single region its value can only be 0% or 100%
      • also computing the space occupied by each table and the total, and - if easily done - adding a graph to display it (a Google pie chart would be nice, but it is an external link)
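As a rough illustration of the first bullet, here is a minimal sketch of moving the scan into a background thread, assuming the pre-0.92 directory layout /hbase/<table>/<region>/<family>/<storefile>; the class, the scheduling, and the "more than one store file per region" definition are illustrative, not the actual HBase implementation:

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FragmentationCache {
  // null until the first scan finishes; the JSP renders "n/a" in that case.
  private final AtomicReference<Map<String, Integer>> latest =
      new AtomicReference<Map<String, Integer>>();
  private final ScheduledExecutorService scanner =
      Executors.newSingleThreadScheduledExecutor();

  /** Starts the periodic scan; the UI never waits on the NameNode. */
  public void start(final Configuration conf, final Path rootDir, long intervalMinutes) {
    scanner.scheduleWithFixedDelay(new Runnable() {
      public void run() {
        try {
          latest.set(scan(FileSystem.get(conf), rootDir));
        } catch (IOException e) {
          // Keep the previous snapshot; a region deleted mid-scan must not kill the thread.
        }
      }
    }, 0, intervalMinutes, TimeUnit.MINUTES);
  }

  /** Walks /hbase/<table>/<region>/<family> and returns per-table fragmentation in percent. */
  private Map<String, Integer> scan(FileSystem fs, Path rootDir) throws IOException {
    Map<String, Integer> result = new HashMap<String, Integer>();
    for (FileStatus table : fs.listStatus(rootDir)) {
      if (!table.isDir()) continue;
      int regions = 0, fragmented = 0;
      for (FileStatus region : fs.listStatus(table.getPath())) {
        if (!region.isDir()) continue;
        regions++;
        int storeFiles = 0;
        for (FileStatus family : fs.listStatus(region.getPath())) {
          if (family.isDir()) storeFiles += fs.listStatus(family.getPath()).length;
        }
        if (storeFiles > 1) fragmented++;
      }
      if (regions > 0) result.put(table.getPath().getName(), fragmented * 100 / regions);
    }
    return result;
  }

  /** Non-blocking read for the UI; null means the first scan has not completed yet ("n/a"). */
  public Map<String, Integer> snapshot() {
    return latest.get();
  }
}
{code}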

        Issue Links

          Activity

          Jean-Daniel Cryans added a comment -

          I love these ideas.

          Lars George added a comment -

          From HBASE-2181:

          Given the potentially low and misleading value of the metric, and how much effort must be expended to collect it, I would argue that at the very least we should allow users to disable the feature completely.
          The first problem is that the data the metric delivers is not very useful. On any given busy system this value is often 100%. On a sample system here, 12% of the tables were at either 0 or 100%. Furthermore, the 100% figure is not particularly informative: if a table has 100% 'fragmentation' it does not necessarily mean that the table is in dire need of compaction. The HBase compaction code will generally keep at least 2 store files around - it refuses to minor compact older and larger files, preferring to merge small files. Thus, on a table taking writes on all regions, the expected fragmentation is in fact 100%, and this is not a bad thing either. Considering that compacting a 500GB table will take an hour and hammer a cluster, misleading users into striving for 0% is not ideal.

          The other major problem with this feature is that collecting the data is non-trivial on larger clusters. In a test, running an lsr on a Hadoop cluster to generate 15k lines of output pegged the namenode at over 100% CPU for a few seconds. On a cluster with 7,000 regions we can easily have 14,000 files (2 store files per region is typical), causing spikes against the namenode just to generate this statistic.

          I would propose three courses of action:

          • allow complete disablement of the feature, including the background thread and the UI display
          • change the metric to mean '# of regions with > 5 store files'
          • replace the metric with a completely different one that attempts to capture the spirit of the intent but with less load
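For the second proposal, a hedged sketch of the '# of regions with > 5 store files' metric, assuming the master already receives per-region store file counts in region server reports so that no NameNode scan is needed; the RegionInfo interface and its accessor are hypothetical names, not the actual HBase API:

{code:java}
import java.util.Collection;

public final class FragmentationMetric {

  /** Hypothetical view of a region as reported by its region server. */
  public interface RegionInfo {
    int getStorefileCount();
  }

  /** Counts regions whose store file count exceeds the threshold (e.g. 5). */
  public static int regionsOverThreshold(Collection<? extends RegionInfo> regions, int threshold) {
    int over = 0;
    for (RegionInfo r : regions) {
      if (r.getStorefileCount() > threshold) {
        over++;
      }
    }
    return over;
  }
}
{code}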
          Lars George added a comment -

          I agree with Ryan's suggestions in HBASE-2181, so also add

          • a feature to switch it off (and then also hide the UI components)
          • make "# of regions to consider fragmented" a (for now hidden) config value defaulting to 5

          Do you think doing a dfs -lsr /hbase is too resource-intensive, given that we only do it at the configured interval and in the background? Could we throttle it?
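A hedged sketch of how the switch, the interval, and the threshold could be wired together; the property names below are hypothetical, not actual HBase configuration keys:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FragmentationConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Scan interval in minutes; 0 or -1 disables the scan and hides the UI columns.
    long intervalMinutes = conf.getLong("hbase.master.fragmentation.interval.min", -1);
    // A region with more than this many store files counts as fragmented.
    int threshold = conf.getInt("hbase.master.fragmentation.storefile.threshold", 5);
    boolean enabled = intervalMinutes > 0;
    System.out.println("fragmentation scan enabled=" + enabled
        + ", interval=" + intervalMinutes + " min, threshold=" + threshold);
  }
}
{code}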

          stack added a comment -

          I don't know how to throttle it. It's a query to the NN, IIRC, so maybe not too bad. It should be done rarely though - every 5 minutes? Every ten minutes? Can you check it? See whether HBase throughput goes down under load while a full lsr runs, or just browse how this is done in the Hadoop code to make sure there is no big synchronization block or something on the NN side.

          I agree it needs more explanation at a minimum. I'm still not sure what it indicates (witness Cosmin's question on IRC yesterday).

          Lars George added a comment -

          @Stack, I think that is why Ryan posted his new issue (which I closed as a dupe and copied here). The idea is to show the experienced (and that may be the crux) user what the tables are up to. I recall being asked how many store files my .META. had, and when I said 10 I heard "Way too much! Do compact". With the UI I am trying to give anyone quick access to this sort of information. As I said, it may not be the best way to do it, as a write-intensive system will always be fragmented.

          Maybe we print not the percentage but the number of regions vs. store files, to get a ratio that could also be used to see how many files (on average, I guess) each table has. The other issue, I think, is size. I often wonder how large a table is, not in rows but in terms of storage. Sure, a quick "hadoop dfs -du /hbase/tablename" or so gives the same, but the UI would be a nice spot to have this sort of info. And if I scan at every configurable interval N anyway to see how many store files I have, then the size is free information.

          Also, setting that interval to 0 or -1 would disable the whole thing altogether and not worry anyone. Maybe we disable it by default; that way no one wonders what it means, and the experienced user can enable it as an advanced feature. As I said above, when this is disabled I would even hide those columns so that there is no constant "n/a" shown. Does this make more sense?
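A hedged sketch of the table-size idea, equivalent to running hadoop dfs -du /hbase/<tablename>; the helper and how it would be wired into the periodic scan are illustrative:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableSize {

  /** Total bytes stored under the table directory (without HDFS replication). */
  public static long tableBytes(Configuration conf, Path rootDir, String tableName)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // getContentSummary() asks the NameNode for the aggregate instead of walking the tree client-side.
    return fs.getContentSummary(new Path(rootDir, tableName)).getLength();
  }
}
{code}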

          ryan rawson added a comment -

          my point is that:

          • an experienced user (me) finds the display pointless as-is
          • on large clusters it means hammering the NN to display a pointless stat...

          We need to up the value of this stat to justify the resources consumed retrieving it, reduce the resources, or potentially cut it completely.

          Lars George added a comment -

          @ryan, not sure I understand. As we discussed above, we would query either not at all or only every N minutes, and since it only hits the NN it should be fast and not too resource-intensive, as the NN has it all in RAM I would suppose.

          Also, you are not the target audience; you are not a "user", you are an HBase developer. I am looking at those who use HBase in production but are not otherwise involved. Every bit of information we can give them helps, I'd say. And since we do not switch this on by default, it is left to those who wish to have the number so they can further see what their cluster is doing and its status quo.

          As I said above, the reasoning is that a terribly fragmented cluster is running slower than it should be. That alone justifies why someone may want to know. It also adds the "size" of tables, something that is also not visible to users right now. How much raw data is stored may often be all they need, and no rowcounter run is required.

          Jean-Daniel Cryans added a comment -

          Saw an issue with fragmentation in the web UI today: I refreshed it around the same time a region was deleted, and a FileNotFound exception was thrown.

          Lars George added a comment -

          Thanks JD, noted. The FS scan needs to be guarded against IOExceptions. Will do that when I redo this stuff.
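A minimal sketch of such a guard, assuming the scan lists one directory level at a time; the helper name is illustrative:

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SafeListing {

  /** Lists a directory, returning an empty array if it was deleted while we were scanning. */
  public static FileStatus[] listOrEmpty(FileSystem fs, Path dir) throws IOException {
    try {
      FileStatus[] files = fs.listStatus(dir);
      // Older Hadoop versions return null for a missing path instead of throwing.
      return files == null ? new FileStatus[0] : files;
    } catch (FileNotFoundException e) {
      // The region was deleted (e.g. after a split) between listing its parent and statting it.
      return new FileStatus[0];
    }
  }
}
{code}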

          ryan rawson added a comment -

          I'm not sure it's perfectly clear that 'more fragmented = slower'. Certainly we should be as fast with 2 store files as with 1, since it is nearly impossible on a real cluster to get down to 1 store file per region.

          Another issue is that this calculation is done inline with the web hit (contrary to what everyone has told me); jstack says thus:

          "1674385661@qtp0-5" prio=10 tid=0x0000000041f25800 nid=0x419a in Object.wait() [0x00007f43b0fe8000]
          java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)
          at java.lang.Object.wait(Object.java:485)
          at org.apache.hadoop.ipc.Client.call(Client.java:752)

          • locked <0x00007f43c77725f8> (a org.apache.hadoop.ipc.Client$Call)
            at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:223)
            at $Proxy0.getListing(Unknown Source)
            at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
            at $Proxy0.getListing(Unknown Source)
            at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:709)
            at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:289)
            at org.apache.hadoop.hbase.util.FSUtils.getTableFragmentation(FSUtils.java:441)
            at org.apache.hadoop.hbase.util.FSUtils.getTableFragmentation(FSUtils.java:395)
            at org.apache.hadoop.hbase.master.HMaster.getTableFragmentation(HMaster.java:1265)
            at org.apache.hadoop.hbase.generated.master.master_jsp._jspService(master_jsp.java:69)
            at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
            at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
            at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
            at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
            at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:667)
            at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
            at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
            at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
            at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
            at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
            at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
            at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
            at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
            at org.mortbay.jetty.Server.handle(Server.java:324)
            at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
            at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
            at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
            at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
            at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
            at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
            at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
          Lars George added a comment -

          @ryan: this JIRA issue is exactly about improving that.

          stack added a comment -

          Let's remove fragmentation for the 0.20.4 release. I'll make a new issue and move this out to 0.20.5.

          ryan rawson added a comment -

          PS: I disabled this in SU's internal build for trunk because it makes the master UI impossible to use on a larger cluster.

          Lars George added a comment -

          Hmm, this was a full reversal; bummer, I did not know you were that "desperate". The changes also included values that were added to the RegionServerMetrics and provided information about the compaction queues - those were fast and could have stayed (also see HBASE-2143). In fact, only the JSPs should have been edited, because they are what trigger the slow DFS scan synchronously.

          I removed the 0.20 target altogether, as by the looks of it no one is really interested in this, and left the 0.21.0 target as I may fix this all up there once and for all.

          stack added a comment -

          Moved from 0.21 to 0.22 just after merge of old 0.20 branch into TRUNK.

          stack added a comment -

          We don't have this attribute any more. Resolving the issue until it comes back.


            People

            • Assignee:
              Unassigned
              Reporter:
              Lars George
            • Votes:
              1
              Watchers:
              1
