Uploaded image for project: 'Accumulo'
  1. Accumulo
  2. ACCUMULO-4112

Poor performance due to MinC start/stop updates are always hsync'd

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.2, 1.8.0
    • Component/s: tserver
    • Labels:
      None
    • Environment:

      Fluo testing on a 20-node cluster

      Description

      Keith Turner writes:

      I was running a Fluo test with 1.8.0-SNAP on my workstation. My Fluo table had a ton of tablets. I was seeing terrible performance. I started looking at the tserver and noticed it was always calling hsync. I tracked down the problem to the fact that when minc start and stop events are written to the log they are always written w/ sync level. My poor little tserver was constantly minor compacting (probably had around 600 tablets that were all being written to).

      I changed the test config to create like 15 tablets and the performance was much better. All cores were 100% utilized, which was not the case when hsync was always called.

      1. Sync-Flush-Log-Performance.png
        30 kB
        Eric Newton
      2. MinCFlushPerfTest.java
        4 kB
        Eric Newton
      3. HSyncOverheadExperiment.png
        16 kB
        Eric Newton

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner closed the pull request at:

          https://github.com/apache/accumulo/pull/79

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner closed the pull request at: https://github.com/apache/accumulo/pull/79
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/79#discussion_r55533890

          — Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java —
          @@ -2966,13 +2966,26 @@ public Void run() {
          }
          }

          + private Durability getMincEventDurability(KeyExtent extent) {
          + TableConfiguration conf;
          + if (extent.isMeta() || extent.isRootTablet()) {
          — End diff –

          > Is there some other setting for user tables that's more appropriate?

          This is the best setting I know of... Since the walog minc entries are used during recovery, its nice to have them at the same sync level as the tablets metadata.

          > so the || extent.isRootTablet() is redundant

          Nice catch, I wrongly assumed isMeta() was a more narrow check. I used to know it did a wider check, but had forgotten.

          > I think refactoring KeyExtent#isMeta is more appropriate

          Changing its name would need to be a separate PR, because it would be very noisy drowning out the changes for 4112. I agree another name that implies more than `accumulo.metadata` would be good, however I am not sure what that name would be.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/79#discussion_r55533890 — Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java — @@ -2966,13 +2966,26 @@ public Void run() { } } + private Durability getMincEventDurability(KeyExtent extent) { + TableConfiguration conf; + if (extent.isMeta() || extent.isRootTablet()) { — End diff – > Is there some other setting for user tables that's more appropriate? This is the best setting I know of... Since the walog minc entries are used during recovery, its nice to have them at the same sync level as the tablets metadata. > so the || extent.isRootTablet() is redundant Nice catch, I wrongly assumed isMeta() was a more narrow check. I used to know it did a wider check, but had forgotten. > I think refactoring KeyExtent#isMeta is more appropriate Changing its name would need to be a separate PR, because it would be very noisy drowning out the changes for 4112. I agree another name that implies more than `accumulo.metadata` would be good, however I am not sure what that name would be.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user wjsl commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/79#discussion_r55471617

          — Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java —
          @@ -2966,13 +2966,26 @@ public Void run() {
          }
          }

          + private Durability getMincEventDurability(KeyExtent extent) {
          + TableConfiguration conf;
          + if (extent.isMeta() || extent.isRootTablet()) {
          — End diff –

          This check is a bit confusing. To me it reads "If the extent is for the metadata table or the root tablet, use the root tablet's configuration. Otherwise, use the metadata configuration." Is there some other setting for user tables that's more appropriate?

          `KeyExtent#isMeta` already returns whether or not it's for the metadata table or the root tablet, so the `|| extent.isRootTablet()` is redundant. However, I think refactoring `KeyExtent#isMeta` is more appropriate to make it more clear about what it is checking.

          Show
          githubbot ASF GitHub Bot added a comment - Github user wjsl commented on a diff in the pull request: https://github.com/apache/accumulo/pull/79#discussion_r55471617 — Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java — @@ -2966,13 +2966,26 @@ public Void run() { } } + private Durability getMincEventDurability(KeyExtent extent) { + TableConfiguration conf; + if (extent.isMeta() || extent.isRootTablet()) { — End diff – This check is a bit confusing. To me it reads "If the extent is for the metadata table or the root tablet, use the root tablet's configuration. Otherwise, use the metadata configuration." Is there some other setting for user tables that's more appropriate? `KeyExtent#isMeta` already returns whether or not it's for the metadata table or the root tablet, so the `|| extent.isRootTablet()` is redundant. However, I think refactoring `KeyExtent#isMeta` is more appropriate to make it more clear about what it is checking.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on the pull request:

          https://github.com/apache/accumulo/pull/79#issuecomment-194023887

          :+1: pending some successful ITs.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on the pull request: https://github.com/apache/accumulo/pull/79#issuecomment-194023887 :+1: pending some successful ITs.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user keith-turner opened a pull request:

          https://github.com/apache/accumulo/pull/79

          ACCUMULO-4112 use metadata table durability config when writing minc …

          …events to walog

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/keith-turner/accumulo ACCUMULO-4112

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/accumulo/pull/79.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #79


          commit 714bb77b56f8b7177421908bed44c08bc0866b93
          Author: Keith Turner <kturner@apache.org>
          Date: 2016-03-08T15:19:27Z

          ACCUMULO-4112 use metadata table durability config when writing minc events to walog


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user keith-turner opened a pull request: https://github.com/apache/accumulo/pull/79 ACCUMULO-4112 use metadata table durability config when writing minc … …events to walog You can merge this pull request into a Git repository by running: $ git pull https://github.com/keith-turner/accumulo ACCUMULO-4112 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/accumulo/pull/79.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #79 commit 714bb77b56f8b7177421908bed44c08bc0866b93 Author: Keith Turner <kturner@apache.org> Date: 2016-03-08T15:19:27Z ACCUMULO-4112 use metadata table durability config when writing minc events to walog
          Hide
          ctubbsii Christopher Tubbs added a comment -

          Reopening, because the changes for this should either be reverted or another solution should be applied.

          Show
          ctubbsii Christopher Tubbs added a comment - Reopening, because the changes for this should either be reverted or another solution should be applied.
          Hide
          ctubbsii Christopher Tubbs added a comment -

          Updated the subject line to include "poor performance" to give additional context to the current "Cannot Reproduce" status. (Otherwise, it reads like the start/stop updates being hsync'd couldn't be reproduced, which wouldn't be accurate.)

          Show
          ctubbsii Christopher Tubbs added a comment - Updated the subject line to include "poor performance" to give additional context to the current "Cannot Reproduce" status. (Otherwise, it reads like the start/stop updates being hsync'd couldn't be reproduced, which wouldn't be accurate.)
          Hide
          ecn Eric Newton added a comment -

          Given the possibility of replaying data, I'm going to partially revert 497a884c9da50c6c3620816a0e1fd5321ca6eaed. I'm going to keep the SYNC for the MinC completion.

          Show
          ecn Eric Newton added a comment - Given the possibility of replaying data, I'm going to partially revert 497a884c9da50c6c3620816a0e1fd5321ca6eaed. I'm going to keep the SYNC for the MinC completion.
          Hide
          ecn Eric Newton added a comment -

          I ran another set of experiments, trying to determine the basic costs of SYNC over LOG Durabilities for lots of MinCs.

          The main determining factor for speed was using a "reasonable" IMM. Using 100k IMM for 10 tablets (plus the metadata tablets), leaves less than 10k per tablet. By simply increasing the IMM to 200k, performance is improved (and, probably fewer MinCs occurred).

          So, I've been unable to reproduce the issue that Keith Turner experienced with Fluo in a small test. But I have learned that there is a bare minimum IMM size required for a given number of tablets. And that's 10-20K for each hosted tablet. These are easily affordable on any modern computer even with 1000 tablets.

          I've attached an image with the numbers (sorry, but that's easier than trying to format the wiki markup).

          Show
          ecn Eric Newton added a comment - I ran another set of experiments, trying to determine the basic costs of SYNC over LOG Durabilities for lots of MinCs. The main determining factor for speed was using a "reasonable" IMM. Using 100k IMM for 10 tablets (plus the metadata tablets), leaves less than 10k per tablet. By simply increasing the IMM to 200k, performance is improved (and, probably fewer MinCs occurred). So, I've been unable to reproduce the issue that Keith Turner experienced with Fluo in a small test. But I have learned that there is a bare minimum IMM size required for a given number of tablets. And that's 10-20K for each hosted tablet. These are easily affordable on any modern computer even with 1000 tablets. I've attached an image with the numbers (sorry, but that's easier than trying to format the wiki markup).
          Hide
          ecn Eric Newton added a comment -

          Christopher Tubbs thought of an interesting issue:

          • tablet finishes a minor compaction
          • the metadata table is updated with a new file
          • the WALog gets the "MinC complete" marker (not hsync'd)
          • the tablet proceeds to major compact away the reference to the recent MinC file
          • tablet server crashes before the WALog is sync'd

          In this case, the mutations in the WALog will be replayed, and they exist in the tablet's existing files.

          Perhaps it is better to use the Durability of the appropriate meta table, rather than some fixed Durability value.

          Show
          ecn Eric Newton added a comment - Christopher Tubbs thought of an interesting issue: tablet finishes a minor compaction the metadata table is updated with a new file the WALog gets the "MinC complete" marker (not hsync'd) the tablet proceeds to major compact away the reference to the recent MinC file tablet server crashes before the WALog is sync'd In this case, the mutations in the WALog will be replayed, and they exist in the tablet's existing files. Perhaps it is better to use the Durability of the appropriate meta table, rather than some fixed Durability value.
          Hide
          ecn Eric Newton added a comment -

          Discussing this offline with Keith Turner, it seems that the current recovery code should handle failed writes of the "stop MinC" to the WALog. This is very tricky business, of course. The tablet server updates the metadata table for a MinC before writing the "stop MinC" marker to the WALog, so we have always needed to recover from a failure between the two steps. Failure to write the marker should not add more failures, except if the user is lowering the Durability of their own mutations. But that trade-off is entirely up to the user.

          Show
          ecn Eric Newton added a comment - Discussing this offline with Keith Turner , it seems that the current recovery code should handle failed writes of the "stop MinC" to the WALog. This is very tricky business, of course. The tablet server updates the metadata table for a MinC before writing the "stop MinC" marker to the WALog, so we have always needed to recover from a failure between the two steps. Failure to write the marker should not add more failures, except if the user is lowering the Durability of their own mutations. But that trade-off is entirely up to the user.
          Hide
          ecn Eric Newton added a comment -

          I've updated the test. The test performs a series of Conditional updates on a table with many tablets. I lowered the durability on the metadata table to flush.

          I ran the test sending the "start MinC" and "stop MinC" writes to the WALog at different durabilities. The screenshot of the performance differences is shown.

          It should be noted that the performance for a single tablet is always much faster, even with full sync Durability.

          Show
          ecn Eric Newton added a comment - I've updated the test. The test performs a series of Conditional updates on a table with many tablets. I lowered the durability on the metadata table to flush. I ran the test sending the "start MinC" and "stop MinC" writes to the WALog at different durabilities. The screenshot of the performance differences is shown. It should be noted that the performance for a single tablet is always much faster, even with full sync Durability.
          Hide
          ecn Eric Newton added a comment -

          In an offline conversation with Keith Turner, I learned the Durability for the table (and the metadata table) were set on FLUSH, not SYNC, which will magnify the effects of SYNC when used for MinC start/stop WALog entries.

          Show
          ecn Eric Newton added a comment - In an offline conversation with Keith Turner , I learned the Durability for the table (and the metadata table) were set on FLUSH, not SYNC, which will magnify the effects of SYNC when used for MinC start/stop WALog entries.
          Hide
          ecn Eric Newton added a comment -

          Ah, I'll update the test. Thanks!

          Show
          ecn Eric Newton added a comment - Ah, I'll update the test. Thanks!
          Hide
          kturner Keith Turner added a comment -

          I was using 1.8.0-SNAPSHOT (an older version, I have a branch for ACCUMULO-4066 that I have not rebased in while). Looking at the attached test one difference with my scenarion is it uses BatchWriter. Fluo uses ConditionalWriter and BatchWriter. The ConditionalWriter will be much more sensitive to these pauses.

          Show
          kturner Keith Turner added a comment - I was using 1.8.0-SNAPSHOT (an older version, I have a branch for ACCUMULO-4066 that I have not rebased in while). Looking at the attached test one difference with my scenarion is it uses BatchWriter. Fluo uses ConditionalWriter and BatchWriter. The ConditionalWriter will be much more sensitive to these pauses.
          Hide
          ecn Eric Newton added a comment -

          I wrote a little MAC test for this scenario:

          • Lots of tablets per server.
          • Write lots of mutations uniformly over the tablets.
          • Lots of minor compaction threads.
          • Small in-memory map.

          If I change the Durability to LOG, there's no improvement in performance.

          I had to tweak the test a bit in order to not overwhelm MiniDFS.

          I'll attach the little test. Perhaps there's something that fluo is doing differently?

          BTW, if you are running with something less than 1.8.0-SNAPSHOT you are probably seeing the effect of updating the metadata table with WAL markers (ACCUMULO-3423). The only solution is to use fewer tablets (or release 1.8.0).

          Show
          ecn Eric Newton added a comment - I wrote a little MAC test for this scenario: Lots of tablets per server. Write lots of mutations uniformly over the tablets. Lots of minor compaction threads. Small in-memory map. If I change the Durability to LOG, there's no improvement in performance. I had to tweak the test a bit in order to not overwhelm MiniDFS. I'll attach the little test. Perhaps there's something that fluo is doing differently? BTW, if you are running with something less than 1.8.0-SNAPSHOT you are probably seeing the effect of updating the metadata table with WAL markers ( ACCUMULO-3423 ). The only solution is to use fewer tablets (or release 1.8.0).
          Hide
          ecn Eric Newton added a comment -

          Sync'ing flushes was a conservative decision. We should revisit it.

          I'm very confident we don't need to flush the start. I haven't proven to myself that flushing the stop marker is required.

          Show
          ecn Eric Newton added a comment - Sync'ing flushes was a conservative decision. We should revisit it. I'm very confident we don't need to flush the start. I haven't proven to myself that flushing the stop marker is required.

            People

            • Assignee:
              kturner Keith Turner
              Reporter:
              ecn Eric Newton
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 1h
                1h

                  Development