Cassandra
CASSANDRA-4205

SSTables are not updated with max timestamp on upgradesstables/compaction leading to non-optimal performance.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.0.10, 1.1.1
    • Component/s: Core
    • Labels:
      None

      Description

      We upgraded from 0.7.9 to 1.0.7 on a cluster with a heavy update load. After converting all the reads to named column reads instead of get_slice calls, we noticed that we still weren't getting the performance improvements implemented in CASSANDRA-2498. A single named column read was still touching multiple SSTables according to nodetool cfhistograms.

      To verify whether this was a reporting issue or a real one, we ran multiple tests with stress and saw that it worked as expected. After changing stress so that it ran the read/write test directly against the problematic CF (3 rounds of stress & flush), we noticed that stress also touched multiple SSTables (according to cfhistograms).

      So, the root of the problem is "something" left over from our pre-1.0 days. All SSTables were upgraded with upgradesstables, and have been written and compacted many times since the upgrade (4 months ago). The usage pattern for this CF is that it is constantly read and updated (overwritten), but no deletes.

      After discussing the problem with Brandon Williams on #cassandra, it seems the problem might be because a max timestamp has never been written for the old SSTables that were upgraded from pre 1.0. They have only been compacted, and the max timestamp is not recorded during compactions.

      A suggested fix is to special case this in upgradesstables so that a max timestamp always exists for all SSTables.

      06:08 < driftx> thorkild_: tx. The thing is we don't record the max timestamp on compactions, but we can do it specially for upgradesstables.
      06:08 < driftx> so, nothing in... nothing out.
      06:10 < thorkild_> driftx: ah, so when you upgrade from before the metadata was written, and that data is only feed through upgradesstables and compactions -> never properly written?
      06:10 < thorkild_> that makes sense.
      06:11 < driftx> right, we never create it, we just reuse it

      Attachments

      1. 4205.txt (3 kB) - Jonathan Ellis

        Activity

        Jonathan Ellis added a comment -

        Thanks for following up, Thorkild.

        Thorkild Stray added a comment -

        The original description is wrong (as we now know). Backporting the tool from 4211 and testing shows that the SSTables do indeed have the correct settings, and everything is updated correctly during the various compaction/upgradesstables etc.

        After further debugging I found that the reason why I was hitting all the SSTables was due to the CollationController falling back to calling collectAllData(). This was due to the filter not being an instance of NamesQueryFilter:

        CollationController.java:

        public ColumnFamily getTopLevelColumns()
        {
            return filter.filter instanceof NamesQueryFilter
                   && (cfs.metadata.cfType == ColumnFamilyType.Standard || filter.path.superColumnName != null)
                   && cfs.metadata.getDefaultValidator() != CounterColumnType.instance
                   ? collectTimeOrderedData()
                   : collectAllData();
        }

        This makes sense, but I struggled to find out why that happened on some CFs and not on others. After tracing it, I found that the trigger is whether or not the CF has the row cache enabled. If the row cache is enabled, ColumnFamilyStore.cacheRow(..) is called, and if you then get a cache miss, it ends up calling ColumnFamilyStore.getTopLevelColumns(...), which in turn creates a new CollationController with an IdentityFilter as the filter argument, and thus ends up using collectAllData().
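        To make that decision concrete, here is a minimal, self-contained sketch of the behaviour described above; the filter and method names mirror the Cassandra ones, but this is an illustration, not the actual source.

        // Minimal sketch of the read-path decision described above. The names mirror
        // Cassandra's (NamesQueryFilter, collectAllData, ...), but this is an
        // illustration of the described behaviour, not the actual Cassandra code.
        public class ReadPathSketch
        {
            interface IFilter {}
            static class NamesQueryFilter implements IFilter {}  // read a few named columns
            static class IdentityFilter implements IFilter {}    // read the whole row (cache repopulation)

            // The optimized, timestamp-ordered path only applies to named-column reads.
            static String choosePath(IFilter filter)
            {
                return filter instanceof NamesQueryFilter
                       ? "collectTimeOrderedData"  // can stop once newer sstables satisfy the query
                       : "collectAllData";         // must consult every sstable containing the row
            }

            public static void main(String[] args)
            {
                // Direct named-column read: optimized path.
                System.out.println(choosePath(new NamesQueryFilter()));  // collectTimeOrderedData
                // Row-cache miss: the whole row is loaded to repopulate the cache, which
                // uses an identity-style filter and therefore touches all sstables.
                System.out.println(choosePath(new IdentityFilter()));    // collectAllData
            }
        }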

        In our case, we have a lot of cache-misses since the update frequency is so high and often we only read an entry once before it is replaced.

        Testing with 'stress' before and after showed it moved from using all sstables for every query to 100% of reads being satisfied by a single SSTable read (according to cfhistograms).

        I tested a preliminary configuration with the row cache in production, and it looked like a lot more of the queries were served by a single sstable. We still have stragglers using more, but that can easily be due to the randomization that size-tiered compaction introduces in this case.

        I don't see an easy way to fix this properly, though, without changing the row cache to be a row+filter cache (covered by CASSANDRA-1956 it seems). Since you need to load the whole row into the row cache, you can't use named columns.

        Jonathan Ellis added a comment -

        good catch, committed w/ current version fix

        Yuki Morishita added a comment -

        Jonathan,

        Re: patch, you have to bump up CURRENT_VERSION to "hd" but otherwise looking good to me.

        Jonathan Ellis added a comment -

        Checked Brandon's sstables with CASSANDRA-4211. One had a max timestamp of 1335979912016 after upgradesstables, the other 1335979951175. cfhistograms showed about 20% of reads hit both sstables.

        This sounds about right; it's reasonable that the sstable with the higher max timestamp will contain some rows whose columns are actually older than the other sstable's max. The CASSANDRA-2498 approach will always start with the newer sstable, but it will only skip the older one if the first-seen version of the requested columns has a timestamp newer than the max on the older sstable.
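        As a worked example of that skip rule (the method name and the comparison timestamps are illustrative; only the older sstable's max timestamp comes from the check above):

        // Worked example of the CASSANDRA-2498 skip rule described above; this is an
        // illustration, not Cassandra API. An older sstable can be skipped only when
        // the columns already found are all newer than that sstable's max timestamp.
        public class SkipRuleExample
        {
            static boolean canSkipOlderSSTable(long oldestTimestampSeen, long olderSSTableMaxTimestamp)
            {
                return oldestTimestampSeen > olderSSTableMaxTimestamp;
            }

            public static void main(String[] args)
            {
                long olderSSTableMax = 1335979912016L; // max timestamp reported above

                // The column versions found in the newer sstable postdate olderSSTableMax:
                // the older sstable cannot hold anything fresher, so it is skipped.
                System.out.println(canSkipOlderSSTable(1335979950000L, olderSSTableMax)); // true

                // A column found in the newer sstable predates olderSSTableMax: the older
                // sstable might contain a newer version, so it must be read as well
                // (the ~20% of reads observed to hit both sstables).
                System.out.println(canSkipOlderSSTable(1335979900000L, olderSSTableMax)); // false
            }
        }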

        So, I think there is no bug here related to upgraded timestamps; it just isn't a magic bullet that prevents all multi-sstable reads.

        The patch for creating a new version to represent "we have correct timestamps including row tombstones" is still relevant, though.

        Brandon Williams added a comment -

        I can reproduce, upgrading from 0.7 to 1.0 and running upgradesstables afterward.

        Jonathan Ellis added a comment -

        patch to add version hd indicating row tombstones have been computed correctly
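        A hypothetical sketch of what such a format-version gate could look like; the names below are illustrative, not the actual Cassandra Descriptor API.

        // Hypothetical sketch of a format-version gate like the one the patch adds;
        // names are illustrative, not the actual Cassandra code.
        public class VersionGateSketch
        {
            // SSTables written at or after version "hd" are known to carry max timestamps
            // that include row tombstones.
            static boolean hasReliableMaxTimestamp(String sstableVersion)
            {
                return sstableVersion.compareTo("hd") >= 0;
            }

            public static void main(String[] args)
            {
                System.out.println(hasReliableMaxTimestamp("hc")); // false: pre-patch sstable
                System.out.println(hasReliableMaxTimestamp("hd")); // true: written with the new version
            }
        }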

        Jonathan Ellis added a comment -

        A suggested fix is to special case this in upgradesstables so that a max timestamp always exists for all SSTables.

        Looking at the code, scrub, upgradesstables, and user-defined compactions all force deserialization + max timestamp computation. The only operation that does not is cleanup.

        Jonathan Ellis added a comment -

        We have a related but more serious bug, as noted by Sam Tunnicliffe in CASSANDRA-4116: CF.maxTimestamp doesn't include row tombstones. This can cause the CollationController code to return incorrect results.
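        A simplified illustration of why that is dangerous; the timestamps are made up, and only the comparison against the sstable's recorded max matters.

        // Simplified illustration (not Cassandra code) of the bug described above:
        // if an sstable's max timestamp ignores its row-level tombstone, the
        // time-ordered read path may decide the sstable holds nothing newer and skip
        // it, returning data the tombstone should have deleted.
        public class TombstoneTimestampExample
        {
            public static void main(String[] args)
            {
                long newestColumn    = 1_000L; // newest regular column in the sstable
                long rowTombstone    = 2_000L; // row deletion written later
                long firstSeenColumn = 1_500L; // column already found in a newer sstable

                long maxIgnoringTombstone  = newestColumn;                         // buggy metadata
                long maxIncludingTombstone = Math.max(newestColumn, rowTombstone); // correct metadata

                // Buggy metadata lets the read path skip the sstable, missing the deletion.
                System.out.println(firstSeenColumn > maxIgnoringTombstone);  // true  -> skipped, wrong result
                // Correct metadata forces the sstable to be read and the tombstone applied.
                System.out.println(firstSeenColumn > maxIncludingTombstone); // false -> read, correct result
            }
        }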


          People

          • Assignee: Jonathan Ellis
          • Reporter: Thorkild Stray
          • Reviewer: Yuki Morishita
          • Votes: 0
          • Watchers: 2
