Cassandra / CASSANDRA-14415

Performance regression in queries for distinct keys


    Description

      Running Cassandra 3.0.16, we observed a major performance regression affecting SELECT DISTINCT key queries against certain tables.  Investigation (guided by helpful feedback from Benjamin on the dev list) traced the regression to two problems:

      • One is that Cassandra was reading more data from disk than was necessary to satisfy the query.  This was fixed under CASSANDRA-10657 in a later 3.x release.
      • The other, which remains even once the CASSANDRA-10657 fix is incorporated, is this code snippet in RebufferingInputStream:
            @Override
            public int skipBytes(int n) throws IOException
            {
                if (n < 0)
                    return 0;
                int requested = n;
                int position = buffer.position(), limit = buffer.limit(), remaining;
                // Consume whole buffers until the remaining skip fits in the current one.
                while ((remaining = limit - position) < n)
                {
                    n -= remaining;
                    buffer.position(limit);
                    // reBuffer() reads the next chunk from disk (decompressing it, for
                    // compressed readers) just so its contents can be discarded.
                    reBuffer();
                    position = buffer.position();
                    limit = buffer.limit();
                    if (position == limit)
                        // EOF: report how many bytes were actually skipped.
                        return requested - n;
                }
                buffer.position(position + n);
                return requested;
            }
        

        The gist of it is that to skip bytes, the stream reads those bytes into memory (decompressing them along the way, in the compressed-sstable case) and then throws them away.  In our tests we were spending a lot of time in this method, so it looked like the chief drag on performance.

      We noticed that the subclass of RebufferingInputStream in use for our queries, RandomAccessReader (over compressed sstables), implements a seek() method.  Overriding skipBytes() in it to use seek() instead was sufficient to fix the performance regression.
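
      Concretely, the override looks roughly like the following sketch (a sketch only, not the exact patch: it assumes RandomAccessReader exposes seek(), getPosition(), and length(), and exact names may vary by version):

            @Override
            public int skipBytes(int n) throws IOException
            {
                if (n <= 0)
                    return 0;
                // Jump directly to the target offset (clamped to the end of the
                // file) instead of reading and discarding the skipped chunks.
                long current = getPosition();
                long target = Math.min(current + n, length());
                seek(target);
                return (int) (target - current);
            }

      The win comes from seek() repositioning the reader directly, so the chunks between the old and new positions are never read or decompressed.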

      The performance difference is significant for tables with large values.  It's straightforward to evaluate with very simple key-value tables, e.g.:

      CREATE TABLE testtable (key TEXT PRIMARY KEY, value BLOB);

      We did some basic experimentation with the following variations (all in a single-node 3.11.2 cluster with off-the-shelf settings running on a dev workstation):

      • small values (1 KB, 100,000 entries), somewhat larger values (25 KB, 10,000 entries), and much larger values (1 MB, 10,000 entries);
      • compressible data (a single byte repeated) and uncompressible data (output from openssl rand $bytes); see the loading sketch after this list; and
      • with and without sstable compression.  (With compression, we use Cassandra's defaults.)
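
      For concreteness, data sets like these can be generated and loaded along the following lines (a rough sketch, not our actual harness; the keyspace name and the use of the DataStax Java driver are illustrative):

        import java.nio.ByteBuffer;
        import java.security.SecureRandom;
        import java.util.Arrays;
        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.PreparedStatement;
        import com.datastax.driver.core.Session;

        public class LoadTestTable
        {
            public static void main(String[] args)
            {
                int entries = 10_000;                    // e.g., the 1-MB data sets
                int valueSize = 1_000_000;
                boolean compressible = false;            // true: one byte repeated
                SecureRandom random = new SecureRandom();

                try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                     Session session = cluster.connect("testks"))
                {
                    PreparedStatement insert =
                        session.prepare("INSERT INTO testtable (key, value) VALUES (?, ?)");
                    byte[] value = new byte[valueSize];
                    for (int i = 0; i < entries; i++)
                    {
                        if (compressible)
                            Arrays.fill(value, (byte) 'x');
                        else
                            random.nextBytes(value);     // stands in for `openssl rand`
                        session.execute(insert.bind("key" + i, ByteBuffer.wrap(value)));
                    }
                }
            }
        }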

      The difference is most conspicuous for tables with large, uncompressible data and sstable decompression (which happens to describe the use case that triggered our investigation).  It is smaller but still readily apparent for tables with effective compression.  For uncompressible data without compression enabled, there is no appreciable difference.
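
      The timings that follow were collected with a loop of roughly this shape (again a sketch assuming the DataStax Java driver; the page size of 24 matches the queries described below):

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Row;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.SimpleStatement;

        public class TimeDistinctKeys
        {
            public static void main(String[] args)
            {
                try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                     Session session = cluster.connect("testks"))
                {
                    SimpleStatement query =
                        new SimpleStatement("SELECT DISTINCT key FROM testtable");
                    query.setFetchSize(24);              // the page size used in our runs
                    long start = System.nanoTime();
                    long count = 0;
                    for (Row ignored : session.execute(query))
                        count++;                         // iterating fetches every page
                    System.out.printf("%d keys in %.3f s%n",
                                      count, (System.nanoTime() - start) / 1e9);
                }
            }
        }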

      Here's what the performance looks like without our patch for the 1-MB entries (times in seconds, five consecutive runs for each data set, all exhausting the results from a SELECT DISTINCT key FROM ... query with a page size of 24):

      working on compressible
      5.21180510521
      5.10270500183
      5.22311806679
      4.6732840538
      4.84219098091
      working on uncompressible_uncompressed
      55.0423607826
      0.769015073776
      0.850513935089
      0.713396072388
      0.62596988678
      working on uncompressible
      413.292617083
      231.345913887
      449.524993896
      425.135111094
      243.469946861
      

      and with the fix:

      working on compressible
      2.86733293533
      1.24895811081
      1.108907938
      1.12742400169
      1.04647302628
      working on uncompressible_uncompressed
      56.4146180153
      0.895509958267
      0.922824144363
      0.772884130478
      0.731923818588
      working on uncompressible
      64.4587619305
      1.81325793266
      1.52577018738
      1.41769099236
      1.60442209244
      

      The long initial runs for the uncompressible data presumably come from repeatedly hitting the disk.  In contrast to the runs without the fix, the initial runs seem to be effective at warming the page cache (since most of the data is skipped, the data that is actually read fits in memory), so subsequent runs are much faster.

      For smaller data sets, RandomAccessReader.seek() and RebufferingInputStream.skipBytes() are approximately equivalent in behavior (most of the time, both reduce to moving the position pointer of an in-memory buffer), so there isn't much difference.  Here's before the fix for the 1-KB entries:

      working on small_compressible
      8.34115099907
      8.57280993462
      8.3534219265
      8.55130696297
      8.17362189293
      working on small_uncompressible_uncompressed
      7.85155582428
      7.54075288773
      7.50106596947
      7.39202189445
      7.95735621452
      working on small_uncompressible
      7.89256501198
      7.88875198364
      7.9013261795
      7.76551413536
      7.84927678108
      

      and after:

      working on small_compressible
      8.29225707054
      7.57822394371
      8.10092878342
      8.21332192421
      8.19347810745
      working on small_uncompressible_uncompressed
      7.74823594093
      7.81218004227
      7.68660092354
      7.95432114601
      7.77612304688
      working on small_uncompressible
      8.18260502815
      8.21010804176
      8.1233921051
      7.31543707848
      7.91079998016
      

      The effect is similar for the 25-KB entries, which might enjoy a slight performance benefit from the patch (perhaps because they're larger than the default buffer size defined in RandomAccessReader).  Before:

      working on medium_compressible
      0.988080978394
      1.02464294434
      0.977658033371
      1.02553391457
      0.769363880157
      working on medium_uncompressible_uncompressed
      1.07718396187
      1.08547902107
      1.12398791313
      1.10300898552
      1.08757281303
      working on medium_uncompressible
      0.940990209579
      0.917474985123
      0.768013954163
      0.871683835983
      0.814841985703
      

      and after:

      working on medium_compressible
      0.829009056091
      0.705173015594
      0.603646993637
      0.820069074631
      0.873830080032
      working on medium_uncompressible_uncompressed
      0.785156965256
      0.808106184006
      0.848286151886
      0.857885837555
      0.825689077377
      working on medium_uncompressible
      0.845101118088
      0.913790941238
      0.824147939682
      0.849114894867
      0.85981798172
      

      In short, this looks like a pretty straightforward performance win with negligible cost.  (It's worth noting that for our use case, disabling sstable compression is clearly the best solution, but there's still a reasonably clear benefit from this minor fix for data sets with larger compressible values, and presumably for data sets that mix compressible and uncompressible values in environments where storage is limited.)

People

    Assignee: Samuel Klock
    Reporter: Samuel Klock
    Authors: Samuel Klock
    Reviewers: Benedict Elliott Smith, Benjamin Lerer, Kurt Greaves