Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 2.1.5

    Description

      Currently you can't implement something similar to describe_splits_ex purely from a native protocol driver. https://datastax-oss.atlassian.net/browse/JAVA-312 is open to make it easy for a java-driver client to get ownership information, but you still need the data sizing part to compute splits of a given size. We should add the sizing information to a system table so that native clients can get to it.
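
      As an illustration of what this enables, a native-protocol client could derive splits by reading the estimates table directly. The sketch below assumes the DataStax Java driver and the system.size_estimates schema that the patch discussed further down ends up writing (one row per primary token range, with partitions_count and mean_partition_size); the contact point and keyspace/table names are placeholders.

      import com.datastax.driver.core.Cluster;
      import com.datastax.driver.core.Row;
      import com.datastax.driver.core.Session;

      public class SizeEstimatesExample
      {
          public static void main(String[] args)
          {
              try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                   Session session = cluster.connect())
              {
                  // One row per primary token range of the coordinator node for the given table.
                  for (Row row : session.execute(
                          "SELECT range_start, range_end, partitions_count, mean_partition_size " +
                          "FROM system.size_estimates WHERE keyspace_name = ? AND table_name = ?",
                          "my_keyspace", "my_table"))
                  {
                      // Rough size of the range = estimated partition count * mean partition size.
                      long bytes = row.getLong("partitions_count") * row.getLong("mean_partition_size");
                      System.out.printf("(%s, %s] ~ %d bytes%n",
                                        row.getString("range_start"), row.getString("range_end"), bytes);
                  }
              }
          }
      }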

      Attachments

        1. 7688.txt (12 kB, Aleksey Yeschenko)

      Issue Links

      Activity

            pkolaczk Piotr Kolaczkowski added a comment - - edited

            I double-checked that the cassandra.size_recorder_interval option does indeed work, but because many tests create fresh keyspaces/tables, an interval of 1 second is still too large and the test manages to run before the estimates are created. We'd need to put a Thread.sleep before all the tests, which we're not going to do because it would significantly increase the runtime of the whole test suite. So far we just removed the warning about missing estimates. But it would be nice if C* filled those estimate entries on table creation (even with zeroes). There is a difference between "we don't know the estimates" and "we know there is no data".


            pkolaczk Piotr Kolaczkowski added a comment -

            So what is this JMX call to refresh the estimates?
            I tried setting -Dcassandra.size_recorder_interval=1 (I know, a ridiculously low value) and the size estimates table is still empty when running the tests.

            aleksey Aleksey Yeschenko added a comment -

            There most definitely won't be a separate CQL command just for that, but when we switch this to a virtual table implementation (when we have those) it might be as simple as UPDATE-ing a boolean field in that table to trigger a recalculation.

            We could temporarily add a JMX method. Or you could set the interval to be really low for now, and add some sleep.

            I know it's a bit ugly, but it's just an interim measure.
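
            As an illustration of the interim workaround suggested above (a very low recorder interval plus a wait), an integration test could poll the estimates table with a deadline instead of sleeping a fixed amount of time. This is only a sketch: it assumes the DataStax Java driver (com.datastax.driver.core.Session and ResultSet), an already-connected session, and the system.size_estimates column names shown in the patch later in this ticket.

                    // Hypothetical test helper: wait until size estimates for a freshly created table
                    // show up, or fail after a timeout, rather than Thread.sleep-ing unconditionally.
                    void awaitSizeEstimates(Session session, String keyspace, String table) throws InterruptedException
                    {
                        long deadline = System.currentTimeMillis() + 30_000;
                        while (System.currentTimeMillis() < deadline)
                        {
                            ResultSet rs = session.execute(
                                "SELECT partitions_count FROM system.size_estimates " +
                                "WHERE keyspace_name = ? AND table_name = ?", keyspace, table);
                            if (!rs.isExhausted())
                                return;            // the recorder has populated estimates for this table
                            Thread.sleep(500);     // not there yet; retry shortly
                        }
                        throw new AssertionError("size estimates were not recorded in time");
                    }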

            pkolaczk Piotr Kolaczkowski added a comment -

            Will there be a command to manually refresh statistics of a table from CQL (like "ANALYZE TABLE ...")?
            I need a way to trigger this in an integration test, and I don't want to wait until it is automatically refreshed after the update interval...
            1. create table
            2. add data
            3. analyze
            4. check stats

            pkolaczk Piotr Kolaczkowski added a comment - - edited

            So I must have had a dump saved by an early development branch then. Thanks for the clarification.

            aleksey Aleksey Yeschenko added a comment -

            Remembered now. It was committed to 2.1.3, but population was disabled before the release. So the table is still there; it's just that there are no actual sizing dumps enabled until 2.1.5.

            aleksey Aleksey Yeschenko added a comment -

            Must be magic then.

            philipthompson Philip Thompson added a comment -

            I'm using ccm, so the data dirs are being created from scratch.

            aleksey Aleksey Yeschenko added a comment -

            You probably just have schema left from running 2.1-head.

            philipthompson Philip Thompson added a comment -

            I see the system.size_estimates table in 2.1.4, but I don't see it being populated. Are you?

            pkolaczk Piotr Kolaczkowski added a comment -

            Why is this ticket marked as fixed in 2.1.5, if I can see this working in the just-released 2.1.4?

            aleksey Aleksey Yeschenko added a comment -

            Re-enabled (for 2.1.4) with the following minor changes:

            • properly handle an empty intervalTree in DataTracker.View#sstablesInBounds (see the sketch below)
            • switch gc_grace_seconds to 0 for the system.size_estimates table
            • add a -D param to alter the estimates update interval
            • log the duration of the estimation process
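
            The first change in the list above targets the IllegalStateException reported further down in this thread, where IntervalTree.max() fails on an empty tree. The following is only a sketch of that kind of guard, not the committed patch; the method signature and the isEmpty()/search() calls are assumptions based on the stack trace quoted later.

                    // Sketch only: return no sstables when a brand-new table has an empty interval tree,
                    // instead of letting IntervalTree.max() throw IllegalStateException from
                    // SizeEstimatesRecorder (see the stack trace further down in this thread).
                    public List<SSTableReader> sstablesInBounds(AbstractBounds<RowPosition> bounds)
                    {
                        if (intervalTree.isEmpty())
                            return Collections.emptyList();
                        return intervalTree.search(Interval.<RowPosition, SSTableReader>create(bounds.left, bounds.right));
                    }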

            aleksey Aleksey Yeschenko added a comment -

            Disabled it for now with fd6f9c6f9c15ab28d0db0edef1f84faaa7ea42c5.

            I have a simple fix for DataTracker, but want to investigate it further, so pushing it to 2.1.4.

            Sorry for the inconvenience.

            mshuler Michael Shuler added a comment -

            This caused a regression:
            from: http://cassci.datastax.com/job/cassandra-2.1_novnode_dtest/470/testReport/
            to: http://cassci.datastax.com/job/cassandra-2.1_novnode_dtest/472/testReport/

            repro with the bootstrap_test.py dtest: vnodes vs no-vnodes:

            (master)mshuler@hana:~/git/cassandra-dtest$ nosetests -vs bootstrap_test.py 
            read_from_bootstrapped_node_test (bootstrap_test.TestBootstrap) ... Created keyspaces. Sleeping 1s for propagation.
            Warming up WRITE with 50000 iterations...
            INFO  19:17:03 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
            INFO  19:17:03 New Cassandra host /127.0.0.2:9042 added
            Connected to cluster: test
            Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
            INFO  19:17:03 New Cassandra host /127.0.0.3:9042 added
            Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
            Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
            INFO  19:17:03 New Cassandra host /127.0.0.1:9042 added
            Failed to connect over JMX; not collecting these stats
            Sleeping 2s...
            Running WRITE with 8 threads for 10000 iteration
            Failed to connect over JMX; not collecting these stats
            total ops , adj row/s,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr,  gc: #,  max ms,  sum ms,  sdv ms,      mb
            2403      ,      2403,    2403,    2403,    2403,     3.3,     2.0,    10.0,    16.2,    23.8,    27.3,    1.0,  0.00000,      0,       0,       0,       0,       0
            4231      ,      1806,    1806,    1806,    1806,     4.4,     2.1,    16.2,    27.0,    67.2,    72.5,    2.0,  0.00000,      0,       0,       0,       0,       0
            6796      ,      2624,    2534,    2534,    2534,     3.1,     1.9,     9.0,    14.6,    49.3,    50.5,    3.0,  0.10034,      0,       0,       0,       0,       0
            9449      ,      2684,    2627,    2627,    2627,     3.0,     1.9,     8.8,    14.5,    35.1,    36.7,    4.0,  0.08758,      0,       0,       0,       0,       0
            10000     ,      2395,    2395,    2395,    2395,     3.3,     1.8,    10.0,    26.6,    48.2,    48.2,    4.3,  0.07295,      0,       0,       0,       0,       0
            
            
            Results:
            op rate                   : 2345
            partition rate            : 2345
            row rate                  : 2345
            latency mean              : 3.4
            latency median            : 1.9
            latency 95th percentile   : 10.4
            latency 99th percentile   : 19.7
            latency 99.9th percentile : 42.6
            latency max               : 72.5
            total gc count            : 0
            total gc mb               : 0
            total gc time (s)         : 0
            avg gc time(ms)           : NaN
            stdev gc time(ms)         : 0
            Total operation time      : 00:00:04
            END
            ok
            simple_bootstrap_test (bootstrap_test.TestBootstrap) ... ok
            
            ----------------------------------------------------------------------
            Ran 2 tests in 230.646s
            
            OK
            
            (master)mshuler@hana:~/git/cassandra-dtest$ export DISABLE_VNODES=true ; nosetests -vs bootstrap_test.py 
            read_from_bootstrapped_node_test (bootstrap_test.TestBootstrap) ... Created keyspaces. Sleeping 1s for propagation.
            Warming up WRITE with 50000 iterations...
            INFO  19:21:20 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
            Connected to cluster: test
            INFO  19:21:20 New Cassandra host /127.0.0.3:9042 added
            Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
            Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
            INFO  19:21:20 New Cassandra host /127.0.0.2:9042 added
            Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
            INFO  19:21:20 New Cassandra host /127.0.0.1:9042 added
            Failed to connect over JMX; not collecting these stats
            Sleeping 2s...
            Running WRITE with 8 threads for 10000 iteration
            Failed to connect over JMX; not collecting these stats
            total ops , adj row/s,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr,  gc: #,  max ms,  sum ms,  sdv ms,      mb
            6145      ,      6143,    6143,    6143,    6143,     1.3,     0.9,     3.1,     6.9,    18.5,    34.1,    1.0,  0.00000,      0,       0,       0,       0,       0
            10000     ,      7485,    7485,    7485,    7485,     1.0,     0.7,     2.6,     4.9,    11.0,    16.3,    1.5,  0.00000,      0,       0,       0,       0,       0
            
            
            Results:
            op rate                   : 6599
            partition rate            : 6599
            row rate                  : 6599
            latency mean              : 1.2
            latency median            : 0.8
            latency 95th percentile   : 2.9
            latency 99th percentile   : 6.2
            latency 99.9th percentile : 14.9
            latency max               : 34.1
            total gc count            : 0
            total gc mb               : 0
            total gc time (s)         : 0
            avg gc time(ms)           : NaN
            stdev gc time(ms)         : 0
            Total operation time      : 00:00:01
            END
            ERROR
            simple_bootstrap_test (bootstrap_test.TestBootstrap) ... ERROR
            
            ======================================================================
            ERROR: read_from_bootstrapped_node_test (bootstrap_test.TestBootstrap)
            ----------------------------------------------------------------------
            Traceback (most recent call last):
              File "/home/mshuler/git/cassandra-dtest/dtest.py", line 426, in tearDown
                raise AssertionError('Unexpected error in %s node log: %s' % (node.name, errors))
            AssertionError: Unexpected error in node1 node log: ['ERROR [OptionalTasks:1] 2015-02-09 13:21:38,131 CassandraDaemon.java:167 - Exception in thread Thread[OptionalTasks:1,5,main]\n']
            -------------------- >> begin captured logging << --------------------
            dtest: DEBUG: cluster ccm directory: /tmp/dtest-WychZv
            cassandra.cluster: WARNING: [control connection] Error connecting to 127.0.0.4:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1891, in _reconnect_internal
                return self._try_connect(host)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1908, in _try_connect
                connection = self._cluster.connection_factory(host.address, is_control_connection=True)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 708, in connection_factory
                return self.connection_class.factory(address, *args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 162, in factory
                conn = cls(*args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 195, in __init__
                raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror))
            error: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.cluster: ERROR: Control connection failed to connect, shutting down Cluster:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 754, in connect
                self.control_connection.connect()
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1865, in connect
                self._set_new_connection(self._reconnect_internal())
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1900, in _reconnect_internal
                raise NoHostAvailable("Unable to connect to any servers", errors)
            NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.4': error(111, "Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused")})
            dtest: DEBUG: removing ccm cluster test at: /tmp/dtest-WychZv
            --------------------- >> end captured logging << ---------------------
            
            ======================================================================
            ERROR: simple_bootstrap_test (bootstrap_test.TestBootstrap)
            ----------------------------------------------------------------------
            Traceback (most recent call last):
              File "/home/mshuler/git/cassandra-dtest/dtest.py", line 426, in tearDown
                raise AssertionError('Unexpected error in %s node log: %s' % (node.name, errors))
            AssertionError: Unexpected error in node1 node log: ['ERROR [OptionalTasks:1] 2015-02-09 13:23:29,505 CassandraDaemon.java:167 - Exception in thread Thread[OptionalTasks:1,5,main]\n']
            -------------------- >> begin captured logging << --------------------
            dtest: DEBUG: cluster ccm directory: /tmp/dtest-VMfxCJ
            cassandra.cluster: WARNING: [control connection] Error connecting to 127.0.0.1:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1891, in _reconnect_internal
                return self._try_connect(host)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1908, in _try_connect
                connection = self._cluster.connection_factory(host.address, is_control_connection=True)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 708, in connection_factory
                return self.connection_class.factory(address, *args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 162, in factory
                conn = cls(*args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 195, in __init__
                raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror))
            error: [Errno 111] Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused
            cassandra.cluster: ERROR: Control connection failed to connect, shutting down Cluster:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 754, in connect
                self.control_connection.connect()
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1865, in connect
                self._set_new_connection(self._reconnect_internal())
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1900, in _reconnect_internal
                raise NoHostAvailable("Unable to connect to any servers", errors)
            NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
            cassandra.cluster: WARNING: Host 127.0.0.4 has been marked down
            cassandra.cluster: WARNING: Host 127.0.0.3 has been marked down
            cassandra.cluster: WARNING: Host 127.0.0.2 has been marked down
            cassandra.cluster: WARNING: Host 127.0.0.1 has been marked down
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 2.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 2.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 2.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 8.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 8.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 8.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 16.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 16.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 16.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 32.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 32.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 32.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.cluster: WARNING: Failed to create connection pool for new host 127.0.0.2:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1612, in run_add_or_renew_pool
                new_pool = HostConnection(host, distance, self)
              File "/home/mshuler/git/python-driver/cassandra/pool.py", line 294, in __init__
                self._connection = session.cluster.connection_factory(host.address)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 708, in connection_factory
                return self.connection_class.factory(address, *args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 162, in factory
                conn = cls(*args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 195, in __init__
                raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror))
            error: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.cluster: WARNING: Connection pool could not be created, not marking node 127.0.0.2 up
            cassandra.cluster: WARNING: Host 127.0.0.2 has been marked down
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 2.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 64.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 64.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 64.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.cluster: WARNING: Host 127.0.0.1 has been marked down
            cassandra.cluster: WARNING: Host 127.0.0.2 has been marked down
            dtest: DEBUG: removing ccm cluster test at: /tmp/dtest-VMfxCJ
            --------------------- >> end captured logging << ---------------------
            
            ----------------------------------------------------------------------
            Ran 2 tests in 220.323s
            
            FAILED (errors=2)
            

            Throws this in the node1.log:

            ERROR [OptionalTasks:1] 2015-02-09 13:23:29,505 CassandraDaemon.java:167 - Exception in thread Thread[OptionalTasks:1,5,main]
            java.lang.IllegalStateException: null
                    at org.apache.cassandra.utils.IntervalTree.max(IntervalTree.java:121) ~[main/:na]
                    at org.apache.cassandra.db.DataTracker$View.sstablesInBounds(DataTracker.java:761) ~[main/:na]
                    at org.apache.cassandra.db.ColumnFamilyStore$6.apply(ColumnFamilyStore.java:1862) ~[main/:na]
                    at org.apache.cassandra.db.ColumnFamilyStore$6.apply(ColumnFamilyStore.java:1859) ~[main/:na]
                    at org.apache.cassandra.db.SizeEstimatesRecorder.recordSizeEstimates(SizeEstimatesRecorder.java:79) ~[main/:na]
                    at org.apache.cassandra.db.SizeEstimatesRecorder.run(SizeEstimatesRecorder.java:65) ~[main/:na]
                    at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:82) ~[main/:na]
                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_76]
                    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_76]
                    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_76]
                    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_76]
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_76]
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_76]
                    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_76]
            

            aleksey Aleksey Yeschenko added a comment -

            All good points. I'll add a -D option to change the interval/disable it.

            mck Michael Semb Wever added a comment - - edited

            You are quoting the wrong code here, but how do you not background it?

            yes i can see that it's not possible at the moment. (i didn't realise that at first, but it really wasn't my train of thought either).

            When we add vtable support (cql tables backed by classes, not sstables) - then we'll switch sizing (and several other system sstables) to that.

            nice to know. thanks.

            This is a simple temporary replacement for describe_splits_ex, its only goal is to free Spark and others from having to maintain an extra Thrift connection now. Hence the lack of metrics or configurability of the refresh interval.

            I'm open to increasing/decreasing the hard-coded one, however, if you have better options.

            i have no suggestion.
            i'm more concerned/curious as to why "5 minutes"?
            if there's no good answer then aren't metrics important?
            and being able to configure it.

            quick examples that come to mind:

            • what if an installation has lots of jobs built upon each other's data, and for them there's a strong benefit (if not a requirement) in more accurate sizes (i.e. a faster schedule rate)?
            • what if there are bugs/load caused by this that could be avoided (for an installation that never uses hadoop/spark) by configuring it to zero (disabling it), giving an immediate option while upgrading to / waiting for the next version?

            aleksey Aleksey Yeschenko added a comment -

            You are quoting the wrong code here, but how do you not background it? It's not strictly about cost, it's about not having any other triggering mechanism.

            When we add vtable support (cql tables backed by classes, not sstables) - then we'll switch sizing (and several other system sstables) to that. Until then, what other options do we have?

            This is a simple temporary replacement for describe_splits_ex, its only goal is to free Spark and others from having to maintain an extra Thrift connection now. Hence the lack of metrics or configurability of the refresh interval.

            I'm open to increasing/decreasing the hard-coded one, however, if you have better options.

            mck Michael Semb Wever added a comment - - edited

            Can you please elaborate on what the idea is behind storing this info in a system table?

            I'm still curious about this question, as it wasn't about the removal of thrift (that's obvious, although it wasn't obvious that all "metadata" is only exposed via cql, e.g. ControlConnection.refreshSchema(..)), but about the reasoning for backgrounding the computation and its frequency.

                    ScheduledExecutors.optionalTasks.schedule(runnable, 5, TimeUnit.MINUTES);

            Why 5 minutes? What's the trade-off here?
            How do we (everyone) know the computation is expensive enough to warrant backgrounding it?
            And that 5 minutes will give us the best throughput (across c* and its hadoop/spark jobs)?

            a) what about putting metrics around the code in SizeEstimatesRecorder.run() so we can get an idea for future adjustments?
            (going a step further would be to get updateSizeEstimates() to diff the old rows against the new rows and have a metric on change frequency).

            b) what about making the frequency configurable? (see the sketch below)

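            One way to address question (b) above is to read the refresh period from a system property, falling back to the hard-coded 5 minutes. This is only an illustrative sketch, not the committed change: the property name is borrowed from the cassandra.size_recorder_interval flag mentioned earlier in this thread, the units (minutes) and the sizeEstimatesRecorder variable are assumptions, and the executor call simply mirrors the snippet quoted above.

                    // Sketch: allow -Dcassandra.size_recorder_interval=<minutes> to override the
                    // default 5-minute refresh period; 0 is interpreted here as "disable entirely".
                    long intervalMinutes = Long.getLong("cassandra.size_recorder_interval", 5);
                    if (intervalMinutes > 0)
                        ScheduledExecutors.optionalTasks.scheduleAtFixedRate(
                            sizeEstimatesRecorder,   // the Runnable that recomputes system.size_estimates
                            intervalMinutes, intervalMinutes, TimeUnit.MINUTES);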

            aleksey Aleksey Yeschenko added a comment -

            Committed, thanks.

            pkolaczk Piotr Kolaczkowski added a comment -

            Ok, +1 then.

            aleksey Aleksey Yeschenko added a comment -

            Since it's a single partition update, the whole thing is atomic and isolated, yes. I'm adding updates to the mutation one by one, but applying everything, including the removal of the previous state and the addition of the new data, in one go, at the mutation.apply() point.

            So long as you fetch all the ranges together in one query, you'll always have a complete state. It might be slightly out of date and lag behind (rare) topology updates for up to 5 minutes, but it'll always be internally consistent.

            pkolaczk Piotr Kolaczkowski added a comment -

            Looks good.

                    // delete all previous values with a single range tombstone.
                    mutation.deleteRange(SIZE_ESTIMATES_CF,
                                         estimatesTable.comparator.make(table).start(),
                                         estimatesTable.comparator.make(table).end(),
                                         timestamp - 1);

                    // add a CQL row for each primary token range.
                    ColumnFamily cells = mutation.addOrGet(estimatesTable);
                    for (Map.Entry<Range<Token>, Pair<Long, Long>> entry : estimates.entrySet())
                    {
                        Range<Token> range = entry.getKey();
                        Pair<Long, Long> values = entry.getValue();
                        Composite prefix = estimatesTable.comparator.make(table, range.left.toString(), range.right.toString());
                        CFRowAdder adder = new CFRowAdder(cells, prefix, timestamp);
                        adder.add("partitions_count", values.left)
                             .add("mean_partition_size", values.right);
                    }

                    mutation.apply();

            Are updates of the table atomic? I can see you delete a whole bunch of token ranges with one tombstone and then add them one by one. Is it possible to get an incomplete table when querying at the wrong moment?

            aleksey Aleksey Yeschenko added a comment -

            Additionally, to reiterate what Sylvain said - we are open to improvements in accuracy, but those aren't trivial, and should go into another ticket.

            aleksey Aleksey Yeschenko added a comment -

            Attached a simple 2.1-based patch.

            Just in case, let me repeat this: this ticket doesn't aim to radically improve, if at all, the current accuracy of describe_splits_ex. Its only explicit goal is to expose that information via CQL.

            That, in turn, is required so that the Spark connector and our native hadoop implementations aren't forced to keep both a native protocol connection and a Thrift connection for the sake of making estimation possible at all.

            This ticket is a companion to https://datastax-oss.atlassian.net/browse/JAVA-312.

            It's simple and non-invasive. However, writing a meaningful unit test for it is hard. It doesn't modify any internals in any way, and it will have some coverage on the spark-connector side. This is my attempt at justifying the lack of a unit test for the patch.

            aleksey Aleksey Yeschenko added a comment -

            Was your intent to just periodically execute this function and dump the results into system tables?

            Not this exact function, but yes, just periodically dump sizing info there.

            mbyrd Matt Byrd added a comment -

            So I suppose the reason for suggesting exposing the same call via cql
            was that at least abstractly it was clear what this meant.
            I concede that plumbing all this through might not be straightforward.

            The problem with putting it in a system table is: what exactly do you put there?

            The current computation is a somewhat expensive on-demand computation that is generally done relatively rarely.

            Was your intent to just periodically execute this function and dump the results into system tables?
            Or did you have something different in mind?


            The primary goal is to remove the dependency on Thrift, so that the spark connector and our hadoop code don't have to open an extra Thrift connection in addition to the native protocol one. See CASSANDRA-8358 for example.

            Having that, we'd be able to not start Thrift by default, and to have simpler implementations of the Spark and Hadoop integrations.

            aleksey Aleksey Yeschenko added a comment -

            I would have thought it’d be easier to just expose the Storage proxy call via cql?

            It wouldn't, unless you propose to create an extra CQL statement just for this, which is something that's not gonna happen.

            Otherwise you'd need support for virtual tables, and that's 3.1 territory at best.

            aleksey Aleksey Yeschenko added a comment -

            It would be an on-demand calculation that would be moderately expensive.

            iamaleksey If the implementation is but a rewrite, I'm also keen on mbyrd's question.

            Can you please elaborate on what the idea is behind storing this info in a system table?

            mck Michael Semb Wever added a comment -

            Fair enough. Just saying that describe_splits is pretty bad because it is not possible to set a reasonable default split size. Some users were already pointing that out in our issue tracker.

            pkolaczk Piotr Kolaczkowski added a comment -

            To be clear, the target here is hadoop/spark, and we're not looking at doing anything better than what is currently used by the thrift describe_splits. That is based on the sstable stats and, yes, can be pretty bad for some datasets, but improving it is a goal for another ticket.

            slebresne Sylvain Lebresne added a comment -

            I'm talking about estimates. We likely cannot even estimate this without pretty significant cost. Sampling column counts is pretty easy, but knowing how many cql rows there are for any merged row is not. There are tricks to make it easier, but there are datasets for which the tricks will not work, and any estimate would be complete guesswork without sampling the data.

            benedict Benedict Elliott Smith added a comment -

            We only need estimates, not exact values. Factor 1.5x error is considered an awesome estimate, factor 3x is still fairly good.
            Also note that Spark/Hadoop does many token range scans. Maybe collecting some statistics on the fly, during the scans (or during compaction), would be viable? And running a full compaction to get more accurate statistics - why not? You need to do it anyway to get top speed when scanning data in Spark, because a full table scan is doing a kind of implicit compaction anyway, isn't it?

            Also, one more thing - it would be good to have those values per column (sorry for making it even harder, I know it is not an easy task). At least to know that a column is responsible for xx% of the data in the table - knowing such a thing would make a huge difference when estimating data size, because we're not always fetching all columns and they may vary in size a lot (e.g. collections!). Some sampling on insert would probably be enough.

            pkolaczk Piotr Kolaczkowski added a comment -
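            A toy sketch of the "sampling on insert" idea for a single column: keep a fixed-size reservoir of observed serialized sizes and derive an average (and, by extension, that column's share of the table) from it. The onInsert hook and where it would be called from are hypothetical; this is not Cassandra code.

                import java.util.Random;

                public class ColumnSizeSampler
                {
                    private final long[] reservoir;
                    private long seen;
                    private final Random random = new Random();

                    public ColumnSizeSampler(int reservoirSize)
                    {
                        this.reservoir = new long[reservoirSize];
                    }

                    // called once per written column value with its serialized size in bytes
                    public void onInsert(long serializedSize)
                    {
                        seen++;
                        if (seen <= reservoir.length)
                        {
                            reservoir[(int) (seen - 1)] = serializedSize;
                        }
                        else
                        {
                            // classic reservoir sampling: keep each new value with probability k / seen
                            long slot = (long) (random.nextDouble() * seen);
                            if (slot < reservoir.length)
                                reservoir[(int) slot] = serializedSize;
                        }
                    }

                    public double meanSize()
                    {
                        int count = (int) Math.min(seen, reservoir.length);
                        if (count == 0)
                            return 0;
                        long sum = 0;
                        for (int i = 0; i < count; i++)
                            sum += reservoir[i];
                        return (double) sum / count;
                    }
                }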

            This is a fundamentally difficult problem, and to be answered accurately it basically requires a full compaction. We can track or estimate this data for any given sstable easily, and we can estimate the number of overlapping partitions between two sstables (though I'm unsure of the accuracy if we composed this data across many sstables), but we cannot say how many rows within each overlapping partition overlap. The best we could do is probably sample some overlapping partitions to see what proportion of row overlap tends to prevail, and hope it is representative; if we assume a normal distribution of the overlap ratio we could return error bounds.

            I don't think it's likely this data could be maintained live, at least not accurately, or not without significant cost. It would be an on-demand calculation that would be moderately expensive.

            benedict Benedict Elliott Smith added a comment -
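            To make the last suggestion concrete, here is a rough sketch of reporting a mean overlap ratio with error bounds from a handful of sampled overlapping partitions, under the normality assumption mentioned above. It is purely illustrative, not Cassandra code.

                public class OverlapEstimate
                {
                    // returns { lower bound, mean, upper bound } of a ~95% interval
                    public static double[] meanWithBounds(double[] sampledOverlapRatios)
                    {
                        int n = sampledOverlapRatios.length;
                        double sum = 0;
                        for (double r : sampledOverlapRatios)
                            sum += r;
                        double mean = sum / n;

                        if (n < 2)
                            return new double[] { mean, mean, mean };

                        double sqDiff = 0;
                        for (double r : sampledOverlapRatios)
                            sqDiff += (r - mean) * (r - mean);
                        double stdErr = Math.sqrt(sqDiff / (n - 1)) / Math.sqrt(n);

                        // 1.96 standard errors either side, assuming roughly normal overlap ratios
                        return new double[] { mean - 1.96 * stdErr, mean, mean + 1.96 * stdErr };
                    }
                }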
            pkolaczk Piotr Kolaczkowski added a comment - - edited

            It would also be nice to know the average partition size in the given table, both in bytes and in number of CQL rows. This would be useful for setting an appropriate fetch.size. Additionally, the current split generation API does not allow setting the split size in terms of data size in bytes or number of CQL rows, but only by number of partitions. Number of partitions doesn't make a nice default, as partitions can vary greatly in size and are extremely use-case dependent. So please, don't just copy the current describe_splits_ex functionality to the new driver, but improve this.

            We really don't need the driver / Cassandra to do the splitting for us. Instead we need to know:

            1. estimate of total amount of data in the table in bytes
            2. estimate of total number of CQL rows in the table
            3. estimate of total number of partitions in the table

            We're interested both in totals (whole cluster; logical sizes; i.e. without replicas), and split by token ranges by node (physical; including replicas); see the sketch below for combining the per-range rows into such totals.

            Note that this information is useful not just for Spark/Hadoop split generation, but also for things like e.g. the SparkSQL optimizer, so it knows how much data it will have to process, or for setting appropriate fetch sizes when getting data, etc.

            The next step would be providing column data histograms to guide predicate selectivity.
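            As a minimal sketch of deriving the totals listed above from per-range rows: collect the primary-range estimates from every node and sum them. The RangeEstimate fields mirror the proposed columns; treating the physical (with-replicas) size as roughly the logical size times the replication factor is my own assumption.

                import java.util.List;

                public class TableTotals
                {
                    public static class RangeEstimate
                    {
                        final long partitionsCount;
                        final long meanPartitionSize;

                        RangeEstimate(long partitionsCount, long meanPartitionSize)
                        {
                            this.partitionsCount = partitionsCount;
                            this.meanPartitionSize = meanPartitionSize;
                        }
                    }

                    // sum the per-primary-range rows gathered from all nodes for logical totals;
                    // multiply by the replication factor for a rough physical (with-replicas) figure
                    public static long[] logicalTotals(List<RangeEstimate> primaryRangeRows)
                    {
                        long partitions = 0;
                        long bytes = 0;
                        for (RangeEstimate e : primaryRangeRows)
                        {
                            partitions += e.partitionsCount;
                            bytes += e.partitionsCount * e.meanPartitionSize;
                        }
                        return new long[] { partitions, bytes };
                    }
                }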

            mbyrd Matt Byrd added a comment -

            Originally I was just thinking of exposing the same method available in thrift via some cql syntax, i.e.
            essentially this from StorageProxy:

                public List<Pair<Range<Token>, Long>> getSplits(String keyspaceName, String cfName, Range<Token> range, int keysPerSplit, CFMetaData metadata)

            This in turn actually operates on the index intervals in memory, getting appropriately sized splits given the samples taken.

            Can you please elaborate on what the idea is behind storing this info in a system table?
            It would seem that you would need to keep doing the above computation or something similar and write the result to a system table.
            I would have thought it’d be easier to just expose the Storage proxy call via cql?


            People

              aleksey Aleksey Yeschenko
              jjordan Jeremiah Jordan
              Aleksey Yeschenko
              Piotr Kolaczkowski