Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 2.1.5

    Description

      Currently you can't implement something similar to describe_splits_ex purely from a native protocol driver. https://datastax-oss.atlassian.net/browse/JAVA-312 is open to make it easy for a java-driver client to get ownership information, but you still need the data sizing part to compute splits of a given size. We should add the sizing information to a system table so that native clients can get to it.
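
      As an illustration of what this enables, a native-protocol client could derive splits by reading the estimates table directly. The sketch below assumes the DataStax Java driver and the system.size_estimates schema that the patch discussed further down ends up writing (one row per primary token range, with partitions_count and mean_partition_size); the contact point and keyspace/table names are placeholders.

      import com.datastax.driver.core.Cluster;
      import com.datastax.driver.core.Row;
      import com.datastax.driver.core.Session;

      public class SizeEstimatesExample
      {
          public static void main(String[] args)
          {
              try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                   Session session = cluster.connect())
              {
                  // One row per primary token range of the coordinator node for the given table.
                  for (Row row : session.execute(
                          "SELECT range_start, range_end, partitions_count, mean_partition_size " +
                          "FROM system.size_estimates WHERE keyspace_name = ? AND table_name = ?",
                          "my_keyspace", "my_table"))
                  {
                      // Rough size of the range = estimated partition count * mean partition size.
                      long bytes = row.getLong("partitions_count") * row.getLong("mean_partition_size");
                      System.out.printf("(%s, %s] ~ %d bytes%n",
                                        row.getString("range_start"), row.getString("range_end"), bytes);
                  }
              }
          }
      }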

      Attachments

        1. 7688.txt (12 kB, Aleksey Yeschenko)

      Issue Links

      Activity

            pkolaczk Piotr Kolaczkowski added a comment - - edited

            I double-checked that the cassandra.size_recorder_interval option does indeed work, but because many tests create fresh keyspaces/tables, an interval of 1 second is still too large and the test manages to run before the estimates are created. We'd need to put a Thread.sleep before all the tests, which we're not going to do because it would significantly increase the runtime of the whole test suite. So far we just removed the warning about missing estimates. But it would be nice if C* filled those estimate entries on table creation (even with zeroes). There is a difference between "we don't know the estimates" and "we know there is no data".


            pkolaczk Piotr Kolaczkowski added a comment -

            So what is this JMX call to refresh the estimates?
            I tried setting -Dcassandra.size_recorder_interval=1 (I know, a ridiculously low value) and the size estimates table is still empty when running the tests.

            aleksey Aleksey Yeschenko added a comment -

            There most definitely won't be a separate CQL command just for that, but when we switch this to a virtual table implementation (when we have those) it might be as simple as UPDATE-ing a boolean field in that table to trigger a recalculation.

            We could temporarily add a JMX method. Or you could set the interval to be really low for now, and add some sleep.

            I know it's a bit ugly, but it's just an interim measure.
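
            As an illustration of the interim workaround suggested above (a very low recorder interval plus a wait), an integration test could poll the estimates table with a deadline instead of sleeping a fixed amount of time. This is only a sketch: it assumes the DataStax Java driver (com.datastax.driver.core.Session and ResultSet), an already-connected session, and the system.size_estimates column names shown in the patch later in this ticket.

                    // Hypothetical test helper: wait until size estimates for a freshly created table
                    // show up, or fail after a timeout, rather than Thread.sleep-ing unconditionally.
                    void awaitSizeEstimates(Session session, String keyspace, String table) throws InterruptedException
                    {
                        long deadline = System.currentTimeMillis() + 30_000;
                        while (System.currentTimeMillis() < deadline)
                        {
                            ResultSet rs = session.execute(
                                "SELECT partitions_count FROM system.size_estimates " +
                                "WHERE keyspace_name = ? AND table_name = ?", keyspace, table);
                            if (!rs.isExhausted())
                                return;            // the recorder has populated estimates for this table
                            Thread.sleep(500);     // not there yet; retry shortly
                        }
                        throw new AssertionError("size estimates were not recorded in time");
                    }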

            pkolaczk Piotr Kolaczkowski added a comment -

            Will there be a command to manually refresh statistics of a table from CQL (like "ANALYZE TABLE ...")?
            I need a way to trigger this in an integration test, and I don't want to wait until it is automatically refreshed after the update interval...
            1. create table
            2. add data
            3. analyze
            4. check stats

            pkolaczk Piotr Kolaczkowski added a comment - - edited

            So I must have had a dump saved by an early development branch then. Thanks for the clarification.

            aleksey Aleksey Yeschenko added a comment -

            Remembered now. It was committed to 2.1.3, but population was disabled before the release. So the table is still there; it's just that there are no actual sizing dumps enabled until 2.1.5.

            aleksey Aleksey Yeschenko added a comment -

            Must be magic then.

            philipthompson Philip Thompson added a comment -

            I'm using ccm, so the data dirs are being created from scratch.

            aleksey Aleksey Yeschenko added a comment -

            You probably just have schema left from running 2.1-head.

            philipthompson Philip Thompson added a comment -

            I see the system.size_estimates table in 2.1.4, but I don't see it being populated. Are you?

            pkolaczk Piotr Kolaczkowski added a comment -

            Why is this ticket marked as fixed in 2.1.5, if I can see this working in the just-released 2.1.4?

            aleksey Aleksey Yeschenko added a comment -

            Re-enabled (for 2.1.4) with the following minor changes:

            • properly handle an empty intervalTree in DataTracker.View#sstablesInBounds (see the sketch below)
            • switch gc_grace_seconds to 0 for the system.size_estimates table
            • add a -D param to alter the estimates update interval
            • log the duration of the estimation process
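
            The first change in the list above targets the IllegalStateException reported further down in this thread, where IntervalTree.max() fails on an empty tree. The following is only a sketch of that kind of guard, not the committed patch; the method signature and the isEmpty()/search() calls are assumptions based on the stack trace quoted later.

                    // Sketch only: return no sstables when a brand-new table has an empty interval tree,
                    // instead of letting IntervalTree.max() throw IllegalStateException from
                    // SizeEstimatesRecorder (see the stack trace further down in this thread).
                    public List<SSTableReader> sstablesInBounds(AbstractBounds<RowPosition> bounds)
                    {
                        if (intervalTree.isEmpty())
                            return Collections.emptyList();
                        return intervalTree.search(Interval.<RowPosition, SSTableReader>create(bounds.left, bounds.right));
                    }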

            aleksey Aleksey Yeschenko added a comment -

            Disabled it for now with fd6f9c6f9c15ab28d0db0edef1f84faaa7ea42c5.

            I have a simple fix for DataTracker, but want to investigate it further, so pushing it to 2.1.4.

            Sorry for the inconvenience.

            mshuler Michael Shuler added a comment -

            This caused a regression:
            from: http://cassci.datastax.com/job/cassandra-2.1_novnode_dtest/470/testReport/
            to: http://cassci.datastax.com/job/cassandra-2.1_novnode_dtest/472/testReport/

            repro with the bootstrap_test.py dtest: vnodes vs no-vnodes:

            (master)mshuler@hana:~/git/cassandra-dtest$ nosetests -vs bootstrap_test.py 
            read_from_bootstrapped_node_test (bootstrap_test.TestBootstrap) ... Created keyspaces. Sleeping 1s for propagation.
            Warming up WRITE with 50000 iterations...
            INFO  19:17:03 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
            INFO  19:17:03 New Cassandra host /127.0.0.2:9042 added
            Connected to cluster: test
            Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
            INFO  19:17:03 New Cassandra host /127.0.0.3:9042 added
            Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
            Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
            INFO  19:17:03 New Cassandra host /127.0.0.1:9042 added
            Failed to connect over JMX; not collecting these stats
            Sleeping 2s...
            Running WRITE with 8 threads for 10000 iteration
            Failed to connect over JMX; not collecting these stats
            total ops , adj row/s,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr,  gc: #,  max ms,  sum ms,  sdv ms,      mb
            2403      ,      2403,    2403,    2403,    2403,     3.3,     2.0,    10.0,    16.2,    23.8,    27.3,    1.0,  0.00000,      0,       0,       0,       0,       0
            4231      ,      1806,    1806,    1806,    1806,     4.4,     2.1,    16.2,    27.0,    67.2,    72.5,    2.0,  0.00000,      0,       0,       0,       0,       0
            6796      ,      2624,    2534,    2534,    2534,     3.1,     1.9,     9.0,    14.6,    49.3,    50.5,    3.0,  0.10034,      0,       0,       0,       0,       0
            9449      ,      2684,    2627,    2627,    2627,     3.0,     1.9,     8.8,    14.5,    35.1,    36.7,    4.0,  0.08758,      0,       0,       0,       0,       0
            10000     ,      2395,    2395,    2395,    2395,     3.3,     1.8,    10.0,    26.6,    48.2,    48.2,    4.3,  0.07295,      0,       0,       0,       0,       0
            
            
            Results:
            op rate                   : 2345
            partition rate            : 2345
            row rate                  : 2345
            latency mean              : 3.4
            latency median            : 1.9
            latency 95th percentile   : 10.4
            latency 99th percentile   : 19.7
            latency 99.9th percentile : 42.6
            latency max               : 72.5
            total gc count            : 0
            total gc mb               : 0
            total gc time (s)         : 0
            avg gc time(ms)           : NaN
            stdev gc time(ms)         : 0
            Total operation time      : 00:00:04
            END
            ok
            simple_bootstrap_test (bootstrap_test.TestBootstrap) ... ok
            
            ----------------------------------------------------------------------
            Ran 2 tests in 230.646s
            
            OK
            
            (master)mshuler@hana:~/git/cassandra-dtest$ export DISABLE_VNODES=true ; nosetests -vs bootstrap_test.py 
            read_from_bootstrapped_node_test (bootstrap_test.TestBootstrap) ... Created keyspaces. Sleeping 1s for propagation.
            Warming up WRITE with 50000 iterations...
            INFO  19:21:20 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
            Connected to cluster: test
            INFO  19:21:20 New Cassandra host /127.0.0.3:9042 added
            Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
            Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
            INFO  19:21:20 New Cassandra host /127.0.0.2:9042 added
            Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
            INFO  19:21:20 New Cassandra host /127.0.0.1:9042 added
            Failed to connect over JMX; not collecting these stats
            Sleeping 2s...
            Running WRITE with 8 threads for 10000 iteration
            Failed to connect over JMX; not collecting these stats
            total ops , adj row/s,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr,  gc: #,  max ms,  sum ms,  sdv ms,      mb
            6145      ,      6143,    6143,    6143,    6143,     1.3,     0.9,     3.1,     6.9,    18.5,    34.1,    1.0,  0.00000,      0,       0,       0,       0,       0
            10000     ,      7485,    7485,    7485,    7485,     1.0,     0.7,     2.6,     4.9,    11.0,    16.3,    1.5,  0.00000,      0,       0,       0,       0,       0
            
            
            Results:
            op rate                   : 6599
            partition rate            : 6599
            row rate                  : 6599
            latency mean              : 1.2
            latency median            : 0.8
            latency 95th percentile   : 2.9
            latency 99th percentile   : 6.2
            latency 99.9th percentile : 14.9
            latency max               : 34.1
            total gc count            : 0
            total gc mb               : 0
            total gc time (s)         : 0
            avg gc time(ms)           : NaN
            stdev gc time(ms)         : 0
            Total operation time      : 00:00:01
            END
            ERROR
            simple_bootstrap_test (bootstrap_test.TestBootstrap) ... ERROR
            
            ======================================================================
            ERROR: read_from_bootstrapped_node_test (bootstrap_test.TestBootstrap)
            ----------------------------------------------------------------------
            Traceback (most recent call last):
              File "/home/mshuler/git/cassandra-dtest/dtest.py", line 426, in tearDown
                raise AssertionError('Unexpected error in %s node log: %s' % (node.name, errors))
            AssertionError: Unexpected error in node1 node log: ['ERROR [OptionalTasks:1] 2015-02-09 13:21:38,131 CassandraDaemon.java:167 - Exception in thread Thread[OptionalTasks:1,5,main]\n']
            -------------------- >> begin captured logging << --------------------
            dtest: DEBUG: cluster ccm directory: /tmp/dtest-WychZv
            cassandra.cluster: WARNING: [control connection] Error connecting to 127.0.0.4:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1891, in _reconnect_internal
                return self._try_connect(host)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1908, in _try_connect
                connection = self._cluster.connection_factory(host.address, is_control_connection=True)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 708, in connection_factory
                return self.connection_class.factory(address, *args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 162, in factory
                conn = cls(*args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 195, in __init__
                raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror))
            error: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.cluster: ERROR: Control connection failed to connect, shutting down Cluster:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 754, in connect
                self.control_connection.connect()
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1865, in connect
                self._set_new_connection(self._reconnect_internal())
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1900, in _reconnect_internal
                raise NoHostAvailable("Unable to connect to any servers", errors)
            NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.4': error(111, "Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused")})
            dtest: DEBUG: removing ccm cluster test at: /tmp/dtest-WychZv
            --------------------- >> end captured logging << ---------------------
            
            ======================================================================
            ERROR: simple_bootstrap_test (bootstrap_test.TestBootstrap)
            ----------------------------------------------------------------------
            Traceback (most recent call last):
              File "/home/mshuler/git/cassandra-dtest/dtest.py", line 426, in tearDown
                raise AssertionError('Unexpected error in %s node log: %s' % (node.name, errors))
            AssertionError: Unexpected error in node1 node log: ['ERROR [OptionalTasks:1] 2015-02-09 13:23:29,505 CassandraDaemon.java:167 - Exception in thread Thread[OptionalTasks:1,5,main]\n']
            -------------------- >> begin captured logging << --------------------
            dtest: DEBUG: cluster ccm directory: /tmp/dtest-VMfxCJ
            cassandra.cluster: WARNING: [control connection] Error connecting to 127.0.0.1:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1891, in _reconnect_internal
                return self._try_connect(host)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1908, in _try_connect
                connection = self._cluster.connection_factory(host.address, is_control_connection=True)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 708, in connection_factory
                return self.connection_class.factory(address, *args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 162, in factory
                conn = cls(*args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 195, in __init__
                raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror))
            error: [Errno 111] Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused
            cassandra.cluster: ERROR: Control connection failed to connect, shutting down Cluster:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 754, in connect
                self.control_connection.connect()
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1865, in connect
                self._set_new_connection(self._reconnect_internal())
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1900, in _reconnect_internal
                raise NoHostAvailable("Unable to connect to any servers", errors)
            NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
            cassandra.cluster: WARNING: Host 127.0.0.4 has been marked down
            cassandra.cluster: WARNING: Host 127.0.0.3 has been marked down
            cassandra.cluster: WARNING: Host 127.0.0.2 has been marked down
            cassandra.cluster: WARNING: Host 127.0.0.1 has been marked down
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 2.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 2.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 2.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 8.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 8.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 8.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 16.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 16.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 16.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 32.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 32.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 32.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.cluster: WARNING: Failed to create connection pool for new host 127.0.0.2:
            Traceback (most recent call last):
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 1612, in run_add_or_renew_pool
                new_pool = HostConnection(host, distance, self)
              File "/home/mshuler/git/python-driver/cassandra/pool.py", line 294, in __init__
                self._connection = session.cluster.connection_factory(host.address)
              File "/home/mshuler/git/python-driver/cassandra/cluster.py", line 708, in connection_factory
                return self.connection_class.factory(address, *args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 162, in factory
                conn = cls(*args, **kwargs)
              File "/home/mshuler/git/python-driver/cassandra/io/asyncorereactor.py", line 195, in __init__
                raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror))
            error: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.cluster: WARNING: Connection pool could not be created, not marking node 127.0.0.2 up
            cassandra.cluster: WARNING: Host 127.0.0.2 has been marked down
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 2.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.4, scheduling retry in 64.0 seconds: [Errno 111] Tried connecting to [('127.0.0.4', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.3, scheduling retry in 64.0 seconds: [Errno 111] Tried connecting to [('127.0.0.3', 9042)]. Last error: Connection refused
            cassandra.pool: WARNING: Error attempting to reconnect to 127.0.0.2, scheduling retry in 64.0 seconds: [Errno 111] Tried connecting to [('127.0.0.2', 9042)]. Last error: Connection refused
            cassandra.cluster: WARNING: Host 127.0.0.1 has been marked down
            cassandra.cluster: WARNING: Host 127.0.0.2 has been marked down
            dtest: DEBUG: removing ccm cluster test at: /tmp/dtest-VMfxCJ
            --------------------- >> end captured logging << ---------------------
            
            ----------------------------------------------------------------------
            Ran 2 tests in 220.323s
            
            FAILED (errors=2)
            

            Throws this in the node1.log:

            ERROR [OptionalTasks:1] 2015-02-09 13:23:29,505 CassandraDaemon.java:167 - Exception in thread Thread[OptionalTasks:1,5,main]
            java.lang.IllegalStateException: null
                    at org.apache.cassandra.utils.IntervalTree.max(IntervalTree.java:121) ~[main/:na]
                    at org.apache.cassandra.db.DataTracker$View.sstablesInBounds(DataTracker.java:761) ~[main/:na]
                    at org.apache.cassandra.db.ColumnFamilyStore$6.apply(ColumnFamilyStore.java:1862) ~[main/:na]
                    at org.apache.cassandra.db.ColumnFamilyStore$6.apply(ColumnFamilyStore.java:1859) ~[main/:na]
                    at org.apache.cassandra.db.SizeEstimatesRecorder.recordSizeEstimates(SizeEstimatesRecorder.java:79) ~[main/:na]
                    at org.apache.cassandra.db.SizeEstimatesRecorder.run(SizeEstimatesRecorder.java:65) ~[main/:na]
                    at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:82) ~[main/:na]
                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_76]
                    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_76]
                    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_76]
                    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_76]
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_76]
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_76]
                    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_76]
            

            aleksey Aleksey Yeschenko added a comment -

            All good points. I'll add a -D option to change the interval/disable it.

            mck Michael Semb Wever added a comment - - edited

            You are quoting the wrong code here, but how do you not background it?

            yes i can see that it's not possible at the moment. (i didn't realise that at first, but it really wasn't my train of thought either).

            When we add vtable support (cql tables backed by classes, not sstables) - then we'll switch sizing (and several other system sstables) to that.

            nice to know. thanks.

            This is a simple temporary replacement for describe_splits_ex, its only goal is to free Spark and others from having to maintain an extra Thrift connection now. Hence the lack of metrics or configurability of the refresh interval.

            I'm open to increasing/decreasing the hard-coded one, however, if you have better options.

            i have no suggestion.
            i'm more concerned/curious as to why "5 minutes"?
            if there's no good answer then aren't metrics important?
            and being able to configure it.

            quick examples that come to mind:

            • what if an installation has lots of jobs built upon each other's data, and for them there's a strong benefit (if not a requirement) in more accurate sizes (i.e. a faster schedule rate)?
            • what if there are bugs/load caused by this that could be avoided (for an installation that never uses hadoop/spark) by configuring it to zero (disabling it), giving an immediate option while upgrading to / waiting for the next version?

            aleksey Aleksey Yeschenko added a comment -

            You are quoting the wrong code here, but how do you not background it? It's not strictly about cost, it's about not having any other triggering mechanism.

            When we add vtable support (cql tables backed by classes, not sstables) - then we'll switch sizing (and several other system sstables) to that. Until then, what other options do we have?

            This is a simple temporary replacement for describe_splits_ex, its only goal is to free Spark and others from having to maintain an extra Thrift connection now. Hence the lack of metrics or configurability of the refresh interval.

            I'm open to increasing/decreasing the hard-coded one, however, if you have better options.

            mck Michael Semb Wever added a comment - - edited

            Can you please elaborate on what the idea is behind storing this info in a system table?

            I'm still curious about this question, as it wasn't about the removal of thrift (that's obvious, although it wasn't obvious that all "metadata" is only exposed via cql, e.g. ControlConnection.refreshSchema(..)), but about the reasoning for backgrounding the computation and its frequency.

                    ScheduledExecutors.optionalTasks.schedule(runnable, 5, TimeUnit.MINUTES);

            Why 5 minutes? What's the trade-off here?
            How do we (everyone) know the computation is expensive enough to warrant backgrounding it?
            And that 5 minutes will give us the best throughput (across c* and its hadoop/spark jobs)?

            a) what about putting metrics around the code in SizeEstimatesRecorder.run() so we can get an idea for future adjustments?
            (going a step further would be to get updateSizeEstimates() to diff the old rows against the new rows and have a metric on change frequency).

            b) what about making the frequency configurable? (see the sketch below)

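            One way to address question (b) above is to read the refresh period from a system property, falling back to the hard-coded 5 minutes. This is only an illustrative sketch, not the committed change: the property name is borrowed from the cassandra.size_recorder_interval flag mentioned earlier in this thread, the units (minutes) and the sizeEstimatesRecorder variable are assumptions, and the executor call simply mirrors the snippet quoted above.

                    // Sketch: allow -Dcassandra.size_recorder_interval=<minutes> to override the
                    // default 5-minute refresh period; 0 is interpreted here as "disable entirely".
                    long intervalMinutes = Long.getLong("cassandra.size_recorder_interval", 5);
                    if (intervalMinutes > 0)
                        ScheduledExecutors.optionalTasks.scheduleAtFixedRate(
                            sizeEstimatesRecorder,   // the Runnable that recomputes system.size_estimates
                            intervalMinutes, intervalMinutes, TimeUnit.MINUTES);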

            aleksey Aleksey Yeschenko added a comment -

            Committed, thanks.

            pkolaczk Piotr Kolaczkowski added a comment -

            Ok, +1 then.

            aleksey Aleksey Yeschenko added a comment -

            Since it's a single partition update, the whole thing is atomic and isolated, yes. I'm adding updates to the mutation one by one, but applying everything, including the removal of the previous state and the addition of the new data, in one go, at the mutation.apply() point.

            So long as you fetch all the ranges together in one query, you'll always have a complete state. It might be slightly out of date and lag behind (rare) topology updates for up to 5 minutes, but it'll always be internally consistent.

            pkolaczk Piotr Kolaczkowski added a comment -

            Looks good.

                    // delete all previous values with a single range tombstone.
                    mutation.deleteRange(SIZE_ESTIMATES_CF,
                                         estimatesTable.comparator.make(table).start(),
                                         estimatesTable.comparator.make(table).end(),
                                         timestamp - 1);

                    // add a CQL row for each primary token range.
                    ColumnFamily cells = mutation.addOrGet(estimatesTable);
                    for (Map.Entry<Range<Token>, Pair<Long, Long>> entry : estimates.entrySet())
                    {
                        Range<Token> range = entry.getKey();
                        Pair<Long, Long> values = entry.getValue();
                        Composite prefix = estimatesTable.comparator.make(table, range.left.toString(), range.right.toString());
                        CFRowAdder adder = new CFRowAdder(cells, prefix, timestamp);
                        adder.add("partitions_count", values.left)
                             .add("mean_partition_size", values.right);
                    }

                    mutation.apply();

            Are updates of the table atomic? I can see you delete a whole bunch of token ranges with one tombstone and then add them one by one. Is it possible to get an incomplete table when querying at the wrong moment?

            aleksey Aleksey Yeschenko added a comment -

            Additionally, to reiterate what Sylvain said - we are open to improvements in accuracy, but those aren't trivial, and should go into another ticket.

            aleksey Aleksey Yeschenko added a comment -

            Attached a simple 2.1-based patch.

            Just in case, let me repeat this: this ticket doesn't aim to radically improve, if at all, the current accuracy of describe_splits_ex. Its only explicit goal is to expose that information via CQL.

            That, in turn, is required so that the Spark connector and our native hadoop implementations aren't forced to keep both a native protocol connection and a Thrift connection for the sake of making estimation possible at all.

            This ticket is a companion to https://datastax-oss.atlassian.net/browse/JAVA-312.

            It's simple and non-invasive. However, writing a meaningful unit test for it is hard. It doesn't modify any internals in any way, and it will have some coverage on the spark-connector side. This is my attempt at justifying the lack of a unit test for the patch.

            aleksey Aleksey Yeschenko added a comment -

            Was your intent to just periodically execute this function and dump the results into system tables?

            Not this exact function, but yes, just periodically dump sizing info there.

            mbyrd Matt Byrd added a comment -

            So I suppose the reason for suggesting exposing the same call via cql
            was that at least abstractly it was clear what this meant.
            I concede that plumbing all this through might not be straightforward.

            The problem with putting it in a system table is: what exactly do you put there?

            The current computation is a somewhat expensive on-demand computation that is generally done relatively rarely.

            Was your intent to just periodically execute this function and dump the results into system tables?
            Or did you have something different in mind?


            The primary goal is to remove the dependency on Thrift, so that the spark connector and our hadoop code don't have to open an extra Thrift connection in addition to the native protocol one. See CASSANDRA-8358 for example.

            Having that, we'd be able to not start Thrift by default, and to have simpler implementations of the Spark and Hadoop integrations.

            aleksey Aleksey Yeschenko added a comment -

            I would have thought it’d be easier to just expose the Storage proxy call via cql?

            It wouldn't, unless you propose to create an extra CQL statement just for this, which is something that's not gonna happen.

            Otherwise you'd need support for virtual tables, and that's 3.1 territory at best.

            aleksey Aleksey Yeschenko added a comment -

            It would be an on-demand calculation that would be moderately expensive.

            iamaleksey If the implementation is but a rewrite, I'm also keen on mbyrd's question.

            Can you please elaborate on what the idea is behind storing this info in a system table?

            mck Michael Semb Wever added a comment -

            Fair enough. Just saying that describe_splits is pretty bad because it is not possible to set a reasonable default split size. Some users were already pointing that out in our issue tracker.

            pkolaczk Piotr Kolaczkowski added a comment -

            To be clear, the target here is hadoop/spark, and we're not looking at doing anything better than what is currently used by the thrift describe_splits. That is based on the sstable stats and, yes, can be pretty bad for some datasets, but improving it is a goal for another ticket.

            slebresne Sylvain Lebresne added a comment -

            I'm talking about estimates. We likely cannot even estimate this without pretty significant cost. Sampling column counts is pretty easy, but knowing how many cql rows there are for any merged row is not. There are tricks to make it easier, but there are datasets for which the tricks will not work, and any estimate would be complete guesswork without sampling the data.

            benedict Benedict Elliott Smith added a comment -

            We only need estimates, not exact values. Factor 1.5x error is considered an awesome estimate, factor 3x is still fairly good.
            Also note that Spark/Hadoop does many token range scans. Maybe collecting some statistics on the fly, during the scans (or during compaction), would be viable? And running a full compaction to get more accurate statistics - why not? You need to do it anyway to get top speed when scanning data in Spark, because a full table scan is doing a kind of implicit compaction anyway, isn't it?

            Also, one more thing - it would be good to have those values per column (sorry for making it even harder, I know it is not an easy task). At least to know that a column is responsible for xx% of the data in the table - knowing such a thing would make a huge difference when estimating data size, because we're not always fetching all columns and they may vary in size a lot (e.g. collections!). Some sampling on insert would probably be enough.

            pkolaczk Piotr Kolaczkowski added a comment -
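            A toy sketch of the "sampling on insert" idea for a single column: keep a fixed-size reservoir of observed serialized sizes and derive an average (and, by extension, that column's share of the table) from it. The onInsert hook and where it would be called from are hypothetical; this is not Cassandra code.

                import java.util.Random;

                public class ColumnSizeSampler
                {
                    private final long[] reservoir;
                    private long seen;
                    private final Random random = new Random();

                    public ColumnSizeSampler(int reservoirSize)
                    {
                        this.reservoir = new long[reservoirSize];
                    }

                    // called once per written column value with its serialized size in bytes
                    public void onInsert(long serializedSize)
                    {
                        seen++;
                        if (seen <= reservoir.length)
                        {
                            reservoir[(int) (seen - 1)] = serializedSize;
                        }
                        else
                        {
                            // classic reservoir sampling: keep each new value with probability k / seen
                            long slot = (long) (random.nextDouble() * seen);
                            if (slot < reservoir.length)
                                reservoir[(int) slot] = serializedSize;
                        }
                    }

                    public double meanSize()
                    {
                        int count = (int) Math.min(seen, reservoir.length);
                        if (count == 0)
                            return 0;
                        long sum = 0;
                        for (int i = 0; i < count; i++)
                            sum += reservoir[i];
                        return (double) sum / count;
                    }
                }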

            This is a fundamentally difficult problem, and to be answered accurately it basically requires a full compaction. We can track or estimate this data for any given sstable easily, and we can estimate the number of overlapping partitions between two sstables (though I'm unsure of the accuracy if we composed this data across many sstables), but we cannot say how many rows within each overlapping partition overlap. The best we could do is probably sample some overlapping partitions to see what proportion of row overlap tends to prevail, and hope it is representative; if we assume a normal distribution of the overlap ratio we could return error bounds.

            I don't think it's likely this data could be maintained live, at least not accurately, or not without significant cost. It would be an on-demand calculation that would be moderately expensive.

            benedict Benedict Elliott Smith added a comment -
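            To make the last suggestion concrete, here is a rough sketch of reporting a mean overlap ratio with error bounds from a handful of sampled overlapping partitions, under the normality assumption mentioned above. It is purely illustrative, not Cassandra code.

                public class OverlapEstimate
                {
                    // returns { lower bound, mean, upper bound } of a ~95% interval
                    public static double[] meanWithBounds(double[] sampledOverlapRatios)
                    {
                        int n = sampledOverlapRatios.length;
                        double sum = 0;
                        for (double r : sampledOverlapRatios)
                            sum += r;
                        double mean = sum / n;

                        if (n < 2)
                            return new double[] { mean, mean, mean };

                        double sqDiff = 0;
                        for (double r : sampledOverlapRatios)
                            sqDiff += (r - mean) * (r - mean);
                        double stdErr = Math.sqrt(sqDiff / (n - 1)) / Math.sqrt(n);

                        // 1.96 standard errors either side, assuming roughly normal overlap ratios
                        return new double[] { mean - 1.96 * stdErr, mean, mean + 1.96 * stdErr };
                    }
                }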
            pkolaczk Piotr Kolaczkowski added a comment - - edited

            It would also be nice to know the average partition size in the given table, both in bytes and in number of CQL rows. This would be useful for setting an appropriate fetch.size. Additionally, the current split generation API does not allow setting the split size in terms of data size in bytes or number of CQL rows, but only by number of partitions. Number of partitions doesn't make a nice default, as partitions can vary greatly in size and are extremely use-case dependent. So please, don't just copy the current describe_splits_ex functionality to the new driver, but improve this.

            We really don't need the driver / Cassandra to do the splitting for us. Instead we need to know:

            1. estimate of total amount of data in the table in bytes
            2. estimate of total number of CQL rows in the table
            3. estimate of total number of partitions in the table

            We're interested both in totals (whole cluster; logical sizes; i.e. without replicas), and split by token ranges by node (physical; including replicas); see the sketch below for combining the per-range rows into such totals.

            Note that this information is useful not just for Spark/Hadoop split generation, but also for things like e.g. the SparkSQL optimizer, so it knows how much data it will have to process, or for setting appropriate fetch sizes when getting data, etc.

            The next step would be providing column data histograms to guide predicate selectivity.
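            As a minimal sketch of deriving the totals listed above from per-range rows: collect the primary-range estimates from every node and sum them. The RangeEstimate fields mirror the proposed columns; treating the physical (with-replicas) size as roughly the logical size times the replication factor is my own assumption.

                import java.util.List;

                public class TableTotals
                {
                    public static class RangeEstimate
                    {
                        final long partitionsCount;
                        final long meanPartitionSize;

                        RangeEstimate(long partitionsCount, long meanPartitionSize)
                        {
                            this.partitionsCount = partitionsCount;
                            this.meanPartitionSize = meanPartitionSize;
                        }
                    }

                    // sum the per-primary-range rows gathered from all nodes for logical totals;
                    // multiply by the replication factor for a rough physical (with-replicas) figure
                    public static long[] logicalTotals(List<RangeEstimate> primaryRangeRows)
                    {
                        long partitions = 0;
                        long bytes = 0;
                        for (RangeEstimate e : primaryRangeRows)
                        {
                            partitions += e.partitionsCount;
                            bytes += e.partitionsCount * e.meanPartitionSize;
                        }
                        return new long[] { partitions, bytes };
                    }
                }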

            mbyrd Matt Byrd added a comment -

            Originally I was just thinking of exposing the same method available in thrift via some cql syntax, i.e.
            essentially this from StorageProxy:

                public List<Pair<Range<Token>, Long>> getSplits(String keyspaceName, String cfName, Range<Token> range, int keysPerSplit, CFMetaData metadata)

            This in turn actually operates on the index intervals in memory, getting appropriately sized splits given the samples taken.

            Can you please elaborate on what the idea is behind storing this info in a system table?
            It would seem that you would need to keep doing the above computation or something similar and write the result to a system table.
            I would have thought it’d be easier to just expose the Storage proxy call via cql?


            People

              aleksey Aleksey Yeschenko
              jjordan Jeremiah Jordan
              Aleksey Yeschenko
              Piotr Kolaczkowski