  2. CASSANDRA-9061

Add backoff and recovery to cqlsh COPY FROM when write timeouts occur

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Low
    • Resolution: Duplicate
    • None
    • Component/s: Legacy/Tools

    Description

      Previous versions of COPY FROM didn't handle write timeouts because they were rarely fast enough for timeouts to matter. Now that performance has improved, write timeouts are more likely to occur. We should handle them by backing off and retrying the operation.
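
A minimal sketch of the back-off-and-retry idea, not the code from either attached patch. `WriteTimeout`, `retry_with_backoff`, and every parameter name are hypothetical stand-ins; a real change would catch the driver's own timeout error inside cqlsh's import loop.

```python
import random
import time


class WriteTimeout(Exception):
    """Stand-in for the driver's write timeout error (hypothetical name)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call operation(); on WriteTimeout, sleep and retry with a growing delay."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except WriteTimeout:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the timeout to the caller
            # Exponential backoff capped at max_delay, with jitter so that
            # parallel writers do not all retry at the same instant.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` argument is injectable only so the delay schedule can be checked without actually waiting.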

      Attachments

        1. 9061-2.1.txt
          6 kB
          Carl Yeksigian
        2. 9061-suggested.txt
          8 kB
          Tom Hobbs

          Activity

            carlyeks Carl Yeksigian added a comment -

            Adds retries and sleeping when statements time out.

            Since we want to retry the statements that have timed out, we keep all of the in-flight query messages around so they can be re-sent on timeout. If a query succeeds, we just drop its message.

            We also need to be careful when in-flight queries try to reuse previously used stream ids, so the patch reaps the successful queries first and then retries the timed-out ones.
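
The ordering concern above can be illustrated roughly as follows. This is a sketch under assumptions, not the contents of 9061-2.1.txt: `process_responses`, the `send` hook, and the `(stream_id, timed_out)` response shape are all hypothetical.

```python
import time


def process_responses(in_flight, responses, send, sleep=time.sleep, delay=0.5):
    """Handle one batch of responses for tracked in-flight statements.

    in_flight: dict of stream id -> statement still awaiting a response
    responses: iterable of (stream_id, timed_out) pairs
    send:      callable(statement) -> new stream id (hypothetical transport hook)
    """
    timed_out = []
    for stream_id, failed in responses:
        statement = in_flight.pop(stream_id)  # the stream id is now free
        if failed:
            timed_out.append(statement)
        # Successful statements are simply dropped.

    # Every response in the batch has been reaped at this point, so a retry
    # that reuses a freed stream id cannot overwrite an entry that is still
    # waiting to be processed.
    if timed_out:
        sleep(delay)  # back off before putting more load on the cluster
    for statement in timed_out:
        in_flight[send(statement)] = statement
    return len(timed_out)
```

If retries were sent while successes were still sitting in `in_flight`, a reused stream id could replace the entry of a finished query, and the retried statement would later be dropped as if it had succeeded.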

            thobbs Tom Hobbs added a comment -

            I think we can take a somewhat simpler approach. I've attached a patch that is untested but demonstrates roughly what I'm thinking of. (Most of the diff is just an indentation change.)

            I don't think we really need to track in-progress or successful operations. We can rely on the connection's in_flight count to know if there are in-progress operations. Successful operations don't require any further action. We don't need to track the request ID, because it's automatically released by the connection when it gets a response, and we can easily get a new one. Am I missing something in my suggested approach?
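
A toy illustration of the simpler approach, assuming a connection object that exposes an `in_flight` counter and releases the request slot itself when a response arrives. `FakeConnection`, `copy_rows`, and the callback shape are hypothetical and do not come from 9061-suggested.txt.

```python
from collections import deque


class FakeConnection:
    """Toy stand-in for a driver connection; only in_flight mirrors the discussion."""

    def __init__(self, timeouts_per_statement=1):
        self.in_flight = 0
        self._pending = deque()
        self._timeouts_left = {}
        self._timeouts_per_statement = timeouts_per_statement

    def send(self, statement, on_result):
        self.in_flight += 1
        self._timeouts_left.setdefault(statement, self._timeouts_per_statement)
        self._pending.append((statement, on_result))

    def poll(self):
        """Deliver one response; the request slot is released automatically."""
        statement, on_result = self._pending.popleft()
        self.in_flight -= 1
        timed_out = self._timeouts_left[statement] > 0
        if timed_out:
            self._timeouts_left[statement] -= 1
        on_result(statement, timed_out)


def copy_rows(conn, statements):
    """Send every statement, re-sending on timeout; return the retry count."""
    retries = 0

    def on_result(statement, timed_out):
        nonlocal retries
        if timed_out:
            retries += 1
            conn.send(statement, on_result)  # a real client would sleep first
        # Success needs no further action and nothing is recorded for it.

    for statement in statements:
        conn.send(statement, on_result)
    while conn.in_flight:  # the counter alone tells us whether work remains
        conn.poll()
    return retries
```

No per-request bookkeeping is kept here: the only state is the statement captured by the callback and the connection's own counter.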

            I think we're going to need a dtest to exercise this. You could start from cqlsh_tests.TestCqlsh.test_copy_to(), setting a low enough write timeout that a large number of operations fail.

            stefania Stefania Alborghetti added a comment -

            We'll add back-off and recovery in CASSANDRA-9302, closing this as a duplicate.

            People

              carlyeks Carl Yeksigian
              thobbs Tom Hobbs
              Carl Yeksigian
              Stefania Alborghetti
              Votes: 0
              Watchers: 6