Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      To start off the transition, I propose a modest (if not underwhelming) set of changes for stage 1:

      1. Convert read and write request paths to be fully non-blocking, and execute them directly within Netty context, avoiding any thread handoff (CASSANDRA-10993)
      2. Implement our own in-process page cache to complement (1) (CASSANDRA-5863)

      (2) is necessary to enable serving reads for memory-resident rows without handing them off to another stage.

      However, read requests that cannot be served from the cache will have to be handed off to a new thread pool (replacing the old READ stage) that will execute individual ReadCommand instances using blocking I/O.

      The extra thread pool here is unfortunate, but cannot be avoided, as we have to support filesystems that aren't XFS.
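      To make the intended flow concrete, here is a minimal sketch. Names like PageCache, ReadPathSketch, and blockingReadStage are hypothetical placeholders rather than actual Cassandra classes; only ReadCommand corresponds to a real one. A cache hit is served inline on the Netty event loop, while a miss is handed off to the blocking pool:

      import java.util.Optional;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;

      class ReadPathSketch
      {
          // Stand-ins for illustration; not the actual Cassandra types.
          interface Row {}
          interface ReadCommand { Row executeBlocking(); }
          interface PageCache { Optional<Row> get(ReadCommand command); }

          private final PageCache cache;
          // Replacement for the old READ stage: plain threads doing blocking I/O.
          private final ExecutorService blockingReadStage = Executors.newFixedThreadPool(32);

          ReadPathSketch(PageCache cache) { this.cache = cache; }

          // Called on the Netty event loop; must never block it.
          CompletableFuture<Row> execute(ReadCommand command)
          {
              Optional<Row> cached = cache.get(command);
              if (cached.isPresent())
                  return CompletableFuture.completedFuture(cached.get()); // served inline, no handoff
              // Cache miss: hand the blocking disk read off to the dedicated pool.
              return CompletableFuture.supplyAsync(command::executeBlocking, blockingReadStage);
          }
      }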

      For stage 1, we are not going to partition data ownership yet - every worker thread will be able to serve requests for any token. We are also not going to introduce processor affinity, or alter our partition or memtable data structures.

      Memtable flushing, compaction, and repair will not be modified beyond necessary changes caused by CASSANDRA-5863.

      With (1) and (2) combined, we expect to see noticeable improvements for at least CL.ONE reads that can be served from memory, and for RF=1 writes. That, together with not introducing any noticeable performance regressions for other types of requests, is the success criterion for stage 1.

      I should note that we could do more of the transition work in parallel - in particular, have the team work on making other components non-blocking - but I don't want to go that way, for the following reasons:

      • Cassandra is a solid, production-ready database, and should remain so. Introducing too much change in big chunks would make it hard to maintain stability.
      • there is an argument to be made for not having (some of the) maintenance tasks share the event loop with read and write request handling, as they don't necessarily benefit from it (cc Ariel Weisberg, who has an expanded comment prepared on this). Once we are done with stage 1, we will evaluate whether or not we should do that.
      • introducing change progressively would give projects built on Cassandra (Stratio's Lucene-based search, Tuplejump's integration, and DSE) time to catch up and make the necessary changes as they are introduced.

      This ticket will serve as an umbrella issue for all the work necessary for stage 1.

        Issue Links

          Activity

          aweisberg Ariel Weisberg added a comment -

          One of the big benefits of TPC is for small tasks, where the overhead of context switching per task is relatively large.

          For large tasks, the overhead of relying on a thread scheduler and real threads is in the noise. For tasks like compaction, which can operate against immutable data sets, there isn't a lot of complexity overhead in continuing to use a separate thread to access the data. Scheduling decisions and core affinity that you can express with TPC can also be expressed with real threads. The binding requires OS support, but scheduling is something you can have decent control over without OS support.

          I think that TPC delivers the most value for request processing where the data for a request is already available in memory (all writes, some reads). It also delivers a lot of value if you can feasibly schedule hundreds of thousands of disk IO requests asynchronously, but that seems like a hard thing to make happen from Java.

          Another thing I want to advocate for is not assuming that, for cores provisioned to run a single thread, all threads must be completely uniform in their task distribution. There is research demonstrating that narrowing the focus of the tasks assigned to a core can yield better performance, even when you take into account the extra messaging between cores.

          This is important for networking and file IO, which both regularly enter the kernel. It's already proposed to give file IO dedicated threads (possibly cores), and I think it's a good idea to test dedicated vs. separate cores for network IO as well.
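          As a rough illustration of that separation (thread counts and names here are purely hypothetical, not a configuration proposal), network IO could run on its own Netty event loop group while file IO stays on a separate blocking pool:

          import io.netty.channel.nio.NioEventLoopGroup;
          import io.netty.util.concurrent.DefaultThreadFactory;

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          class IoThreadLayout
          {
              // Dedicated event loop threads for network IO; the count is illustrative.
              final NioEventLoopGroup networkGroup =
                  new NioEventLoopGroup(4, new DefaultThreadFactory("network-io"));

              // Separate pool for blocking file IO, so it never runs on an event loop.
              // Actual core binding would still need OS-level support (e.g. taskset or a
              // thread-affinity library); this only separates the thread groups.
              final ExecutorService fileIoPool =
                  Executors.newFixedThreadPool(8, new DefaultThreadFactory("file-io"));
          }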

          My biggest concern with TPC is load skew, especially temporal skew. If you have a completely uniform task distribution (no balancing) and no work stealing, then a big compaction kicking off on one core is going to cut that core's capacity by some substantial percentage. Eventually, all outstanding requests will be for that slow core. You can replace core with node and it's basically the same problem.

          iamaleksey Aleksey Yeschenko added a comment -

          Ariel Weisberg Thanks. These concerns are why I'm hesitant to have us start working on migrating compaction/flushing/streaming immediately - we might decide not to go there in the end.

          There are some theoretical benefits to this platonic ideal of fully homogeneous cores:

          • even simpler data structures (not just single-writer, but single-reader too); the in-process page (or row) cache would otherwise have to be multiple-writer (see the sketch below)
          • the ability to prioritise between all the load on a particular core (e.g. throttle writes if compaction and flushing get behind)
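          To make the first bullet concrete (purely a sketch, not a proposed API - the class and method names are made up): a per-core cache shard owned by a single event-loop thread needs no synchronization at all, while a shared in-process cache has to be safe for multiple writers:

          import java.util.HashMap;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          class CacheSketches
          {
              // Homogeneous-core model: each core owns its shard and only its own
              // event-loop thread ever touches it, so a plain HashMap suffices.
              static class PerCoreCache<K, V>
              {
                  private final Map<K, V> map = new HashMap<>();
                  V get(K key) { return map.get(key); }
                  void put(K key, V value) { map.put(key, value); }
              }

              // Shared model: any worker may read or write, so the structure must be
              // multiple-writer safe (and pays for that coordination on every access).
              static class SharedCache<K, V>
              {
                  private final Map<K, V> map = new ConcurrentHashMap<>();
                  V get(K key) { return map.get(key); }
                  void put(K key, V value) { map.put(key, value); }
              }
          }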

          The major (implicit) downside that you haven't mentioned is the amount of work necessary to move everything to TPC.

          Only once we convert the read and write request paths in stage 1 to be non-blocking, and then introduce ownership partitioning in stage 2, will we be able to experiment with moving maintenance to TPC as well, and decide based on numbers.


            People

            • Assignee:
              Unassigned
            • Reporter:
              iamaleksey Aleksey Yeschenko
            • Votes:
              4
            • Watchers:
              39
