CASSANDRA-1379: Uncached row reads may block cached reads

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Later
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None

      Description

      The cap on the number of concurrent reads appears to limit the total number of in-flight reads rather than just the reads bound for disk. That is, given N concurrent readers, if all N are busy waiting on disk, even reads that could be served from the row cache will block waiting for them.
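      A minimal standalone illustration of the behavior (toy pool size and sleep time, not Cassandra code): with a single bounded read stage, a read that could be answered from the row cache still queues behind disk-bound reads.

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          public class ReadStageBlocking
          {
              public static void main(String[] args) throws Exception
              {
                  // One bounded stage for all reads, like the READ stage.
                  ExecutorService readStage = Executors.newFixedThreadPool(2);

                  // Fill every thread with slow, disk-bound reads.
                  for (int i = 0; i < 2; i++)
                      readStage.submit(() -> sleep(5000)); // simulated disk read

                  long start = System.nanoTime();
                  // A read that could be answered from the row cache still
                  // waits for a free thread behind the disk-bound reads.
                  readStage.submit(() -> {}).get();
                  System.out.printf("cache hit waited %d ms%n",
                                    (System.nanoTime() - start) / 1_000_000);
                  readStage.shutdown();
              }

              private static void sleep(long ms)
              {
                  try { Thread.sleep(ms); } catch (InterruptedException e) { }
              }
          }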

      Attachments

      • CASSANDRA-1379.patch (19 kB, Javier Canillas)

          Activity

          Jonathan Ellis added a comment -

          Are you suggesting splitting the read stage into check-cache and fetch-data stages?

          David King added a comment -

          Not being familiar with the implementation, I'm not sure I'm suggesting anything, but that sounds sane. Or checking the cache first in the previous stage.

          Chris Goffinet added a comment -

          It will also block bloom filter (BF) access. The BF check is fast, but if you ran into a situation where you needed to check whether a large percentage of keys exist in the system at all, this read path could block whenever uncached reads were happening as well. Not sure it's possible to get around this unless you work on CASSANDRA-1576.

          Jonathan Ellis added a comment -

          Let's try splitting ReadCommand.getRow into check-cache-and-BF and fetch-data stages. Both ReadVerbHandler and StorageProxy care about this (for remote and local reads, respectively).
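          A sketch of the proposed split (hypothetical method names, with placeholder types standing in for Cassandra's Row and Table; the actual code differs):

              // Placeholder types standing in for Cassandra's Row and Table.
              interface Row {}
              interface Table
              {
                  Row rowCacheLookup(String key);
                  Row readFromDisk(String key);
              }

              abstract class ReadCommand
              {
                  final String key;
                  ReadCommand(String key) { this.key = key; }

                  // Stage 1: cheap checks only -- row cache (and bloom filter).
                  // Returns null when a disk read is unavoidable.
                  Row checkCacheAndBloomFilter(Table table)
                  {
                      return table.rowCacheLookup(key);
                  }

                  // Stage 2: the potentially disk-bound read, run on a separate
                  // stage so it cannot hold up cache hits.
                  Row fetchData(Table table)
                  {
                      return table.readFromDisk(key);
                  }
              }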

          Javier Canillas added a comment -

          This issue has been on my mind for some time, and since we are using Cassandra in a live environment (not exactly the 0.7 version) I have taken the time to make a patch for you to check. Here is a roadmap of my change:

          Since Table.getRow was checking whether the requested key was in the RowCache before going into the Memtables and SSTables, I added another method called getCachedRow that only checks for the key in the RowCache. If it's not there, a null object (actually a Row whose cf is null) is returned. If you look at the code, you will notice that I didn't remove the cache check in Table.getRow, since I prefer a double check on the cache to a second miss if, for some reason, someone put the row in the cache after the first check.

          With that in mind, I added another public method to ReadCommand and implemented it in SliceByNamesReadCommand and SliceFromReadCommand. This new method simply calls Table.getCachedRow instead of Table.getRow.

          Next I went to ReadVerbHandler and cloned it into a new ReadCacheVerbHandler (I then turned ReadVerbHandler into an abstract class and created a ReadPhyisicalVerbHandler to avoid duplicated code). The new ReadCacheVerbHandler calls ReadCommand.getCachedRow; if that method returns a Row whose cf is null, the command is forwarded to another stage to perform the physical read. This new stage was named PHYSICAL_READ, and a new verb (PHYSICAL_READ) was created for it. The stage itself was cloned from the READ stage (please check whether it needs further tuning).
          To allow the original message to be manipulated and its verb changed, ReadCommand.makeReadMessage was extended to support overriding the messageId and the Verb.
          So, after a cache miss, the read command is forwarded to the PHYSICAL_READ stage (freeing a READ thread), and it is that stage that actually responds with the row; otherwise, the cached row is returned directly. A sketch of this forwarding logic follows below.

          Sorry for the long response and my bad English; I tried to explain the main idea, but you will probably understand the whole mechanism best by inspecting the code.

          Long live Cassandra!
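          A rough sketch of the cache-first handler described above (hypothetical, simplified types; the real patch works with Messages and the stage machinery):

              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Executors;
              import java.util.function.Consumer;

              class ReadCacheVerbHandlerSketch
              {
                  // Stands in for the new PHYSICAL_READ stage.
                  static final ExecutorService PHYSICAL_READ = Executors.newFixedThreadPool(32);

                  static void doVerb(String key, Consumer<String> reply)
                  {
                      String cached = rowCacheLookup(key); // like ReadCommand.getCachedRow
                      if (cached != null)
                      {
                          reply.accept(cached);            // answered on the READ stage
                          return;
                      }
                      // Cache miss: forward to the PHYSICAL_READ stage, freeing this
                      // READ thread immediately; the physical stage sends the reply.
                      PHYSICAL_READ.submit(() -> reply.accept(readFromDisk(key)));
                  }

                  static String rowCacheLookup(String key) { return null; }          // placeholder
                  static String readFromDisk(String key)   { return "row:" + key; }  // placeholder
              }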

          Chris Burroughs added a comment -

          Is there any existing mechanism to detect and measure whether this is occurring?

          Javier Canillas added a comment -

          Currently, with the patch, both kinds of read are timed against the single read counter. Maybe creating separate counters for cache reads and physical reads would be good. On the other hand, I don't remember whether there is a cache-miss counter; if there is, it would surely be updated.
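          A minimal sketch of the separate counters suggested here (hypothetical names; not how Cassandra's metrics are actually wired up):

              import java.util.concurrent.atomic.AtomicLong;

              class ReadMetricsSketch
              {
                  final AtomicLong cacheReads    = new AtomicLong();
                  final AtomicLong physicalReads = new AtomicLong();
                  final AtomicLong cacheMisses  = new AtomicLong();

                  Object read(String key)
                  {
                      Object row = rowCacheLookup(key);
                      if (row != null)
                      {
                          cacheReads.incrementAndGet();   // served from the row cache
                          return row;
                      }
                      cacheMisses.incrementAndGet();      // fell through to disk
                      physicalReads.incrementAndGet();
                      return readFromDisk(key);
                  }

                  Object rowCacheLookup(String key) { return null; } // placeholder
                  Object readFromDisk(String key)   { return key; }  // placeholder
              }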

          Chris Burroughs added a comment -

          So unsurprisingly this patch no longer applies to trunk. Javier, what branch was the patch against originally?

          Also, shouldn't the fix version be the next major release for a change this large?

          Jonathan Ellis added a comment -

          what branch was the patch against originally?

          It would have been trunk-as-of-shortly-after-0.7-release, so it would probably come closest to applying to the 0.7 branch today.

          shouldn't the fix version be the next major release for a change this large?

          Agreed.

          I've been trying to figure out why this ticket makes me uncomfortable, and I think I've figured it out: assuming each client has an identical request distribution, which is the case for almost all applications, pulling the cache check out into a separate stage won't actually help clients make progress when you are I/O bound, since once a client issues a request that is not in cache, it has to wait on the physical read stage anyway.

          So while there's a certain ideological purity to having the cache check on a separate stage, in practice it doesn't actually help you, while still adding a fair amount of complexity to the read path.

          David King added a comment -

          pulling the cache check out to a separate stage won't actually help clients progress any more when you are i/o bound, since once they issue a request-not-in-cache they have to wait for the physical read stage anyway.

          For reddit's use case there are requests that are high priority and almost always memory-bound (comment trees for active pages) and requests that are lower priority and almost always disk-bound (voting buttons on external sites, comment trees for old pages). It's very important that the high-priority tasks not get blocked by lower-priority tasks, and that means not allowing the lower-priority disk-bound tasks to clog up the read stage. It's okay if buttons on external sites slow down. It's not okay if all of the comment pages for extremely active pages slow down.

          Jonathan Ellis added a comment -

          But are the comments 100% rowcached? If not, then this ticket doesn't help you.

          David King added a comment -

          The ones on the active pages are, definitely. And they are the ones that need to not be blocked by slower-loading pages.

          Jonathan Ellis added a comment -

          Oh, right, you created this ticket. So, we do have at least one use case.

          Ryan King added a comment -

          We have use cases along these lines too. We've had to resort to bumping the read threads up much higher (128 or 256) for highly cached workloads.

          Jonathan Ellis added a comment -

          With each Stage costing us 0.1 to 0.2 ms, it doesn't seem worth adding extra latency to all requests for a relatively rare case. Maybe if we solve the latency problem we can revisit this.


    People

    • Assignee: Javier Canillas
    • Reporter: David King
    • Votes: 2
    • Watchers: 4
