Cassandra
  CASSANDRA-3623

use MMapedBuffer in CompressedSegmentedFile.getSegment

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: 1.1.0
    • Component/s: Core
    • Labels:

      Description

      CompressedSegmentedFile.getSegment seems to open a new file and does not use mmap, which leads to higher CPU on the nodes and higher read latencies.

      This ticket is to implement the TODO mentioned in CompressedRandomAccessReader

      // TODO refactor this to separate concept of "buffer to avoid lots of read() syscalls" and "compression buffer"
      but I think a separate class for the buffer would be better.
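
      As an illustration only (hypothetical class and method names, not the committed patch), the direction proposed here is to map the compressed file once and hand out cheap buffer views from getSegment, instead of opening a new file descriptor and issuing read() syscalls per request:

      // Hypothetical sketch of a mmap-backed segment lookup; illustrative only.
      import java.io.IOException;
      import java.io.RandomAccessFile;
      import java.nio.ByteBuffer;
      import java.nio.MappedByteBuffer;
      import java.nio.channels.FileChannel;

      public class MappedCompressedSegments
      {
          private final MappedByteBuffer mapped;

          public MappedCompressedSegments(String path) throws IOException
          {
              RandomAccessFile file = new RandomAccessFile(path, "r");
              try
              {
                  // Map the whole compressed file once; callers then get cheap
                  // duplicates instead of a new file handle per getSegment call.
                  mapped = file.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, file.length());
              }
              finally
              {
                  file.close();
              }
          }

          /** Returns an independent view positioned at the requested (compressed) offset. */
          public ByteBuffer getSegment(long position)
          {
              ByteBuffer view = mapped.duplicate();
              view.position((int) position); // sketch only: a real version must split files > 2GB across multiple mappings
              return view;
          }
      }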

        Activity

        Vijay added a comment -

        Looks like it is not that efficient when JVM memory is low, so I will close this and revisit it when I have a better solution...

        Pavel Yaskevich added a comment -

        +1 on closing it with wontfix.

        Jonathan Ellis added a comment -

        Since we can't demonstrate a consistent win with this patch I think we either need to close it as wontfix or go back to the drawing board.

        Yuki Morishita added a comment -

        Vijay, Pavel,

        I did a test similar to Pavel's on a physical machine (4-core 2.6GHz Xeon / 16GB RAM / Linux (Debian)) with trunk + 3623 (v3) + 3610 (v3).
        Cassandra is run on the following JVM.

        $ java -version
        java version "1.6.0_26"
        Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
        Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
        

        with jvm args:

        -ea
        -javaagent:bin/../lib/jamm-0.2.5.jar
        -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
        -Xms6G -Xmx6G -Xmn2G -Xss128k
        -XX:+HeapDumpOnOutOfMemoryError
        -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
        -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
        -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199
        -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
        -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true
        

        Populate enough data with the stress tool, set crc_check_chance to 0.0, flush and compact.
        Before each test run, clean the page cache. The stress tool is run from another machine.

        • data_access_mode: mmap
        $ tools/stress/bin/stress -n 500000 -S 1024 -I SnappyCompressor -o read -d node0
        total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
        27487,2748,2748,0.01813206242951213,10
        65226,3773,3773,0.013355361827287422,20
        103145,3791,3791,0.01334416372171528,30
        141092,3794,3794,0.013307842310530199,40
        178981,3788,3788,0.013323840692549289,50
        217062,3808,3808,0.013260129723484152,60
        255020,3795,3795,0.01330330892038569,70
        293075,3805,3805,0.013265825778478518,80
        331046,3797,3797,0.013295910036606884,91
        369059,3801,3801,0.01328353458027517,101
        407030,3797,3797,0.01329540965473651,111
        444920,3789,3789,0.013323251517550806,121
        482894,3797,3797,0.013299231052825617,131
        500000,1710,1710,0.010978779375657664,136
        END
        
        • data_access_mode: standard
        $ tools/stress/bin/stress -n 500000 -S 1024 -I SnappyCompressor -o read -d node0
        total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
        25474,2547,2547,0.019527989322446416,10
        117046,9157,9157,0.005506617743415018,20
        211863,9481,9481,0.005313298248204436,30
        306773,9491,9491,0.005311305447265831,40
        401107,9433,9433,0.005341160133143935,50
        496051,9494,9494,0.005200739383215369,60
        500000,394,394,0.0019680931881488986,61
        END
        

        I ran the above several times (making sure each test was isolated); each iteration I observed about the same result.

        Things I noticed when digging with VisualVM

        • Snappy decompression with direct ByteBuffers seems slightly faster, but its impact on overall read performance is negligible.
        • I observed that CompressedMappedFileDataInput.reBuffer is called many times, especially from the path CMFDI.reset -> CMFDI.seek -> CMFDI.reBuffer (see the sketch below).
        • When using CMFDI, I observe higher CPU usage overall than with CRAR.

        Right now I cannot find a reason to use mmapped ByteBuffers for compressed files.
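
        As a rough illustration of the call pattern described above (hypothetical code, not the actual CompressedMappedFileDataInput), a reset() implemented as a plain seek() pays for a full reBuffer (chunk lookup plus decompression) even when the target position is inside the buffer that is already loaded:

        // Hypothetical illustration of the reset() -> seek() -> reBuffer() cost; not Cassandra code.
        abstract class ChunkedReaderSketch
        {
            protected long bufferStart;   // absolute file position of the current buffer
            protected int bufferOffset;   // cursor within the current buffer
            private long markedPosition;

            public void mark()
            {
                markedPosition = bufferStart + bufferOffset;
            }

            public void reset()
            {
                // Naive reset always delegates to seek(), so even a rewind within the
                // current chunk triggers another reBuffer().
                seek(markedPosition);
            }

            public void seek(long position)
            {
                bufferStart = position;
                bufferOffset = 0;
                reBuffer(); // expensive: locate the chunk, remap, decompress
            }

            /** Loads and decompresses the chunk containing bufferStart. */
            protected abstract void reBuffer();
        }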

        Vijay added a comment -

        Complete isolation => delete files and load again; and yes, my scripts do clean up the page caches.

        Pavel Yaskevich added a comment -

        The attached test results were done in complete isolation from one another.

        That doesn't mean that a cache drop is not required, though.

        Vijay added a comment -

        >>> Vijay, maybe you weren't dropping page cache between tests?
        I did (the only time I don't drop my pages is when I work on my laptop). The attached test results were done in complete isolation from one another.

        Yuki, please note that while using trunk... please set the CF setting crc_check_chance: 0.0 before loading any data.

        Thanks!

        Pavel Yaskevich added a comment -

        I still don't think that this is a good idea, because the tests don't show any significant improvement in performance, and Java still has a very limited arsenal of functionality for working with mmap'ed files: the program doesn't have full control over ByteBuffers sharing mmap'ed memory, which could lead to problems like CASSANDRA-3179.

        By the way, Vijay, maybe you weren't dropping page cache between tests? Let's see what Yuki has to say.

        Jonathan Ellis added a comment -

        I'll ask Yuki to take a look.

        Vijay added a comment -

        Hi Jonathan,
        The env I was testing on was XFS with the NOOP scheduler (and a JVM heap of 12/2). I also have a lot of memory (70G) to spare on those boxes...
        Having said that, I am not sure whether I am the only one who would benefit from this patch... maybe a 3rd opinion might help.

        Jonathan Ellis added a comment -

        What's the verdict here? Do we need to get a 3rd opinion?

        Pavel Yaskevich added a comment -

        This is not so surprising to me in the situation where the working set does not fit into memory, which is not a rare case; I had expected mapped I/O to be slightly better. I also wanted to mention that although the patch doesn't copy from kernel space to user space, it does buffer duplication, which means higher object allocation rates, which would affect performance.

        Vijay added a comment -

        I can't believe you get worse performance even with 2x less copying and less GC (because we don't copy into the JVM and out of it to Snappy). There is definitely something wrong here; none of my tests show anything <= the mmap I/O, and running it once or twice never produced anything equal to the regular I/O performance.
        Are you running into memory pressure with mmap I/O? Anyway...

        Pavel Yaskevich added a comment -

        The MMappedIO-Performance.docx test for 10,000 columnSize is misleading again because you actually use -S 100000, which is 10 times bigger than the suggested 10,000.

        Tested v3 of this patch + 3610 (v3) and 3611 (v4) with disk_access_mode: mmap (and standard) and crc_check_chance: 0.0 on my real machine with 2GB RAM and a Quad-Core AMD Opteron processor under Debian (2.6.35) GNU/Linux.

        Used the stress tool to populate the db with "./bin/stress -n 300000 -S 512 -I SnappyCompressor" and right after that ran the following from the CLI:

        update column family Standard1 with compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor', 'crc_check_chance':'0.0'};

        Then I made sure everything was flushed/compacted and stopped Cassandra. Please note that the generated data does not entirely fit into the page cache.

        Test #1:

        1. sync && echo 1 > /proc/sys/vm/drop_caches
        2. changed ./conf/cassandra.yaml with "disk_access_mode: mmap"
        3. started Cassandra
        4. run `./bin/stress -n 300000 -S 512 -o read`
          total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
          16438,1643,1643,0.029554629516972866,10
          40997,2455,2455,0.020681908872511097,20
          66256,2525,2525,0.020270200720535255,30
          90857,2460,2460,0.020607454981504816,41
          115779,2492,2492,0.020273372923521386,51
          141033,2525,2525,0.020168923734853884,61
          166268,2523,2523,0.020269823657618386,72
          191018,2475,2475,0.02026589898989899,82
          216367,2534,2534,0.020031519981064342,92
          241153,2478,2478,0.020092875010086338,102
          265959,2480,2480,0.020124244134483594,113
          290228,2426,2426,0.019975400716964027,123
          300000,977,977,0.012085448219402373,127
          
        5. run #4 once again to see how populated page cache affected performance
          total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
          50913,5091,5091,0.0036437844951191247,10
          106548,5563,5563,0.003795344657140289,20
          164274,5772,5772,0.0050692928662994146,30
          220312,5603,5603,0.003771262357685856,40
          276125,5581,5581,0.0037274111766076004,50
          300000,2387,2387,0.003665089005235602,55
          

        Test #2:

        1. sync && echo 1 > /proc/sys/vm/drop_caches
        2. changed ./conf/cassandra.yaml with "disk_access_mode: standard"
        3. started Cassandra
        4. run `./bin/stress -n 300000 -S 512 -o read`
          total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
          36048,3604,3604,0.00862633155792277,10
          92134,5608,5608,0.004530007488499804,20
          148475,5634,5634,0.004739603485916118,30
          204987,5651,5651,0.004508653029445074,40
          262779,5779,5779,0.004955564784053157,51
          300000,3722,3722,0.004320276188173343,57
          
        5. run #4 once again to see how populated page cache affected performance
          pavel1:/usr/src/cassandra/tools/stress# ./bin/stress -n 300000 -S 512 -I SnappyCompressor -o read
          total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
          50151,5015,5015,0.004033399134613467,10
          105726,5557,5557,0.0039961673414304994,20
          162237,5651,5651,0.003965387269735096,30
          218366,5612,5612,0.003923764898715459,40
          274388,5602,5602,0.003912695012673592,50
          300000,2561,2561,0.0034509995314696237,55
          

        I did re-run the mmap test on a cold page cache a few times to make sure that it's the real behavior. The test shows that mmap and standard I/O are not really different on my machine, and mmap'ed I/O performs worse on a cold cache; the same effect would hold in situations with a high number of page faults.

        Vijay added a comment -

        Done,
        1) fixed the data for 10K
        2) rebased 3610

        Thanks!

        Pavel Yaskevich added a comment -

        I ask because you mentioned previously that you did the tests on a 12-node cluster. Test results in the cloud depend on your neighbours, which is why I/O could differ dramatically as it does in your tests. Let's settle CASSANDRA-3611 (and CASSANDRA-3610) and I will test it again on a real machine.

        Vijay added a comment -

        " Also I took a look at the doc you have attached and it looks like test for 10000 is broken because stress command line shows that you use -S 3000 instead of 10000."
        I will fix it.

        "Also as I mentioned before - you test on the different nodes on the working cluster, there are side factors that could be affecting test results. Can you please explain why testing performance on the working cluster is a good idea?"

        How do you know it is a working cluster? They are individual machine isolated without any network access to any other machine. There isnt anything which is been shared between those machines (They are VM's from the diffrent servers than the results which i have ever published). I created this test in different just to make a clean environment with cold cache (other option is to reset the mmap which i dont want to do).

        I know you have your doubts but I am not that bad

        Pavel Yaskevich added a comment -

        CASSANDRA-3610 needs a rebase to be applied on the latest trunk. Also, I took a look at the doc you attached, and it looks like the test for 10000 is broken because the stress command line shows that you use -S 3000 instead of 10000.

        Compressed Reads: *10,000* columnSize:
        [vijay_tcasstest@vijay_tcass--1a-i-2801d94a ~]$ java -Xms2G -Xmx2G -Xmn1G -XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -jar Stress.jar -p 7102 -d 10.87.81.75 -n 500000 -S *3000* -I SnappyCompressor -o read
        

        Also, as I mentioned before, you test on different nodes of a working cluster; there are side factors that could be affecting the test results. Can you please explain why testing performance on a working cluster is a good idea?

        Vijay added a comment (edited) -

        Alright, I think I found the missing pieces:
        1) Please reapply v2 from CASSANDRA-3611 (which also depends on CASSANDRA-3610).
        2) Please reapply v3, which has the mark() (this seems to be used by range slice, and the stress tool does it).
        3) Please set the CRC check chance to 0.0 via an update of crc_check_chance - we need to do this before the SSTables are created, otherwise it won't take effect (the update statements I used are in the *.doc attached).
        You might not see any difference if it is not set, because that's a big bottleneck.
        4) I used the Sun JDK for the test.

        The test results are attached; let me know in case of any questions... the performance seems to be better.

        I used the stress test so we are on the same page, and when the column size or the range of columns to be fetched increases, the performance gets better (rebuffers).

        Pavel Yaskevich added a comment -

        Meanwhile, your claim here is that the snappy library is taking more CPU because we give it a DirectBB?

        First of all, I don't claim that it takes more CPU; I claim that it takes longer to decompress data compared to normal reads. Second, I don't think it's a problem with the direct BB itself (btw, there is no way to pass a non-direct buffer) but instead with mmap'ed I/O in that case.

        Can you please confirm that you tried v2 and it gives worse performance than trunk, and that it is Linux (v1 doesn't give better performance gains whereas v2 does)?

        Yes, I tried v2 and it wasn't easy: first of all it wasn't rebased, then I figured out that I needed to apply CASSANDRA-3611 and change the call to FBUtilities.newCRC32() to "new CRC32()" for it to compile. After that I added "disk_access_mode: mmap" to conf/cassandra.yaml, used stress "./bin/stress -n 300000 -S 512 -I SnappyCompressor" to insert test data (which doesn't fit into the page cache), and tried to read with "./bin/stress -n 300000 -I SnappyCompressor -o read" but got the following exceptions:

        java.lang.RuntimeException: java.lang.UnsupportedOperationException
        	at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1283)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        	at java.lang.Thread.run(Thread.java:662)
        Caused by: java.lang.UnsupportedOperationException
        	at org.apache.cassandra.io.compress.CompressedMappedFileDataInput.mark(CompressedMappedFileDataInput.java:212)
        	at org.apache.cassandra.db.columniterator.SimpleSliceReader.<init>(SimpleSliceReader.java:62)
        	at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:90)
        	at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:66)
        	at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:66)
        	at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:78)
        	at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:232)
        	at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:62)
        	at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1283)
        	at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1169)
        	at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1136)
        	at org.apache.cassandra.db.Table.getRow(Table.java:375)
        	at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
        	at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:800)
        	at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1279)
        	... 3 more
        

        and

        java.lang.RuntimeException: java.lang.UnsupportedOperationException
        	at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1283)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        	at java.lang.Thread.run(Thread.java:662)
        Caused by: java.lang.UnsupportedOperationException
        	at org.apache.cassandra.io.compress.CompressedMappedFileDataInput.reset(CompressedMappedFileDataInput.java:207)
        	at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:78)
        	at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:40)
        	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
        	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
        	at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:107)
        	at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:145)
        	at org.apache.cassandra.utils.MergeIterator$ManyToOne.<init>(MergeIterator.java:88)
        	at org.apache.cassandra.utils.MergeIterator.get(MergeIterator.java:47)
        	at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:137)
        	at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:246)
        	at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:62)
        	at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1283)
        	at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1169)
        	at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1136)
        	at org.apache.cassandra.db.Table.getRow(Table.java:375)
        	at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
        	at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:800)
        	at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1279)
        	... 3 more
        

        After I managed to implement the mark()/reset() methods I got the following results: current trunk 67 sec and your patch 101 sec to run the read on 300000 rows. I tested everything on a server without any network interference, and it seems that my results are cleaner of side effects than yours. I'm still not convinced that mmap'ed I/O is better for compressed data than syscalls, and I know that it has side effects that we can't control from Java (mentioned above), so I'm waiting for convincing results or we should close this ticket...
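
        For reference, a minimal sketch of the kind of mark()/reset() support whose absence produced the UnsupportedOperationException above (hypothetical code; the methods actually added may differ): remember the absolute position on mark() and jump back to it on reset().

        // Hypothetical mark()/reset() over a buffer view; illustrative only.
        import java.nio.ByteBuffer;

        class MappedDataInputSketch
        {
            private final ByteBuffer buffer;   // view over the mapped (or decompressed) segment
            private int markedPosition = -1;

            MappedDataInputSketch(ByteBuffer buffer)
            {
                this.buffer = buffer;
            }

            /** Remembers the current position so the caller can rewind later. */
            public void mark()
            {
                markedPosition = buffer.position();
            }

            /** Rewinds to the last mark; mark() must have been called first. */
            public void reset()
            {
                if (markedPosition < 0)
                    throw new IllegalStateException("reset() called before mark()");
                buffer.position(markedPosition);
            }

            public int bytesPastMark()
            {
                return buffer.position() - markedPosition;
            }
        }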

        Vijay added a comment -

        The network traffic is because the test has 12 nodes and each is responding to the read requests; all the nodes are showing similar data, and if you want I can send you the data.

        Meanwhile, your claim here is that the snappy library is taking more CPU because we give it a DirectBB?
        Can you please confirm that you tried v2 and it gives worse performance than trunk, and that it is Linux (v1 doesn't give better performance gains whereas v2 does)?

        If yes, I can close this ticket; maybe it is just me/AWS (unknown reason), which I am not sure about, and I don't have bare metal to test on.

        Pavel Yaskevich added a comment -

        Isn't the node with the higher NetRxKb and NetTxKb more loaded? Can you try to collect statistics on the same node or on comparably loaded nodes?

        Vijay added a comment -

        Note: read Latency is from the NodeTool cfstats.

        Vijay added a comment -

        it does,
        `/sbin/ifconfig eth0 | grep "RX bytes" | cut -d: -f2 | awk '{ printf $1 }'`
        `/sbin/ifconfig eth0 | grep "TX bytes" | awk '{print $6}' | cut -d: -f2 | awk '{ printf $1 }'`

        Pavel Yaskevich added a comment -

        I understand that those are for the network, but I'm asking what they mean in our case. Is this the average time to receive data from other nodes? The reason I ask is that RdLat directly correlates with those columns, so I'm trying to understand what that implies. I tried your patch on a single node and I don't see such a dramatic performance increase with data fitting into memory, and with data bigger than memory (the latter case) trunk with your patch performs even a bit worse than the current code.

        Vijay added a comment -

        NetRxKb and NetTxKb show the network stats... this tool is a script which was written by Denis Sheahan <dsheahan@netflix.com> and is used internally to measure performance.
        The hot methods are from http://java.sun.com/developer/technicalArticles/Programming/perfanal/ (look for caller/callee).

        Even though I don't agree with the explanation, how do you explain 50% better read response times?
        I am not trying to optimize only for my use case, and if you have tried it and still think it is bad, let me know and I can close this ticket. Thanks

        Pavel Yaskevich added a comment -

        Pavel, it doesn't show the opposite; it actually shows that 98% of the time is spent in the snappy library and only 2% in the remaining part of the code, whereas in the earlier case we spent 58% of the time in Snappy and the rest in the other parts of the code. Snappy/decompression is definitely the bottleneck... all I am saying is that now we are more efficient and that's the only bottleneck.

        Naked percentages do not say anything; that is why the tool also shows the time spent in the method, and it says that with your patch Snappy performance noticeably degraded. What I'm trying to say is, you might be trying to optimize the wrong thing, which could lead to degraded decompression performance, unpredictable consequences for SSTable release, and possibly other implications that we don't see right now.

        Also I wanted to ask you: what do "NetRxKb" and "NetTxKb" show in this case (sorry I didn't use the tool you are using)? Looks like "RdLat" correlates with those properties.

        Please note I am not selling this patch; I am trying to find better performance for our use case, which needs compression... I am completely open to other options.

        I understand that, but we can't really commit code that optimizes one specific use case and could have bad implications for other people.

        Vijay added a comment -

        Constant performance => not a lot of difference between the 95th percentile and the average. Before the patch there was a huge swing between those. Data is shown above.

        Please note I am not selling this patch; I am trying to find better performance for our use case, which needs compression... I am completely open to other options.

        Vijay added a comment -

        Pavel, it doesn't show the opposite; it actually shows that 98% of the time is spent in the snappy library and only 2% in the remaining part of the code, whereas in the earlier case we spent 58% of the time in Snappy and the rest in the other parts of the code. Snappy/decompression is definitely the bottleneck... all I am saying is that now we are more efficient and that's the only bottleneck.

        "Did you mean compressed instead of uncompressed here?"
        Yes, I meant compressed.

        Please try a test before and after the patch and you will see what I am talking about. I did run the cluster test for a long time (before and after; there isn't any other variable in play here), and after this patch it shows constant performance and doesn't vary a lot (response times after the patch).

        Pavel Yaskevich added a comment -

        We do get something like 50% better latencies by doing mmap'ed reads without copying the data.

        But the hot methods show the opposite: the main thing that hurts performance in the normal read case is not memcpy but reader class initialization overhead.

        Snappy is 1.6% more because there isn't anything else holding it up or any other overhead.

        I don't get what you mean here, can you please elaborate? Slower Snappy execution, in my opinion, could be caused by the additional expense of mapping data into user space under a churning page cache (the situation where the dataset does not fit in the page cache); mmap'ed I/O in that case makes the kernel do more work compared to syscalls (normal I/O).

        Currently with this patch we don't have to copy any uncompressed data, but the CRAR will copy because we don't hand the DirectBB to Snappy, and that's made possible by using mmapped I/O.

        Did you mean compressed instead of uncompressed here?

        Vijay added a comment -

        Regarding duplicates, I was thinking of creating the duplicates in CMSF and having a helper function to track them.

        Regarding hot reads (I tried before; you have to access the FD, and caching the initialized object didn't help): We do get something like 50% better latencies by doing mmap'ed reads without copying the data. Snappy is 1.6% more because there isn't anything else holding it up or any other overhead.

        Currently with this patch we don't have to copy any uncompressed data, but the CRAR will copy because we don't hand the DirectBB to Snappy, and that's made possible by using mmapped I/O.
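
        For context on handing a DirectBB to Snappy: snappy-java has a ByteBuffer-based uncompress overload that requires direct buffers on both sides, which is what would let a slice of a mapped (direct) compressed buffer be decompressed without first copying it into a heap byte[]. A minimal sketch (illustrative only, not the actual CRAR change):

        // Decompressing straight from a direct ByteBuffer with snappy-java; sketch only.
        import java.io.IOException;
        import java.nio.ByteBuffer;
        import org.xerial.snappy.Snappy;

        public class DirectSnappySketch
        {
            public static ByteBuffer uncompressChunk(ByteBuffer compressedChunk, int uncompressedLength) throws IOException
            {
                // Both buffers must be direct for this overload; the compressed side can be
                // a duplicate/slice of a MappedByteBuffer, so no intermediate byte[] copy is needed.
                ByteBuffer uncompressed = ByteBuffer.allocateDirect(uncompressedLength);
                Snappy.uncompress(compressedChunk, uncompressed);
                return uncompressed;
            }
        }

        In this sketch the decompressed bytes also end up in a direct buffer, which whatever consumes the segment then has to handle.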

        Pavel Yaskevich added a comment -

        Hot reads show that if we remove the overhead of the CRAR and RAR initialization we would get numbers very close to mmap'ed I/O; also, as you can see, Snappy takes ~1.6x the time with mmap'ed I/O.

        Pavel Yaskevich added a comment -

        The problem is that you can't remove duplicate() because the same segment can be requested concurrently by different reads and we don't want to limit concurrency with synchronisation over segment use.
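
        To illustrate why duplicate() is there at all (a generic Java example, not Cassandra code): duplicates share the same underlying, possibly mapped, memory but carry independent position/limit cursors, so concurrent readers of one segment do not trample each other's position.

        // Generic demonstration of ByteBuffer.duplicate() giving independent cursors.
        import java.nio.ByteBuffer;

        public class DuplicateDemo
        {
            public static void main(String[] args)
            {
                ByteBuffer shared = ByteBuffer.allocateDirect(16);
                for (int i = 0; i < 16; i++)
                    shared.put((byte) i);

                // Two independent views over the same memory; moving one cursor
                // does not affect the other, so no synchronisation is needed.
                ByteBuffer readerA = shared.duplicate();
                ByteBuffer readerB = shared.duplicate();
                readerA.position(4);
                readerB.position(12);

                System.out.println(readerA.get()); // prints 4
                System.out.println(readerB.get()); // prints 12
            }
        }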

        Vijay added a comment -

        BTW: I can remove the duplicate() (I didn't realize the implications), if you think the rest is fine.

        Vijay added a comment -

        I did it again; I confused everyone with my test data.
        The hot methods shown above are the only data from trunk; the rest are without CRC (hot methods without CRC and without this patch are as follows).

        Excl. User CPU

        sec.      %       Name
        629.460 100.00 <Total>
        336.913 53.52 <static>@0x54999 (<snappy-1.0.4.1-libsnappyjava.so>)
        50.074 7.96 org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(java.lang.String, org.apache.cassandra.io.compress.CompressionMetadata, boolean)
        43.057 6.84 org.apache.cassandra.io.util.RandomAccessReader.<init>(java.io.File, int, boolean)
        35.623 5.66 memcpy
        33.555 5.33 <static>@0xd8e9 (<libpthread-2.5.so>)
        30.673 4.87 Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned long)
        26.384 4.19 CompactibleFreeListSpace::block_size(const HeapWord*) const
        15.199 2.41 SpinPause
        11.966 1.90 BlockOffsetArrayNonContigSpace::block_start_unsafe(const void*) const
        8.479 1.35 CardTableModRefBSForCTRS::card_will_be_scanned(signed char)
        8.007 1.27 CardTableModRefBS::non_clean_card_iterate_work(MemRegion, MemRegionClosure*, bool)
        5.169 0.82 madvise
        5.059 0.80 ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
        4.146 0.66 CardTableModRefBS::process_chunk_boundaries(Space*, DirtyCardToOopClosure*, MemRegion, MemRegion, signed char**, unsigned long, unsigned long)
        2.431 0.39 CardTableModRefBS::dirty_card_range_after_reset(MemRegion, bool, int)
        1.375 0.22 SweepClosure::do_blk_careful(HeapWord*)
        0.825 0.13 Par_PushOrMarkClosure::do_oop(oopDesc*)
        0.616 0.10 GenericTaskQueue<oopDesc*, 131072>::pop_local(oopDesc*&)
        0.561 0.09 instanceKlass::oop_oop_iterate_nv(oopDesc*, Par_PushOrMarkClosure*)
        0.473 0.08 CardTableModRefBS::process_stride(Space*, MemRegion, int, int, DirtyCardToOopClosure*, MemRegionClosure*, bool, signed char**, unsigned long, unsigned long)
        0.374 0.06 Par_MarkFromRootsClosure::scan_oops_in_oop(HeapWord*)
        0.319 0.05 BitMap::par_at_put(unsigned long, bool)
        0.308 0.05 MemRegion::intersection(MemRegion) const
        0.275 0.04 munmap
        0.220 0.03 CardTableModRefBS::dirty_card_iterate(MemRegion, MemRegionClosure*)

        Hope this makes sense.

        Pavel Yaskevich added a comment -

        Can you please compare your version against trunk without crc32? Otherwise it doesn't seem to be a fair match; it would be nice to see the same statistics on hot methods and response times. The thing that I hate about MappedByteBuffer is that if you duplicate it, as you do in reBuffer(), unmapping becomes impossible until every last duplicate is GC'ed, which implies that we won't be able to release old SSTables...
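
        A minimal illustrative sketch of the MappedByteBuffer behaviour described above (not code from the patch; the file name is a placeholder): duplicate() creates an independent view that shares the same underlying OS mapping, and since the JDK exposes no explicit unmap, the mapping - and the file behind it - can only be released once every such view has been garbage collected.

        import java.io.RandomAccessFile;
        import java.nio.ByteBuffer;
        import java.nio.MappedByteBuffer;
        import java.nio.channels.FileChannel;

        public class DuplicateMappingSketch
        {
            public static void main(String[] args) throws Exception
            {
                // "Data.db" is a placeholder path, not a file from the patch.
                try (RandomAccessFile raf = new RandomAccessFile("Data.db", "r");
                     FileChannel channel = raf.getChannel())
                {
                    long size = Math.min(4096, raf.length());
                    MappedByteBuffer segment = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);

                    // duplicate() gets its own position/limit but shares the same OS mapping.
                    ByteBuffer view = segment.duplicate();

                    if (view.hasRemaining())
                        System.out.println(view.get(0)); // reads hit the page cache directly
                }
                // Closing the channel does not unmap 'segment' or 'view'; the mapping
                // (and the file behind it) is pinned until both are GC'ed, which is the
                // concern about releasing old SSTables.
            }
        }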

        Vijay added a comment -

        The above test was done on a 12-node cluster, but the response times and hot methods were collected from one random node in the cluster.
        The test was executed on AWS M2.4xl instances with heap settings of 12/2.

        Vijay added a comment (edited) -

        Hot Methods before the patch (trunk, without any patch):
        Excl. User CPU Name

        sec. %
        1480.474 100.00 <Total>
        756.717 51.11 crc32
        387.767 26.19 <static>@0x54999 (<snappy-1.0.4.1-libsnappyjava.so>)
        54.814 3.70 org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(java.lang.String, org.apache.cassandra.io.compress.CompressionMetadata, boolean)
        46.676 3.15 org.apache.cassandra.io.util.RandomAccessReader.<init>(java.io.File, int, boolean)
        45.697 3.09 Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned long)
        39.417 2.66 memcpy
        36.931 2.49 <static>@0xd8e9 (<libpthread-2.5.so>)
        23.272 1.57 CompactibleFreeListSpace::block_size(const HeapWord*) const
        22.766 1.54 SpinPause
        12.593 0.85 BlockOffsetArrayNonContigSpace::block_start_unsafe(const void*) const
        9.304 0.63 CardTableModRefBSForCTRS::card_will_be_scanned(signed char)
        8.468 0.57 CardTableModRefBS::non_clean_card_iterate_work(MemRegion, MemRegionClosure*, bool)
        8.051 0.54 ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
        5.400 0.36 madvise
        4.619 0.31 CardTableModRefBS::process_chunk_boundaries(Space*, DirtyCardToOopClosure*, MemRegion, MemRegion, signed char**, unsigned long, unsigned long)
        1.584 0.11 CardTableModRefBS::dirty_card_range_after_reset(MemRegion, bool, int)
        1.551 0.10 SweepClosure::do_blk_careful(HeapWord*)

        Hot Methods After the patch:
        sec. %
        537.681 100.00 <Total>
        529.719 98.52 <static>@0x54999 (<snappy-1.0.4.1-libsnappyjava.so>)
        4.168 0.78 memcpy
        0.143 0.03 <Unknown>
        0.121 0.02 send
        0.121 0.02 sun.misc.Unsafe.park(boolean, long)
        0.110 0.02 sun.misc.Unsafe.unpark(java.lang.Object)
        0.088 0.02 Interpreter
        0.077 0.01 org.apache.cassandra.utils.EstimatedHistogram.max()
        0.077 0.01 recv
        0.066 0.01 SpinPause
        0.055 0.01 org.apache.cassandra.utils.EstimatedHistogram.mean()
        0.044 0.01 java.lang.Object.wait(long)
        0.044 0.01 org.apache.cassandra.utils.EstimatedHistogram.min()
        0.044 0.01 __pthread_cond_signal
        0.044 0.01 vtable stub
        0.033 0.01 java.lang.Object.notify()
        0.033 0.01 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(java.lang.Runnable)
        0.033 0.01 org.apache.cassandra.io.compress.CompressedMappedFileDataInput.read()
        0.033 0.01 PhaseLive::compute(unsigned)
        0.033 0.01 poll
        0.022 0.00 Arena::contains(const void*) const
        0.022 0.00 CompactibleFreeListSpace::free() const
        0.022 0.00 I2C/C2I adapters
        0.022 0.00 IndexSetIterator::advance_and_next()
        0.022 0.00 java.lang.Class.forName0(java.lang.String, boolean, java.lang.ClassLoader)
        0.022 0.00 java.lang.Long.getChars(long, int, char[])
        0.022 0.00 java.nio.Bits.swap(int)

        Before this patch response times (With crc chance set to 0):
        Epoch Rds/s RdLat Wrts/s WrtLat %user %sys %idle %iowait %steal md0r/s w/s rMB/s wMB/s NetRxKb NetTxKb Percentiles Read Write Compacts
        1324587443 15 186.305 0 0.000 27.85 0.02 71.83 0.24 0.05 3.89 0.00 0.12 0.00 41 45 99th 545.791 ms 95th 454.826 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324587455 15 1142.712 0 0.000 39.55 0.13 57.61 2.50 0.21 118.30 0.30 2.20 0.00 34 36 99th 8409.007 ms 95th 8409.007 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324587467 10 171.808 0 0.000 23.83 0.04 76.05 0.04 0.05 4.80 0.00 0.14 0.00 127 33 99th 454.826 ms 95th 315.852 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324587478 10 182.775 0 0.000 20.43 0.04 79.47 0.01 0.05 1.60 0.40 0.04 0.00 30 37 99th 379.022 ms 95th 379.022 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324587490 13 190.893 0 0.000 27.58 0.03 72.20 0.14 0.06 3.20 0.50 0.09 0.00 39 42 99th 545.791 ms 95th 379.022 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324587503 28 358.719 0 0.000 52.24 0.08 46.20 1.40 0.09 159.40 0.00 3.16 0.00 196 71 99th 3379.391 ms 95th 943.127 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324587517 13 194.281 0 0.000 16.68 0.02 83.23 0.04 0.02 2.40 0.30 0.07 0.00 38 41 99th 785.939 ms 95th 545.791 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324587535 36 662.410 0 0.000 58.34 0.08 41.42 0.06 0.10 3.60 0.20 0.11 0.00 173 81 99th 3379.391 ms 95th 2816.159 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324587547 22 189.838 0 0.000 37.68 0.05 62.03 0.16 0.09 5.32 0.49 0.16 0.00 56 63 99th 454.826 ms 95th 379.022 ms 99th 0.00 ms 95th 0.00 ms Pen/0

        After this patch response times:
        Epoch Rds/s RdLat Wrts/s WrtLat %user %sys %idle %iowait %steal md0r/s w/s rMB/s wMB/s NetRxKb NetTxKb Percentiles Read Write Compacts
        1324665227 18 97.724 0 0.000 21.49 0.02 78.40 0.05 0.04 4.00 0.40 0.12 0.00 167 45 99th 152.321 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665239 26 107.279 0 0.000 29.57 0.04 70.18 0.16 0.05 8.70 0.00 0.22 0.00 56 60 99th 219.342 ms 95th 152.321 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665251 27 105.965 0 0.000 28.37 0.05 70.97 0.54 0.08 6.49 0.60 0.11 0.00 70 73 99th 182.785 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665262 21 103.396 0 0.000 22.84 0.03 77.08 0.01 0.04 0.80 0.10 0.03 0.00 43 46 99th 126.934 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665274 27 104.916 0 0.000 32.78 0.04 67.06 0.06 0.06 7.70 0.10 0.14 0.00 161 64 99th 182.785 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665286 21 105.094 0 0.000 21.33 0.01 78.53 0.09 0.04 3.49 0.30 0.10 0.00 47 51 99th 182.785 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665297 21 104.898 0 0.000 22.95 0.01 76.91 0.10 0.03 4.40 0.00 0.12 0.00 46 48 99th 182.785 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665309 25 104.844 0 0.000 27.31 0.03 72.53 0.09 0.05 4.00 0.60 0.12 0.00 199 71 99th 152.321 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665321 26 106.604 0 0.000 32.63 0.05 66.99 0.27 0.06 5.40 0.10 0.11 0.00 54 57 99th 219.342 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665332 21 104.086 0 0.000 24.66 0.01 75.19 0.10 0.04 3.30 0.00 0.10 0.00 146 51 99th 152.321 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665344 24 108.079 0 0.000 29.26 0.04 70.50 0.15 0.06 3.10 0.40 0.09 0.00 56 59 99th 219.342 ms 95th 152.321 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665356 32 105.465 0 0.000 32.67 0.04 66.97 0.25 0.08 8.80 0.00 0.11 0.00 60 63 99th 182.785 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665368 15 103.112 0 0.000 16.61 0.03 83.33 0.01 0.03 0.80 0.40 0.02 0.00 48 53 99th 126.934 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0
        1324665379 24 104.599 0 0.000 25.87 0.03 74.05 0.01 0.05 2.59 0.10 0.08 0.00 51 54 99th 182.785 ms 95th 126.934 ms 99th 0.00 ms 95th 0.00 ms Pen/0

        Looks like we have 50% better performance with this.

        Pavel, you are right; the biggest gain was because we reduced the memcpy.

        Vijay added a comment -

        The attached patch has a memcpy optimization which the earlier one didn't.

        Performance:
        Current trunk: 400+ms Avg
        Removing CRC (CASSANDRA-3611): 200+ms Avg
        With this patch: 100+ms Avg

        Vijay added a comment -

        Pavel, I already have the hot methods (this patch removes the method org.apache.cassandra.io.util.RandomAccessReader.<init>(java.io.File, int, boolean)) before and after... I will do an updated patch without the memcpy and share the results.

        Pavel Yaskevich added a comment -

        I'd like to see tests and numbers like we did for the BRAF-related tickets CASSANDRA-1902 and CASSANDRA-1714 to be convinced.

        Sylvain Lebresne added a comment -

        When I was originally checking the compression code, I did a few quick stress tests under yourkit, and in cases where the data was small and thus entirely in page cache, the cost of creating a RandomAccessFile on each read was taking the majority of the time. Note that I'm not saying this to justify this ticket necessarily, since:

        1. we could use an object pool to avoid that cost
        2. my tests were really toy tests, so they would need confirmation

        But mmapping does avoid this cost. Just saying.

        I'll also note that Snappy has a way to decompress data from a direct ByteBuffer directly (snappydoc), so this could potentially avoid one copy (we would go from the page cache to the decompressed buffer directly). Of course we should look at how well that works, but again, just to feed the discussion.
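
        For reference, a minimal sketch of the direct-ByteBuffer decompression path mentioned above, based on the snappy-java ByteBuffer API (Snappy.uncompressedLength and Snappy.uncompress); the helper class and its names are illustrative assumptions, not code from the patch.

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import org.xerial.snappy.Snappy;

        public class DirectDecompressSketch
        {
            // 'compressedChunk' would be a slice of the mmap'ed compressed file covering one
            // chunk; a MappedByteBuffer slice is a direct buffer, which is what snappy-java's
            // ByteBuffer API requires on both sides.
            static ByteBuffer decompressChunk(ByteBuffer compressedChunk) throws IOException
            {
                ByteBuffer uncompressed = ByteBuffer.allocateDirect(Snappy.uncompressedLength(compressedChunk));
                // Decompresses straight from the page-cache-backed mapping into the destination
                // buffer, with no intermediate byte[] copy.
                Snappy.uncompress(compressedChunk, uncompressed);
                return uncompressed;
            }
        }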

        Pavel Yaskevich added a comment -

        The cost of the syscall is negligible compared to the cost of the buffer copy and (in the worst case) the I/O.

        Vijay added a comment -

        What about the system calls and their overhead? http://stackoverflow.com/questions/5614206/buffered-randomaccessfile-java. The point I am trying to make is: consider the chunk as hot, something you would access often enough (rather than thinking of it as individual columns).

        Pavel Yaskevich added a comment -

        "and if it is mmapped it will at least have better performance (if you consider it as a hot compressed block and it is cached) as it might be from memory."

        Normal read()s and mmap'ed reads operate on the same page cache; the latter would be faster only because it avoids the copy into a user buffer. Since you make that copy anyway, I don't see why it would be noticeably faster than a normal read().

        Vijay added a comment -

        Hi Pavel, write()? I don't understand that part. (Are we talking about dirty pages?)

        Sure, if you have flash drives or disks that are faster than memory there won't be any benefit, but in general it increases the I/O throughput and helps with latency. Yes, the copying will be unavoidable, and it is no different from the existing disk reads: we have to read the blocks to uncompress them, so it isn't drastically different, and you are going to do I/O for the skipped bytes in both cases, and if it is mmapped it will at least have better performance (if you consider it as a hot compressed block and it is cached) as it might be from memory. In our environment going to disk (AWS) is really bad, especially with the amount of disk activity compressed CFs generate, which makes them unusable in the current state.

        BTW: I think I can remove the copying in reBuffer(), if that's the only concern. I have to test it to see if it will be safe.

        Pavel Yaskevich added a comment -

        With read()/write() operations, data gets copied twice - from disk to the page cache and from the page cache to the user buffer; the benefit of mmap is that it lets us skip the second copy by mapping the file contents directly into the process address space. But if you take a look at CompressedFileDataInput.reBuffer() you will see that that second copy is still made. I don't see how mmap'ed I/O gives us any real benefit when operating on compressed files.
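
        To illustrate the two paths being contrasted here, a small self-contained sketch (placeholder file name, not code from the patch): read() pays the extra page-cache-to-user-buffer copy, while map() hands back a window onto the page cache itself.

        import java.nio.ByteBuffer;
        import java.nio.MappedByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.file.Paths;
        import java.nio.file.StandardOpenOption;

        public class ReadVsMmapSketch
        {
            public static void main(String[] args) throws Exception
            {
                try (FileChannel ch = FileChannel.open(Paths.get("Data.db"), StandardOpenOption.READ))
                {
                    int len = (int) Math.min(65536, ch.size());

                    // Path 1: read() - disk to page cache (first copy), then page cache to
                    // this user-space buffer (second copy).
                    ByteBuffer userBuffer = ByteBuffer.allocate(len);
                    ch.read(userBuffer, 0);

                    // Path 2: mmap - the buffer is a view of the page cache, so the second
                    // copy disappears... unless the caller copies the bytes out again
                    // afterwards (as reBuffer() does), which gives the saving back.
                    MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
                    if (mapped.hasRemaining())
                        System.out.println(mapped.get(0));
                }
            }
        }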

        Vijay added a comment -

        The attached patch allows mmapped I/O on compressed SSTables. We basically ignore the boundaries and split the files based on the chunks.

        Vijay added a comment -

        This gets complicated because the row boundary is not the chunk boundary. I tried to map more blocks than needed for a given boundary, but there is a possibility of running out of memory, hence I am planning to do the following:

        Make boundaries on the chunk instead of the row position
        Chunks between the boundaries will overlap.
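
        A hypothetical sketch of what "boundaries on the chunk instead of the row position" could look like; the class, field, and method names are invented for illustration and are not the attached patch.

        // Segment (mmap) boundaries fall on compressed-chunk starts rather than on row
        // positions; a row straddling a boundary is handled by letting adjacent segments
        // overlap by up to one chunk.
        public final class ChunkAlignedBoundaries
        {
            private final long[] chunkOffsets; // on-disk offset of each compressed chunk
            private final int chunkLength;     // uncompressed bytes per chunk

            public ChunkAlignedBoundaries(long[] chunkOffsets, int chunkLength)
            {
                this.chunkOffsets = chunkOffsets;
                this.chunkLength = chunkLength;
            }

            // On-disk offset of the chunk that contains the given uncompressed position;
            // a segment starting here never splits a chunk.
            public long segmentStartFor(long uncompressedPosition)
            {
                int chunkIndex = (int) (uncompressedPosition / chunkLength);
                return chunkOffsets[chunkIndex];
            }
        }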


          People

          • Assignee:
            Vijay
            Reporter:
            Vijay
            Reviewer:
            Yuki Morishita
          • Votes:
            0
            Watchers:
            1
