Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: CQL
    • Labels:
      None
    • Environment:

      Fedora11, JDK1.6.0_20

      Description

      To improve Cassandra performance, I implemented the Cassandra RPC part with MessagePack. The implementation details are in the attached patch, which works on Cassandra 0.7.0-beta3. Please check it.

      MessagePack is a cross-language object serialization library like Thrift and Protocol Buffers, but it is much faster, smaller, and easier to implement. MessagePack reduces serialization cost and data size on the network and on disk.
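
To illustrate where the size savings come from, here is a minimal sketch of an encoder for a small subset of the MessagePack format. It is not the code from the attached patch, and it follows the type bytes of the current MessagePack specification (fixint, uint 8/16/32, fixstr, fixarray, fixmap), which may differ in naming from the 2010 version. The `pack` function and the sample column values are assumptions made for illustration only.

```python
import struct


def pack(obj):
    """Encode a small subset of Python values as MessagePack bytes.

    Illustrative only: supports None, bool, non-negative ints below 2**32,
    UTF-8 strings up to 31 bytes, and lists/dicts with up to 15 entries.
    """
    if obj is None:
        return b"\xc0"
    if obj is True:          # bool must be checked before int
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int):
        if obj < 0 or obj > 0xFFFFFFFF:
            raise ValueError("integer out of range for this sketch")
        if obj <= 0x7F:
            return struct.pack("B", obj)             # positive fixint: one byte
        if obj <= 0xFF:
            return b"\xcc" + struct.pack("B", obj)   # uint 8
        if obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)  # uint 16
        return b"\xce" + struct.pack(">I", obj)      # uint 32
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) > 31:
            raise ValueError("string too long for this sketch")
        return bytes([0xA0 | len(data)]) + data      # fixstr: length in the type byte
    if isinstance(obj, list):
        if len(obj) > 15:
            raise ValueError("list too long for this sketch")
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("dict too large for this sketch")
        out = bytes([0x80 | len(obj)])               # fixmap: entry count in the type byte
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise TypeError("unsupported type: " + type(obj).__name__)


# Hypothetical column-like value; the field names and numbers are made up.
column = {"name": "age", "value": 30, "timestamp": 1287000000}
print(len(pack(column)))  # prints 32
```

Small integers and short strings cost one byte of overhead or none at all, which is why payloads made of many small values tend to shrink the most.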

      MessagePack websites are

      The performance of the data serialization library is one of the most important issues when developing a distributed database in Java. If serialization is slow, it significantly reduces overall database performance, and Java's GC also runs more often. Cassandra has this problem as well.

      To reduce the size of the data sent over the network between a client and Cassandra, I prototyped the Cassandra RPC part with MessagePack and MessagePack-RPC. The implementation is very simple: MessagePack-RPC reuses the existing Thrift-based CassandraServer (org.apache.cassandra.thrift.CassandraServer) while using MessagePack's communication protocol and data serialization.
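
The idea of reusing an existing service class behind a different RPC protocol can be sketched as follows. This is a simplified illustration, not the code in the patch: `ExistingHandler` is a hypothetical stand-in for a class such as CassandraServer, and `dispatch` is an assumed helper name. The message layouts, request `[0, msgid, method, params]` and response `[1, msgid, error, result]`, follow the MessagePack-RPC specification; serialization itself is left out, so messages are shown as already-decoded lists.

```python
REQUEST, RESPONSE = 0, 1  # message type codes used by MessagePack-RPC


class ExistingHandler:
    """Hypothetical stand-in for an existing service class."""

    def __init__(self):
        self._store = {}

    def insert(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store[key]


def dispatch(handler, message):
    """Route one decoded request to the handler and build the response.

    The handler itself is unchanged; only the protocol layer in front
    of it is different.
    """
    msg_type, msgid, method, params = message
    if msg_type != REQUEST:
        raise ValueError("not a request message")
    func = None if method.startswith("_") else getattr(handler, method, None)
    if not callable(func):
        return [RESPONSE, msgid, "unknown method: " + method, None]
    try:
        return [RESPONSE, msgid, None, func(*params)]
    except Exception as exc:  # report handler failures in the error slot
        return [RESPONSE, msgid, repr(exc), None]


handler = ExistingHandler()
print(dispatch(handler, [REQUEST, 1, "insert", ["k", "v"]]))
print(dispatch(handler, [REQUEST, 2, "get", ["k"]]))
```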

      Major features of MessagePack-RPC are

      The attached patch includes a ring cache program for MessagePack and its test program.
      You can use them to check the behavior of the Cassandra RPC with MessagePack.

      Thanks in advance,

        Issue Links

          Activity

          jbellis Jonathan Ellis added a comment -

          Gary did some tests in CASSANDRA-1765 and found no significant advantage over Thrift. Given that, and our brief experience supporting a second rpc protocol (Avro in the 0.7 series), I don't think this is going anywhere.

          parl Parlo Mendez added a comment -

          The last post was some time ago. What is the current status of the MessagePack implementation in Cassandra? I think it would be very nice.

          Parlo

          muga_nishizawa Muga Nishizawa added a comment -

          Hi T Jake Luciani,

          I would like to let you know that we have resolved the license issues with MessagePack.

          As you pointed out earlier, MessagePack used to require XNIO (LGPL) for network communication. We have replaced XNIO with Apache MINA (Apache License) in MessagePack. Javassist, which was the other issue, is dual-licensed (LGPL and MPL) and is used by other Apache products under the MPL.

          So we believe that the license-related issues are resolved at the moment.

          Please see the URLs below for more details.
          https://github.com/msgpack/msgpack/
          https://github.com/msgpack/msgpack-rpc/

          tjake T Jake Luciani added a comment -

          It appears msgpack requires Javassist and XNIO, both of which are LGPL.

          This means we can't include msgpack support in our distribution; see http://www.apache.org/legal/3party.html

          jbellis Jonathan Ellis added a comment -

          Gary wrote some performance tests in CASSANDRA-1765 and saw MessagePack perform worse than Thrift. Is something wrong with his code?

          muga_nishizawa Muga Nishizawa added a comment -

          In addition, I compared the amount of data transferred over the network with the MessagePack protocol and with Thrift.

          (Summary)

          The amount of data transferred using MessagePack (or its protocol) is more than 20% smaller than with Thrift.

          (Test environment)

          I used ifconfig on the machine where the Cassandra node runs. While a client program accessed the Cassandra node, I monitored the RX (received) and TX (transmitted) byte counts displayed by ifconfig. The client program was based on ring cache and executed random read and write requests 10,000 times.

          (Results)

          • Random read with MessagePack (RX: 1722828 bytes, TX: 1369345 bytes)
          • Random write with MessagePack (RX: 1831990 bytes, TX: 1228501 bytes)
          • Random read with Thrift (RX: 2232822 bytes, TX: 1987473 bytes)
          • Random write with Thrift (RX: 2522280 bytes, TX: 1607606 bytes)

          Of course, the objects stored in Cassandra and their sizes vary by user. In this evaluation the data I used was small, so MessagePack significantly reduced the amount of transferred data compared to Thrift.
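
The per-measurement reduction can be worked out directly from the byte counts listed in the results above. The short sketch below does that arithmetic; `reduction_percent` is a helper name introduced here for illustration. With these numbers, each measurement comes out between roughly 23% and 31% fewer bytes for MessagePack.

```python
# Byte counts copied from the results above, as (RX, TX) per workload.
measurements = {
    "random read": {"msgpack": (1722828, 1369345), "thrift": (2232822, 1987473)},
    "random write": {"msgpack": (1831990, 1228501), "thrift": (2522280, 1607606)},
}


def reduction_percent(msgpack_bytes, thrift_bytes):
    """Percentage by which the MessagePack byte count is below the Thrift one."""
    return 100.0 * (1.0 - msgpack_bytes / thrift_bytes)


for workload, data in measurements.items():
    for index, direction in enumerate(("RX", "TX")):
        pct = reduction_percent(data["msgpack"][index], data["thrift"][index])
        print(f"{workload} {direction}: {pct:.1f}% fewer bytes with MessagePack")
```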

          muga_nishizawa Muga Nishizawa added a comment - - edited

          Jonathan,

          Thanks for your response.

          >What kind of performance improvement do you see with this patch?

          This patch provides the following performance improvements:

          • Reduced serialization cost and data size
          • Increased throughput between clients and a Cassandra node

          I have measured MessagePack's performance in terms of both serialization cost and throughput. Details are below.

          == Reduction of serialization cost and the data size ==

          (Summary)
          In the test below, MessagePack proved better at reducing serialization cost and data size than the other serialization libraries.

          (Test environment)
          I used "jvm-serializers", a well-known benchmark, to compare performance with Protocol Buffers, Thrift, and Avro. The machine used for this benchmark has a Core2 Duo 2GHz with 1GB RAM.

          (Results)
                     create    ser  +same  deser  +shal  +deep  total  size  +dfl
          protobuf      683   6016   2973   3338   3454   3759   9775   239   149
          thrift        572   6287   5565   3479   3616   3770  10057   349   197
          msgpack       291   4935   4750   3468   3545   3708   8748   236   150
          avro         2698   6409   3623   7480   9301  10481  16890   221   133

          (Comments)
          It might be better to compare serialization cost using Cassandra objects such as a Column object. But such objects and their sizes vary by user, so they are not suitable for a general comparison of serialization cost. According to the results above, MessagePack's serialized data is slightly larger than Avro's, but MessagePack has a significantly lower serialization cost than Avro and Thrift.
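
As a quick check of the comparison above, the following sketch computes how MessagePack's "ser" and "size" columns differ from each other library, using only the numbers in the results table. `relative_change` is a helper name introduced for illustration.

```python
# (ser, size) columns per library, copied from the results table above.
results = {
    "protobuf": (6016, 239),
    "thrift": (6287, 349),
    "msgpack": (4935, 236),
    "avro": (6409, 221),
}


def relative_change(value, baseline):
    """Signed percentage difference of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


msgpack_ser, msgpack_size = results["msgpack"]
for name in ("protobuf", "thrift", "avro"):
    ser, size = results[name]
    print(
        f"msgpack vs {name}: "
        f"ser {relative_change(msgpack_ser, ser):+.1f}%, "
        f"size {relative_change(msgpack_size, size):+.1f}%"
    )
```

A negative value means MessagePack is lower (cheaper or smaller) than the other library; against Avro the size difference comes out positive, which matches the "slightly larger" remark.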

          == Increasing throughput ==

          (Summary)
          I compared MessagePack based RPC of Cassandra to that of Thrift. Random read throughput of MessagePack based RPC is 15% higher than that of Thrift and random write throughput is 21% higher.

          (Test environment)
          In this evaluation, a standalone Cassandra node ran on a machine with a Core2 Duo 2GHz and 1GB RAM. The client programs ran on two machines, each with a Core2 Duo 2GHz and 1GB RAM. The client program was based on ring cache; it created 100 threads per JVM on each machine and accessed the Cassandra node through the ring cache.

          (Results)

          • Thrift-based RPC part of Cassandra (read: 5,200 queries/sec, write: 11,200 queries/sec)
          • MessagePack-based RPC part of Cassandra (read: 6,000 queries/sec, write: 13,600 queries/sec)

          (Comments)
          I measured the maximum throughput of random access (read/write) after 100 small items had been stored in the Cassandra node. I did this to make the CPU the bottleneck on the Cassandra node; if the node were bottlenecked on disk I/O, I could not properly evaluate the maximum throughput of the RPC part.

          I did not directly measure the amount of data transferred over the network during this evaluation. But based on the jvm-serializers benchmark results, I believe that MessagePack-based Cassandra would transfer less data than Thrift-based Cassandra.

          terjem Terje Marthinussen added a comment -

          I am very curious how the serialization in MessagePack would compare with the serialization used on the data side of Cassandra (SSTables), and how we could benefit from having the same serialization in both places.

          Does anyone have any thoughts?

          jbellis Jonathan Ellis added a comment -

          Thanks, this is exciting!

          What kind of performance improvement do you see with this patch?

          muga_nishizawa Muga Nishizawa added a comment -

          1) Cassandra RPC with MessagePack
          2) dependency libraries


            People

            • Assignee:
              Unassigned
              Reporter:
              muga_nishizawa Muga Nishizawa
            • Votes:
              0
              Watchers:
              12
