Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We could save some sstable space by encoding longs and ints as vlong and vint, respectively. (Probably most "short" lengths would be better as vint as well.)
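As a minimal illustration of the idea, the sketch below encodes a long in a variable number of bytes using zigzag mapping plus 7-bit groups (LEB128-style). This is an assumed, generic scheme chosen for brevity; it is not the Hadoop WritableUtils wire format and not necessarily what Cassandra ended up using. The class name VarIntSketch is hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only: zigzag + 7-bit groups, not the Hadoop or Cassandra format.
final class VarIntSketch {
    // Encodes a signed long in 1 to 10 bytes; small magnitudes take fewer bytes.
    static byte[] encode(long value) {
        // Zigzag: maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small negatives stay small.
        long v = (value << 1) ^ (value >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream(10);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            v >>>= 7;
        }
        out.write((int) v); // final group, continuation bit clear
        return out.toByteArray();
    }

    // Decodes a value produced by encode().
    static long decode(byte[] data) {
        long v = 0;
        int shift = 0;
        for (byte b : data) {
            if (shift > 63) {
                throw new IllegalArgumentException("malformed varint");
            }
            v |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        // Undo the zigzag mapping.
        return (v >>> 1) ^ -(v & 1);
    }
}
```

With this scheme, values in the range -64 to 63 occupy a single byte instead of the eight bytes of a fixed-width long, while very large magnitudes can take up to ten bytes.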

        Issue Links

          Activity

          Jonathan Ellis created issue -
          Jonathan Ellis added a comment -

          We can borrow the vint/vlong implementation from http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableUtils.html.

          The tricky part: these are used in the Messages as well. Either we need to version the new Message serialization, or we need to keep Message (body) serialization on the old format and use the new one only for sstables.

          Jonathan Ellis made changes -
          Field Original Value New Value
          Assignee Pavel Yaskevich [ xedin ]
          Jonathan Ellis added a comment -

          For SSTable serialization start with SSTableWriter.append and follow the tendrils from there.

          For Messages, we're concerned only (I think) with RowSerializer and the serializers it touches.

          For versioning look at Descriptor for the sstables, and MessagingService.version for the Messages. (All? the serializers used for Messages should already have a version parameter, e.g. RowSerializer.serialize, but if you run across one that does not, go ahead and add it.)
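The version-parameter pattern described above might look roughly like the following sketch. Everything here is hypothetical: the class VersionedIntSerializer, the constants VERSION_FIXED and VERSION_VINT, and the unsigned 7-bit-group encoding are assumptions for illustration, not the actual RowSerializer, Descriptor, or MessagingService code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of a serializer that picks its wire format from a version argument.
final class VersionedIntSerializer {
    static final int VERSION_FIXED = 1; // assumed: old format, 4-byte ints
    static final int VERSION_VINT = 2;  // assumed: first version that writes vints

    static void serialize(int value, DataOutput out, int version) throws IOException {
        if (version < VERSION_VINT) {
            out.writeInt(value); // legacy fixed-width encoding
            return;
        }
        // Unsigned 7-bit groups; intended for non-negative lengths and counts.
        int v = value;
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }

    static int deserialize(DataInput in, int version) throws IOException {
        if (version < VERSION_VINT) {
            return in.readInt();
        }
        int v = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.readUnsignedByte();
            v |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return v;
            }
        }
        throw new IOException("malformed vint");
    }

    // Convenience helpers for trying the sketch out in memory.
    static byte[] toBytes(int value, int version) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            serialize(value, new DataOutputStream(bytes), version);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static int fromBytes(byte[] data, int version) {
        try {
            return deserialize(new DataInputStream(new ByteArrayInputStream(data)), version);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the design is that old data stays readable: a reader passes the version recorded with the data (sstable descriptor or peer messaging version), while a writer passes whichever version it is producing.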

          Jonathan Ellis made changes -
          Summary: "sstable varint encoding" changed to "sstable and message varint encoding"
          Vladimir Loncar added a comment -

          I will take a stab at this.

          Vladimir Loncar added a comment -

          Implementing this requires changing the (de)serialize methods to use VInt and VLong. While this works on new (empty) nodes, it of course breaks deserialization of existing SSTables (and hence most tests currently fail). What is the preferred approach to versioning serialization? I looked into adding a new version to Descriptor, but I then need to pass that information to the serializers somehow. I thought about adding a descriptor parameter to them, so that the serializers know whether to use readInt(Long) or readVInt(VLong). Also, should I only be worried about deserialization, or about serialization as well? (If serialization always uses the latest version, we shouldn't need to worry about it?)

          Vladimir Loncar added a comment -

          Unfortunately, due to time constraints, I am unable to continue working on this ticket. If anyone wishes to take over, feel free to do so. If not, I will try to find time in October to complete this.

          Jonathan Ellis made changes -
          Fix Version/s 1.1 [ 12317615 ]
          Fix Version/s 1.0 [ 12316349 ]
          Terje Marthinussen added a comment -

          I have ported stuff related to handling this for columns/supercolumns (requires changes to size calculations as well) including more dense timestamp handling.

          I took a quick look at the sstables as well as other places where we write shorts/ints/longs, and I started to realize that modifying all of this is quite a bit of work.

          What about changing RandomAccessReader to take something like (Descriptor desc, ...) rather than (File file, ...) and adding backwards compatibility there for fixed-length/variable-length readInt/Long?

          On the write side, modify SequentialWriter with overrides for variable-length write methods.

          Any objections to doing this at such a "low" level?

          Jonathan Ellis added a comment -

          Interesting idea. Seems like that might get tricky though where we are going from socket to file or vice versa in Streaming. And some places don't make sense to use vints, although the only ones I can think of are in checksums.

          To take a step back: how much does vint encoding actually buy us, with compression enabled?

          Vladimir Loncar added a comment -

          "What about changing RandomAccessReader to take something like (Descriptor desc, ...) rather than (File file, ...) and adding backwards compatibility there for fixed-length/variable-length readInt/Long?"

          That's the approach I investigated. The first obstacle I came across was that you need to override readInt(Long), which is declared final. And since this affects other places where it might not be needed (as Jonathan mentioned, streaming is one such area), it required more thought to implement correctly.
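One possible way around the final readInt/readLong methods, sketched here under assumptions, is to wrap a DataInput rather than subclass the reader. The class FormatAwareReader, its variableLength flag (which would presumably be derived from the sstable Descriptor version), and the unsigned 7-bit-group encoding are all hypothetical and are not taken from any patch on this ticket.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical wrapper: chooses fixed or variable-length decoding by composition,
// so no final method on the underlying reader has to be overridden.
final class FormatAwareReader {
    private final DataInput in;
    private final boolean variableLength; // assumed to come from the file's format version

    FormatAwareReader(DataInput in, boolean variableLength) {
        this.in = in;
        this.variableLength = variableLength;
    }

    // Reads a long in whichever format the underlying file was written with.
    long readLong() throws IOException {
        if (!variableLength) {
            return in.readLong();
        }
        long v = 0;
        for (int shift = 0; shift < 70; shift += 7) { // at most 10 groups of 7 bits
            int b = in.readUnsignedByte();
            v |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return v;
            }
        }
        throw new IOException("malformed vlong");
    }

    // Always fixed width, for fields such as checksums where vints make no sense.
    long readFixedLong() throws IOException {
        return in.readLong();
    }

    // Convenience helper for trying the sketch out on an in-memory buffer.
    static long readOne(byte[] data, boolean variableLength) {
        try {
            DataInput input = new DataInputStream(new ByteArrayInputStream(data));
            return new FormatAwareReader(input, variableLength).readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Callers that must stay fixed-width (streaming, checksums) would keep using the underlying reader or readFixedLong, so the format decision stays explicit at each call site.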

          Terje Marthinussen added a comment -

          For my test data (a real-life dataset using supercolumns), I got a 10% additional size reduction on top of compression by using vlq-coded numbers for supercolumns/columns, together with column timestamps stored relative to the supercolumns rather than as absolute timestamps.

          I already have some code for row-relative timestamps, but have not tested it yet.
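To see why relative timestamps matter for variable-length encoding, the following sketch compares encoded sizes for absolute microsecond timestamps against deltas from a shared base. The class RelativeTimestampSketch and the zigzag plus 7-bit-group size calculation are assumptions for illustration; the actual encoding used in the ported code is not shown in this ticket.

```java
// Hypothetical size comparison: absolute timestamps vs. deltas from a shared base.
final class RelativeTimestampSketch {
    // Number of bytes an unsigned value needs when written in 7-bit groups.
    static int vlqSize(long v) {
        int size = 1;
        while ((v >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    // Zigzag mapping so that small negative deltas also encode compactly.
    static long zigzag(long v) {
        return (v << 1) ^ (v >> 63);
    }

    // Total bytes when every timestamp is encoded as an absolute value.
    static int absoluteSize(long[] timestamps) {
        int total = 0;
        for (long t : timestamps) {
            total += vlqSize(zigzag(t));
        }
        return total;
    }

    // Total bytes when the base is written once and each timestamp as a delta from it.
    static int relativeSize(long base, long[] timestamps) {
        int total = vlqSize(zigzag(base));
        for (long t : timestamps) {
            total += vlqSize(zigzag(t - base));
        }
        return total;
    }
}
```

For a base of 1,300,000,000,000,000 microseconds and three columns written at the base, 5 and 1000 microseconds later, the absolute encoding needs 24 bytes (no better than fixed-width longs), whereas the relative encoding needs 12 bytes including the base.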

          Terje Marthinussen added a comment -

          As a side note, are there any problems with vlq-encoding some "unneeded" values? If they are that rare, they probably have no real impact on performance?

          I did not run any real benchmarks yet, but json2sstable actually ran a tiny bit faster (1-2%) with vlq encoding and relative timestamps.

          Jonathan Ellis added a comment -

          Are you still working on a patch for this, Terje?

          Sylvain Lebresne made changes -
          Fix Version/s 1.1.1 [ 12319857 ]
          Fix Version/s 1.1 [ 12317615 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.2 [ 12319262 ]
          Fix Version/s 1.1.1 [ 12319857 ]
          Vijay made changes -
          Assignee Vijay [ vijay2win@yahoo.com ]
          Jonathan Ellis made changes -
          Fix Version/s 1.3 [ 12322954 ]
          Fix Version/s 1.2.0 [ 12319262 ]
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12627052 ] patch-available, re-open possible [ 12752946 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12752946 ] reopen-resolved, no closed status, patch-avail, testing [ 12755627 ]
          Jonathan Ellis made changes -
          Fix Version/s 2.1 [ 12324159 ]
          Fix Version/s 2.0 [ 12322954 ]
          Sylvain Lebresne made changes -
          Fix Version/s 2.1 beta2 [ 12326276 ]
          Fix Version/s 2.1 [ 12324159 ]
          Sylvain Lebresne made changes -
          Fix Version/s 3.0 [ 12324945 ]
          Fix Version/s 2.1 beta2 [ 12326276 ]
          T Jake Luciani made changes -
          Fix Version/s 3.x [ 12328789 ]
          Fix Version/s 3.0 [ 12324945 ]
          Aleksey Yeschenko made changes -
          Component/s Core [ 12312978 ]
          Aleksey Yeschenko added a comment -

          With CASSANDRA-8099 and a bunch of follow-up tickets we now use varint encoding in most places where it makes sense.

          Aleksey Yeschenko made changes -
          Link This issue duplicates CASSANDRA-8099 [ CASSANDRA-8099 ]
          Aleksey Yeschenko made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 3.x [ 12328789 ]
          Assignee Vijay [ vijay2win@yahoo.com ]
          Resolution Fixed [ 1 ]
          Transition: Open to Resolved
          Time In Source Status: 1679d 15h 21m
          Execution Times: 1
          Last Executer: Aleksey Yeschenko
          Last Execution Date: 18/Mar/16 14:11

            People

            • Assignee:
              Unassigned
              Reporter:
              Jonathan Ellis
            • Votes:
              0
              Watchers:
              5

              Dates

              • Created:
                Updated:
                Resolved: