Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Fix Version/s: 3.0
    • Component/s: Core
    • Labels:
      None

      Description

      We could save some sstable space by encoding longs and ints as vlong and vint, respectively. (Probably most "short" lengths would be better as vint as well.)
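
To illustrate where the saving comes from, here is a minimal sketch of a variable-length long encoding. It assumes a LEB128-style base-128 layout with a ZigZag mapping for negative values, which is one common scheme and not necessarily the format this ticket would adopt (the Hadoop WritableUtils format mentioned in the comments differs). The class name VarLongSketch and its helpers are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper for illustration; not part of Cassandra or Hadoop.
final class VarLongSketch {
    // Write a signed long as ZigZag-mapped base-128 groups, low group first.
    static void writeVLong(DataOutputStream out, long value) throws IOException {
        long v = (value << 1) ^ (value >> 63); // ZigZag: small magnitudes map to small unsigned values
        while ((v & ~0x7FL) != 0) {
            out.writeByte((int) ((v & 0x7F) | 0x80)); // 7 payload bits, continuation bit set
            v >>>= 7;
        }
        out.writeByte((int) v); // last group, continuation bit clear
    }

    // Read a value written by writeVLong.
    static long readVLong(DataInputStream in) throws IOException {
        long v = 0;
        int shift = 0;
        while (true) {
            int b = in.readUnsignedByte();
            v |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return (v >>> 1) ^ -(v & 1); // undo the ZigZag mapping
    }

    static byte[] encode(long value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeVLong(new DataOutputStream(bytes), value);
        return bytes.toByteArray();
    }

    static long decode(byte[] data) throws IOException {
        return readVLong(new DataInputStream(new ByteArrayInputStream(data)));
    }
}
```

With this scheme small values take one or two bytes instead of a fixed eight, while values near Long.MIN_VALUE or Long.MAX_VALUE take ten, so the benefit depends on the values being small most of the time.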

        Activity

        Jonathan Ellis created issue -
        Jonathan Ellis added a comment -

        We can borrow the vint/vlong implementation from http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableUtils.html.

        The tricky part: these are used in the Messages as well. Either we need to version new Message serialization, or we need to keep Message (body) serialization on the old format and keep the new one for sstables.

        Jonathan Ellis made changes -
        Field Original Value New Value
        Assignee Pavel Yaskevich [ xedin ]
        Jonathan Ellis added a comment -

        For SSTable serialization start with SSTableWriter.append and follow the tendrils from there.

        For Messages we're only? concerned with RowSerializer and the ones it touches.

        For versioning look at Descriptor for the sstables, and MessagingService.version for the Messages. (All? the serializers used for Messages should already have a version parameter, e.g. RowSerializer.serialize, but if you run across one that does not, go ahead and add it.)
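
A minimal sketch of what a version-gated serializer could look like, assuming the version is passed to each serialize/deserialize call as described above. The class VersionedLengthSketch, the two version constants, and the LEB128-style encoding are all hypothetical illustrations; they are not taken from Descriptor, MessagingService, or RowSerializer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical serializer for a single non-negative length field.
final class VersionedLengthSketch {
    static final int VERSION_FIXED = 1;  // assumed: older format, 4-byte ints
    static final int VERSION_VARINT = 2; // assumed: newer format, variable-length ints

    static void serializeLength(int length, DataOutput out, int version) throws IOException {
        if (length < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        if (version < VERSION_VARINT) {
            out.writeInt(length); // old peers and old sstables keep the fixed layout
            return;
        }
        int v = length;
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80); // 7 payload bits, continuation bit set
            v >>>= 7;
        }
        out.writeByte(v);
    }

    static int deserializeLength(DataInput in, int version) throws IOException {
        if (version < VERSION_VARINT) {
            return in.readInt();
        }
        int v = 0;
        for (int shift = 0; ; shift += 7) {
            int b = in.readUnsignedByte();
            v |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return v;
            }
        }
    }

    static byte[] toBytes(int length, int version) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        serializeLength(length, new DataOutputStream(bytes), version);
        return bytes.toByteArray();
    }

    static int fromBytes(byte[] data, int version) throws IOException {
        return deserializeLength(new DataInputStream(new ByteArrayInputStream(data)), version);
    }
}
```

The same pair of methods then serves both cases: the sstable path would pass a version derived from the file's descriptor, and the messaging path would pass the version negotiated with the peer.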

        Jonathan Ellis made changes -
        Summary: changed from "sstable varint encoding" to "sstable and message varint encoding"
        Vladimir Loncar added a comment -

        I will take a stab at this.

        Vladimir Loncar added a comment -

        Implementing this requires changing the (de)serialize methods to use VInt and VLong. While this works on new (empty) nodes, it of course breaks deserialization of existing SSTables (and hence most tests currently fail). What is the preferred approach to versioning serialization? I looked into adding a new version to Descriptor, but I then need to pass that information to the serializers somehow. I thought about adding a descriptor parameter to them, so that serializers know whether to use readInt(Long) or readVInt(VLong). Also, should I only be worried about deserialization, or also serialization? (If serialization always uses the latest version, we shouldn't need to worry about it?)

        Vladimir Loncar added a comment -

        Unfortunately, due to time constraints, I am unable to continue working on this ticket. If anyone wishes to take over, feel free to do so. If not, I will try to find time in October to complete this.

        Jonathan Ellis made changes -
        Fix Version/s 1.1 [ 12317615 ]
        Fix Version/s 1.0 [ 12316349 ]
        Terje Marthinussen added a comment -

        I have ported stuff related to handling this for columns/supercolumns (requires changes to size calculations as well) including more dense timestamp handling.

        I was looking quickly at the sstables as well as other places where we write shorts/ints/longs, and I started realizing that modifying all of this is quite a bit of work.

        What about changing RandomAccessReader to take something like (Descriptor desc, ...) rather than (File file, ...) and adding backwards compatibility there for fixed-length/variable-length readInt/Long?

        On the write side, modify SequentialWriter with overrides for variable length write methods.

        Any objections against doing this at such a "low" level?

        Jonathan Ellis added a comment -

        Interesting idea. Seems like that might get tricky though where we are going from socket to file or vice versa in Streaming. And some places don't make sense to use vints, although the only ones I can think of are in checksums.

        To take a step back: how much does vint encoding actually buy us, with compression enabled?

        Vladimir Loncar added a comment -

        What about changing RandomAccessReader to take something like (Descriptor desc, ...) rather than (File file, ...) and adding backwards compatibility there for fixed-length/variable-length readInt/Long?

        That's the approach I investigated. The first obstacle I came across was that you need to override readInt(Long), which is declared final, and since this impacts other places (as Jonathan mentioned, streaming is one area) where it might not be needed, it requires more thought to implement correctly.
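
One way around the final readInt() is composition instead of overriding. The sketch below wraps any DataInput and exposes a new, format-aware method next to an always-fixed one, so callers such as checksum or streaming code can opt out. The class FormatAwareReaderSketch, its method names, the boolean flag (which would be derived from the sstable version), and the LEB128-style decoding are all hypothetical; this is not how RandomAccessReader is actually structured.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical wrapper; names and the flag are illustrative only.
final class FormatAwareReaderSketch {
    private final DataInput in;
    private final boolean variableLength; // assumed to come from the file's format version

    FormatAwareReaderSketch(DataInput in, boolean variableLength) {
        this.in = in;
        this.variableLength = variableLength;
    }

    // A new method rather than an override, because readInt() is final
    // in java.io.RandomAccessFile and java.io.DataInputStream.
    int readIntField() throws IOException {
        if (!variableLength) {
            return in.readInt();
        }
        int value = 0;
        int shift = 0;
        while (true) {
            int b = in.readUnsignedByte();
            value |= (b & 0x7F) << shift; // accumulate 7 payload bits per byte
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }

    // Fields that should never be variable-length (for example checksums) use this.
    int readFixedInt() throws IOException {
        return in.readInt();
    }

    static FormatAwareReaderSketch over(byte[] data, boolean variableLength) {
        return new FormatAwareReaderSketch(new DataInputStream(new ByteArrayInputStream(data)), variableLength);
    }
}
```

The cost of this design is that every call site has to choose between the two methods explicitly, which is the same audit of call sites that the lower-level approach was hoping to avoid.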

        Terje Marthinussen added a comment -

        For my test data (real life dataset using supercolumns) I got 10% additional size reduction over compression using vlq coded numbers on supercolumns/columns as well as column timestamps which are relative to the supercolumns rather than absolute timestamps.

        I already have some code for row-relative timestamps, but have not tested it yet.
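
A minimal sketch of why relative timestamps combine well with variable-length encoding, assuming timestamps are stored as offsets from a base value (for example the enclosing supercolumn's or row's timestamp) and sized with a ZigZag base-128 scheme. The class RelativeTimestampSketch and its methods are hypothetical and are not the code referred to in this comment.

```java
// Hypothetical illustration; not taken from any patch on this ticket.
final class RelativeTimestampSketch {
    // Number of bytes a ZigZag base-128 encoding would need for this value.
    static int vlongSize(long value) {
        long v = (value << 1) ^ (value >> 63);
        int size = 1;
        while ((v & ~0x7FL) != 0) {
            size++;
            v >>>= 7;
        }
        return size;
    }

    // Replace each absolute timestamp with its offset from the base.
    static long[] toRelative(long base, long[] timestamps) {
        long[] relative = new long[timestamps.length];
        for (int i = 0; i < timestamps.length; i++) {
            relative[i] = timestamps[i] - base;
        }
        return relative;
    }

    // Recover the absolute timestamps from the stored offsets.
    static long[] toAbsolute(long base, long[] relative) {
        long[] timestamps = new long[relative.length];
        for (int i = 0; i < relative.length; i++) {
            timestamps[i] = base + relative[i];
        }
        return timestamps;
    }

    // Total encoded size of all values.
    static int encodedSize(long[] values) {
        int total = 0;
        for (long value : values) {
            total += vlongSize(value);
        }
        return total;
    }
}
```

Absolute microsecond timestamps are large numbers, so a variable-length encoding alone saves little on them; the offsets are usually small, which is where the extra reduction would come from.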

        Terje Marthinussen added a comment -

        As a side note, are there any problems with vlq encoding some "unneeded" values? If they are that rare, they probably have no real impact on performance.

        I did not run any real benchmarks yet, but json2sstable actually ran a tiny bit faster (1-2%) with vlq encoding and relative timestamps.

        Jonathan Ellis added a comment -

        Are you still working on a patch for this, Terje?

        Sylvain Lebresne made changes -
        Fix Version/s 1.1.1 [ 12319857 ]
        Fix Version/s 1.1 [ 12317615 ]
        Jonathan Ellis made changes -
        Fix Version/s 1.2 [ 12319262 ]
        Fix Version/s 1.1.1 [ 12319857 ]
        Vijay made changes -
        Assignee Vijay [ vijay2win@yahoo.com ]
        Jonathan Ellis made changes -
        Fix Version/s 1.3 [ 12322954 ]
        Fix Version/s 1.2.0 [ 12319262 ]
        Gavin made changes -
        Workflow no-reopen-closed, patch-avail [ 12627052 ] patch-available, re-open possible [ 12752946 ]
        Gavin made changes -
        Workflow patch-available, re-open possible [ 12752946 ] reopen-resolved, no closed status, patch-avail, testing [ 12755627 ]
        Jonathan Ellis made changes -
        Fix Version/s 2.1 [ 12324159 ]
        Fix Version/s 2.0 [ 12322954 ]
        Sylvain Lebresne made changes -
        Fix Version/s 2.1 beta2 [ 12326276 ]
        Fix Version/s 2.1 [ 12324159 ]
        Sylvain Lebresne made changes -
        Fix Version/s 3.0 [ 12324945 ]
        Fix Version/s 2.1 beta2 [ 12326276 ]

          People

          • Assignee:
            Vijay
            Reporter:
            Jonathan Ellis
          • Votes:
            0
            Watchers:
            5
