Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None

      Description

      The standard answer since forever has been "cassandra is a bad fit for large objects."

      But I think it doesn't have to be that way. With a few simplifying assumptions we can make this doable.

      First, screw Thrift. There is no way to specify a stream of bytes cross-platform. You can't mix raw sockets into Thrift very easily so screw it. Make it an internal-only API to start with, like the much-vaunted and much-feared BinaryVerbHandler.

      Second, forget about writing multiple lobs at once. You insert one lob at a time, to a specific column.

      With Thrift out of the equation we are not out of the woods. MessagingService also assumes that Messages will be memory resident and not streamed. One approach to fix this would be to have a StreamingMessage class that consists of a message id (that would be paired w/ origination endpoint to make it unique) and a size. The VerbHandler would keep a Map of incomplete StreamingMessages around until the full size was read. Then they could be disposed of.
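The reassembly scheme described above could be sketched roughly as follows. This is purely illustrative: `StreamingAssembler`, `StreamingMessage`, and the field names are hypothetical, not actual Cassandra or MessagingService classes.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

public class StreamingAssembler {
    static final class StreamingMessage {
        final String messageId;   // unique when paired with the origination endpoint
        final long expectedSize;  // declared up front by the sender
        final ByteArrayOutputStream buf = new ByteArrayOutputStream();

        StreamingMessage(String messageId, long expectedSize) {
            this.messageId = messageId;
            this.expectedSize = expectedSize;
        }

        boolean isComplete() {
            return buf.size() >= expectedSize;
        }
    }

    // incomplete StreamingMessages, kept around until the full size is read
    private final Map<String, StreamingMessage> incomplete = new HashMap<>();

    /** Feed a chunk; returns the full payload once the declared size is reached, else null. */
    public byte[] onChunk(String origin, String messageId, long expectedSize, byte[] chunk) {
        String key = origin + "/" + messageId; // origin + id makes the key unique
        StreamingMessage msg =
            incomplete.computeIfAbsent(key, k -> new StreamingMessage(messageId, expectedSize));
        msg.buf.write(chunk, 0, chunk.length);
        if (msg.isComplete()) {
            incomplete.remove(key); // disposed of once fully read
            return msg.buf.toByteArray();
        }
        return null;
    }

    public static void main(String[] args) {
        StreamingAssembler a = new StreamingAssembler();
        byte[] part = a.onChunk("10.0.0.1", "m1", 4, new byte[] {1, 2});
        byte[] full = a.onChunk("10.0.0.1", "m1", 4, new byte[] {3, 4});
        System.out.println(part == null); // true: not yet complete
        System.out.println(full.length);  // 4
    }
}
```

In a real implementation the buffer would be a file handle rather than an in-memory stream, since the whole point is to avoid making the payload memory resident.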

      So a LargeObjectCommand would be basically just the command id and the payload, the streamed lob. And we would handle it by streaming it directly to a file. When the stream was complete, we would do a write to the standard commitlog/memtable with a pointer to that lob file. That would then be flushed normally to the sstable. (This would require adding another boolean to Column serialization, whether the value is really a lob pointer. We could combine this with the existing bool into a single byte and have room for a couple more flags, without taking extra space.)
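The flag-byte idea in the parenthetical could look something like the sketch below; the flag names are made up for illustration and do not correspond to Cassandra's actual Column serialization fields.

```java
public class ColumnFlags {
    public static final byte DELETED     = 0x01; // the existing boolean
    public static final byte LOB_POINTER = 0x02; // the proposed "value is a lob pointer" flag
    // bits 0x04 through 0x80 remain free for future flags, at no extra space cost

    public static byte pack(boolean deleted, boolean lobPointer) {
        byte b = 0;
        if (deleted)    b |= DELETED;
        if (lobPointer) b |= LOB_POINTER;
        return b;
    }

    public static boolean isLobPointer(byte flags) {
        return (flags & LOB_POINTER) != 0;
    }

    public static void main(String[] args) {
        byte flags = pack(false, true);
        System.out.println(isLobPointer(flags)); // true
        System.out.println(flags);               // 2
    }
}
```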

      So lobs would never appear directly in the commitlog, and we would never have to rewrite them multiple times during compaction; just the pointers would get merged, but the lob files themselves would not have to be touched. (Except to remove them when a compaction shows that an older version is no longer needed.)

      Then of course we'd need a corresponding ReadLargeObject command. So the basics are straightforward.

      Read Repair and Hinted Handoff would add a few more wrinkles but nothing fundamentally challenging.

      Thoughts?

        Activity

        Jonathan Ellis created issue -
        Jonathan Ellis added a comment -

        After some more thought I came up with a straightforward (if clunky) way to support this in Thrift.

        (In my defense I note that it's already de rigueur to wrap the Thrift "client" in something more idiomatic, and that the [b]lob APIs of traditional databases bear more than a little resemblance to this.)

        You would add these methods (actual names subject to bikeshedding):

        begin_lob(key, columnPath, size, ts) returns thrift_lob_id
        repeat until sum(byte.length) == size:
            stream_lob(thrift_lob_id, byte[])
        commit_lob(thrift_lob_id) throws Bad Stuff

        These would map fairly directly to the StreamingMessage/LargeObjectCommand structures described above.
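A client-side upload loop built on those three calls might look like this. `LobClient` and its method signatures are assumptions sketched from the proposal above, not a real generated Thrift interface.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class LobUpload {
    /** Stand-in for the generated Thrift client; names follow the proposal above. */
    interface LobClient {
        long beginLob(String key, String columnPath, long size, long timestamp);
        void streamLob(long lobId, byte[] chunk);
        void commitLob(long lobId); // would throw "Bad Stuff" on failure
    }

    static void upload(LobClient client, String key, String columnPath, InputStream in, long size)
            throws IOException {
        long lobId = client.beginLob(key, columnPath, size, System.currentTimeMillis());
        byte[] buf = new byte[64 * 1024]; // stream in bounded chunks, never the whole lob
        long sent = 0;
        int n;
        while (sent < size && (n = in.read(buf)) != -1) {
            client.streamLob(lobId, Arrays.copyOf(buf, n));
            sent += n;
        }
        client.commitLob(lobId); // repeat until sum(chunk lengths) == size, then commit
    }
}
```

The explicit `commit_lob` gives the server an unambiguous point to verify that the declared size was actually received before writing the pointer to the commitlog.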

        Some opinions that I am not married to:

        • we don't need block_for since streaming adds enough latency already that we want to just assume block_for=N
        • having a separate commit_lob is less magic than having the final stream_lob behave a little differently from the others
        Jonathan Ellis added a comment -

        Stu Hood points out that we'd want to store a hash of the file inline in the SSTable as part of the lob pointer to make repair checks more efficient.

        We'd also want to make sure that the key is part of the lob filename on the fs so that when moving data to another node we don't have to do deep inspection of the sstable contents.
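One possible naming scheme combining both suggestions (key in the filename, content hash stored alongside) is sketched below; the format is purely illustrative, not an on-disk layout Cassandra defines.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LobFileName {
    /** Build a lob filename embedding the row key and a hash of the content. */
    public static String nameFor(String key, byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(content);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest)
            hex.append(String.format("%02x", b));
        // key first: a plain directory listing tells you which rows a node holds,
        // so moving data to another node needs no deep inspection of sstables
        return key + "-" + hex + ".lob";
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(nameFor("user42", "payload".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The same hash, stored inline in the SSTable as part of the lob pointer, is what would make the repair checks cheap: compare hashes, not file contents.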

        Jun Rao added a comment -

        Do we really want to store one blob per file? That could create too many files on a node.

        Jonathan Ellis added a comment -

        Either you store one lob per file, or you store one per file AND compact to an sstable format later, or you "reserve" space in a single file, which would require extra seeks. Since lobs are (a) large by definition, so there is not much point in combining them into a single file, and (b) unlikely to change often, it seems like the cure would be worse than the disease.

        Eric Evans added a comment -

        I personally think that what is being proposed here is outside Cassandra's current scope, and does not fit well with the current design (as evidenced by the need for it to be special-cased through the entire write path).

        -1

        Vijay added a comment -

        I would vote for having an integration with HDFS. We should have a plugin that allows HDFS to save the data on the same node the data is served from, or at least in the same rack... in this way we can serve the data faster with very little latency, and we can leverage a DFS which is already tested by multiple people.

        Thanks
        VJ

        Jonathan Ellis added a comment -

        HDFS integration is a separate issue from native LOB support.

        Jonathan Ellis made changes -
        Field Original Value New Value
        Component/s Core [ 12312978 ]
        Jonathan Ellis added a comment -

        Closing for now; AFAIK adding this is not on anyone's roadmap.

        Jonathan Ellis made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Gavin made changes -
        Workflow no-reopen-closed, patch-avail [ 12467147 ] patch-available, re-open possible [ 12749727 ]
        Gavin made changes -
        Workflow patch-available, re-open possible [ 12749727 ] reopen-resolved, no closed status, patch-avail, testing [ 12754391 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Jonathan Ellis
          • Votes:
            1 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development