Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: v0.9.0, v0.9.1, v0.9.2
    • Fix Version/s: v0.9.5
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      Since flume seems pretty extensible and there are those interested in using Cassandra as a logging store, it would be nice to have a Cassandra sink for Flume.

        Issue Links

          Activity

          flume_thobbs added a comment -

          Has there been any work done on upgrading to the latest version of Thrift? Recent versions of Cassandra use a newer version of Thrift.

          Jonathan Hsieh added a comment -

          Tyler,

          Can you post the patch up for code review at review.cloudera.org? From a quick look at the patch (in a text editor), it is really big because of the inclusion of some jars. We will probably want this to be modified to be a "plugin" so the core doesn't have to have these dependencies.

          It is important for us to use "standard versions" of thrift – we use the 0.2 release. If there is a new release (0.3?) we may be able to switch to that, but I don't think we are willing to build against trunk.

          Thanks!
          Jon.

          flume_thobbs added a comment -

          Jon,

          There is a new version of Thrift (0.3.0). I can certainly put that in instead (which I should have done the first time, my mistake).

          How should I go about making this a plugin?

          flume_thobbs added a comment -

          Never mind, I found the plugin documentation. I'll let you know if I have any trouble with that.

          Jonathan Hsieh added a comment -

          Tyler,

          This looks pretty cool! Thanks for letting us know about Thrift 0.3.0 – looking into it I noticed the release date was about two weeks ago. We should make that another JIRA to update to that version.

          A few questions:

          • Is this a blocker for cassandra compatibility?
          • I noticed no .thrift file and a few tests changed the size of data being transferred by a byte – do you know if 0.3.0 is wire-compatible with 0.2.0? (Would upgrading break scribe clients?)

          I did a quick scan of the code and have a few issues:

          • The open() call should grab resources, not the constructor. The constructor is often used without calling open() (and a close() should not be needed after a constructor call). This happens sometimes on the master when configs are parsed and checked.
          • Need some unit tests. This is more lenient in the plug-in case, but still necessary.

          Jon
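
The open()/close() point in the review above can be sketched as follows. This is only a minimal illustration under stated assumptions, not Flume's actual sink API: `SimpleSink`, `LazySink`, and `FakeConnection` are hypothetical names invented for the example, and the "connection" is an in-memory stand-in for a real Cassandra client.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sink contract for illustration; Flume's real interface may differ.
interface SimpleSink {
    void open();
    void append(String event);
    void close();
}

// Hypothetical in-memory stand-in for a client connection.
class FakeConnection {
    final String endpoint;
    final List<String> written = new ArrayList<>();

    FakeConnection(String host, int port) {
        this.endpoint = host + ":" + port;
    }

    void write(String event) {
        written.add(event);
    }
}

class LazySink implements SimpleSink {
    private final String host;
    private final int port;
    private FakeConnection conn; // stays null until open() is called

    // The constructor only records configuration, so building a sink just to
    // validate a config never acquires anything that needs a close().
    LazySink(String host, int port) {
        this.host = host;
        this.port = port;
    }

    @Override
    public void open() {
        if (conn != null) {
            throw new IllegalStateException("sink is already open");
        }
        conn = new FakeConnection(host, port); // resources are acquired here
    }

    @Override
    public void append(String event) {
        if (conn == null) {
            throw new IllegalStateException("append() called before open()");
        }
        conn.write(event);
    }

    @Override
    public void close() {
        conn = null; // release; safe to call even if open() never ran
    }

    boolean isOpen() {
        return conn != null;
    }
}
```

With this shape, a config checker can construct the sink and discard it without ever calling close(), which is the behaviour the review asks for.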

          flume_thobbs added a comment -

          Thanks for clarifying open() and close() for me; I've made changes to match that.

          Thrift 0.3.0 should be backwards-compatible with Thrift 0.2.0. If you can easily test whether or not 0.3.0 breaks scribe clients, I would appreciate the feedback. I'm still working on rebuilding Hector (the Cassandra client) with Thrift 0.2.0, since that should theoretically work also.

          Disabled imported user added a comment -

          I'm just wondering though - does this type of thing belong in flume or in cassandra as a contrib module? With Cassandra's MapReduce and Pig support, it's in core Cassandra (well the Pig loadfunc will be going in the hadoop package in core cassandra). With the Hive storage handler, it's going into the Hive project. I'm just wondering how Flume handles various sinks - in the Flume project itself?

          Jonathan Hsieh added a comment -

          Flume has a plugins/ directory, which is essentially equivalent to contrib code. We are thinking about separating this out into a different repo to make getting plugins easier and to decouple the two. The negative of this approach is that some of the plugins may get orphaned.

          My question here is, which community is more "responsible" for the code's maintenance? Also, how was the decision made for the other projects?

          Jonathan Hsieh added a comment -

          I believe thrift 0.3.0 is wire compatible with 0.2.0 and the new 0.4.0. For thrift 0.3.0 in javaland, the interface between libthrift.jar and generated java code seems different. However, clients of the generated java code remained unchanged. For thrift v0.4.0, the api for the generated code has changed. (byte[] are now ByteBuffers in java land).

          Is the thrift versioning issue here due to flume and cassandra using different versions of thrift in the same process/jvm?
          Would this affect the contents of the patch?
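
The byte[] to ByteBuffer change mentioned above is the kind of thing a small adapter can absorb at the boundary between sink code and generated code. A minimal sketch, assuming only the standard `java.nio.ByteBuffer` API; the class name `ThriftBinaryCompat` and its methods are hypothetical helpers, not part of Thrift or Flume:

```java
import java.nio.ByteBuffer;

// Hypothetical helper for code that holds byte[] payloads but must call
// generated methods that take or return ByteBuffer.
final class ThriftBinaryCompat {
    private ThriftBinaryCompat() {}

    // Wrap a byte[] for an API that expects a ByteBuffer. No copy is made,
    // so later changes to the array are visible through the buffer.
    static ByteBuffer toBuffer(byte[] bytes) {
        return ByteBuffer.wrap(bytes);
    }

    // Copy the remaining bytes out of a ByteBuffer without moving the
    // caller's position: duplicate() shares content but has its own cursor.
    static byte[] toBytes(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }
}
```

Reading through a duplicate matters because a buffer handed back by generated code may be read again later; calling `get` directly on it would advance its position and make a second read come back empty.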

          flume_thobbs added a comment -

          Yes, primarily, using different versions of Thrift in the same process is the problem.

          The same version of Thrift could be used by both. This would require regenerating the Thrift generated bindings for the Cassandra client (as well as rebuilding Cassandra) with an older version of Thrift each time that the Thrift API changes. Through the life of a major Cassandra release this shouldn't be too frequent, but it's painfully often during the beta phase right now. So, this would be a one-time thing for Cassandra 0.6.x, and hopefully the same for Cassandra 0.7.x once it's out of the beta phase.

          It's not a simple task, but if the plan moving forward is to offer support for different systems through plugins, making use of a plugin architecture like the Java Plugin Framework or OSGi might make sense. These are nice because they allow you to control how jars are used and even use different versions of the same library for different code. Using one of these would definitely solve two of the Cassandra problems: using a different Thrift jar, and supporting multiple Cassandra versions. I suspect you'll run into similar issues with other projects, too.

          If this is something that you're interested in doing, I might be able to spend some time helping.

          I'm not sure I understand your questions about affecting the contents of the patch. Do you mean if Thrift 0.4.0 were to be used instead?
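
The library-isolation idea raised in this comment (different code using different versions of the same jar) rests on giving each plugin its own class loader. The following is a rough sketch using only the JDK's `URLClassLoader`; it is not how the Java Plugin Framework, OSGi, or Flume load plugins, and `PluginLoaders.forJars` is a hypothetical helper name.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: one class loader per plugin, built from that
// plugin's own jars (for example its own libthrift and Cassandra client).
final class PluginLoaders {
    private PluginLoaders() {}

    static URLClassLoader forJars(List<Path> jars) {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < urls.length; i++) {
            try {
                urls[i] = jars.get(i).toUri().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("bad jar path: " + jars.get(i), e);
            }
        }
        // Use the parent of the system loader, so classes on the host's
        // application classpath (including its own Thrift jar) are not
        // visible to the plugin and cannot clash with the plugin's copies.
        ClassLoader parent = ClassLoader.getSystemClassLoader().getParent();
        return new URLClassLoader(urls, parent);
    }
}
```

A real plugin system also has to expose the host's sink interfaces to the plugin, which this sketch deliberately omits; frameworks such as OSGi handle that by declaring which packages are shared.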

          flume_thobbs added a comment -

          Well, I've figured out a slightly easier way to add Cassandra as a plugin sink. Support is not very sophisticated yet, but it does exist: http://github.com/thobbs/flume-cassandra-plugin

          It should be relatively easy to deal with different Thrift versions now, and different versions of Cassandra can be supported by different versions of the plugin. I'm satisfied with this if you are.

          Jonathan Hsieh added a comment -

          I did a high level look at it – let me make sure I understand the approach you took.

          The plugin has checked in a bunch of thrift-generated code (what version of thrift generated it?). Because thrift 0.2.0, 0.3.0, and 0.4.0 are all wire compatible, the flume-cassandra plugin uses thrift-generated code that is the same version Flume is using. This prevents us from having thrift runtime library clashes even if Flume and Cassandra are running different versions of Thrift. Also, to compile, we don't need the full Cassandra jar (which is nice). We eventually would need a different version of the Cassandra plugin as the thrift APIs for Cassandra change over time.

          Is this about right? If so, I think it is a reasonable approach.

          Some updates:
          Flume is about to move to Thrift 0.4.0 (patch has passed review and awaiting commit https://issues.cloudera.org/browse/FLUME-202).

          Some suggestions:
          Ideally some combination of comments/docs/classnames would have a Cassandra version number if Cassandra's thrift apis are still evolving. Ditto for having a Flume version number since Flume's thrift api may change and may be changing thrift versions.

          I don't think we have explicitly agreed where this plugin will live. By fiat, it looks like it lives in a separate repo. I'm fine with this approach – it is probably easier for all of us. A suggestion if we follow this approach is to add some info to the Flume documentation pointing to your repo and the instructions you have written up.

          Thanks,
          Jon

          flume_thobbs added a comment -

          Jon,

          Yes, it sounds like you understand the approach.

          It is a little tricky to juggle the Flume and Cassandra versions, but I'm working to have something reasonable available for popular versions of each. I'll try to make sure the documentation on my end is clear about what version of the plugin works with each Flume/Cassandra version.

          Since I'm not as familiar with the Flume community, what versions of Flume do you suggest supporting? Do a lot of people follow trunk?

          Also, don't forget that you can fork the repo or copy the code somewhere into the Flume repo at any time!

          Bruce Mitchener added a comment -

          The feature that I'm asking for in FLUME-220 may be useful / relevant here as well.

          Disabled imported user added a comment -

          Since the plugin exists as a separate github project I marked this issue as resolved. I think it makes sense to have a separate project that can track both Cassandra releases as well as Flume updates. That way Tyler can update with changes to Cassandra more easily.

          Jonathan Hsieh added a comment -

          I've chatted with a few folks on IRC and I think that having plugins as separate projects makes sense. I think "including" cassandra in flume is a little silly and including flume in cassandra is also silly. I've updated the cloudera/flume github wiki page with links to the project (and others).

          At some point we will likely ivy/maven-ify things so that plugins can build against the proper versions of flume.

          Jonathan Hsieh added a comment -

          Link to flume wiki with plugins list lives here: http://github.com/cloudera/flume/wiki

          Jonathan Hsieh added a comment -

          Update affects versions to reflect which versions it should work with.


            People

            • Assignee:
              Unassigned
              Reporter:
              Disabled imported user
            • Votes:
              0
              Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved:

                Development