CASSANDRA-4208

ColumnFamilyOutputFormat should support writing to multiple column families

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0
    • Component/s: Hadoop
    • Labels:
      None

      Description

      It is not currently possible to output records to more than one column family in a single reducer. Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive. I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.
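
For orientation, the approach ultimately committed routes this through Hadoop's MultipleOutputs API rather than the write()-signature change proposed here. A driver-side wiring sketch only, with the keyspace and column family names as placeholder assumptions (the calls mirror usage shown in the activity below):

```java
// Job wiring sketch; "MyKeyspace", "Users", and "UsersByEmail" are placeholders.
ConfigHelper.setOutputKeyspace(job.getConfiguration(), "MyKeyspace");
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
// One named output per target column family:
MultipleOutputs.addNamedOutput(job, "Users", ColumnFamilyOutputFormat.class,
        ByteBuffer.class, List.class);
MultipleOutputs.addNamedOutput(job, "UsersByEmail", ColumnFamilyOutputFormat.class,
        ByteBuffer.class, List.class);
```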

      1. trunk-4208.txt
        9 kB
        Robbie Strickland
      2. trunk-4208-v2.txt
        11 kB
        Robbie Strickland
      3. cassandra-1.1-4208.txt
        6 kB
        Robbie Strickland
      4. cassandra-1.1-4208-v2.txt
        6 kB
        Robbie Strickland
      5. cassandra-1.1-4208-v3.txt
        6 kB
        Robbie Strickland
      6. cassandra-1.1-4208-v4.txt
        6 kB
        Robbie Strickland
      7. trunk-4208-v3.txt
        6 kB
        Robbie Strickland

        Issue Links

          Activity

          Gavin made changes -
          Workflow patch-available, re-open possible [ 12753552 ] reopen-resolved, no closed status, patch-avail, testing [ 12758802 ]
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12665020 ] patch-available, re-open possible [ 12753552 ]
          Jonathan Ellis made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Jonathan Ellis added a comment -

          Reverted the BOF change in 78d6f64f33c592890051c690ddf5d26b7b2af027

          Michael Kjellman added a comment -

          Another question: why are we targeting 1.0.2 instead of 1.0.3 in build.xml?

          Michael Kjellman added a comment -

          So are we going to revert commit e05a5fc12648f315002c9939a2a0748d74525589 and recommit minus the changes in the patch for BOF?

          Jonathan Ellis made changes -
          Fix Version/s 1.2.0 [ 12323243 ]
          Fix Version/s 1.2.0 beta 2 [ 12323284 ]
          Michael Kjellman added a comment -

          Robbie - I'm okay with that, but I'm not sure we should have the BOF patch you provided applied if it doesn't work. I'm still working on debugging exactly why it doesn't stream, but getting an environment set up to debug the whole process has been difficult.

          If anything, maybe we should revert the change to BOF, keep the other changes, and then open another BOF bug for multiple output support?

          Robbie Strickland added a comment -

          Michael Kjellman - I think the BOF support should be in a separate issue, since CFOF and BOF don't depend on each other for the MultipleOutputs functionality--and because this issue specifically addresses CFOF.

          Jonathan Ellis made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Michael Kjellman added a comment - - edited

          Jake or Robbie – have you tested this with BOF? I've confirmed that it looks like this only streams one of the two named multiple outputs. The sstables are created for both column families but the reducer never streams the data to the nodes.

          T Jake Luciani made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Assignee Robbie Strickland [ rstrickland ]
          Reviewer tjake
          Fix Version/s 1.2.0 beta 2 [ 12323284 ]
          Resolution Fixed [ 1 ]
          T Jake Luciani added a comment -

          Committed thanks!

          Robbie Strickland made changes -
          Attachment trunk-4208-v3.txt [ 12546857 ]
          Robbie Strickland added a comment -

          I've attached the new patch (v3) rebased against trunk.

          Robbie Strickland added a comment -

          Not a problem. I'll do so when I get back from Strange Loop...

          T Jake Luciani added a comment -

          Hi Robbie, I'm ready to commit this, but we don't want to change Hadoop versions on the stable 1.1 branch.

          Could you rebase your patch for trunk? 1.2 should be out soon.

          Robbie Strickland added a comment -

          You don't need the Hadoop patch to make this work. I think I'm confused as to whether you're having trouble getting this to work at all, or just with BOF. As I mentioned I have not tested this with BOF, but it is working against 1.1.x & Hadoop 1.0.2 using CFOF. Look here for an example that works with CFOF: https://gist.github.com/3763728.

          Michael Kjellman added a comment -

          I applied the patch to Hadoop 1.0.3 as well. Are you suggesting then that for now this patch assumes those methods are still private?

          Michael Kjellman added a comment -

          I had already done what your patch contains. Only one SSTable gets created. Have you tested that patch? Am I missing something obvious with the job config requirements?

          Robbie Strickland added a comment -

          Michael Kjellman your usage is correct. What this patch does is actually change the ConfigHelper so set/getColumnFamily() operates on the mapreduce.output.basename key that MultipleOutputs (and FileInput/OutputFormat) uses when it's looking for outputs. This is a bit hacky but unavoidable since methods to alter this through the Hadoop API are inaccessible. I have a related ticket on the Hadoop side to change this and make it more generic, but until then this will have to do.
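
As a rough illustration of the mechanism described above - not Cassandra's actual code - the key-redirection idea can be sketched with java.util.Properties standing in for Hadoop's Configuration. The key name "mapreduce.output.basename" is taken from the comment; the class and method names are hypothetical:

```java
import java.util.Properties;

public class BasenameKeyDemo {
    // MultipleOutputs and FileOutputFormat consult this key when resolving a
    // named output; the patch stores the output CF under the same key.
    static final String BASENAME_KEY = "mapreduce.output.basename";

    static void setOutputColumnFamily(Properties conf, String cf) {
        conf.setProperty(BASENAME_KEY, cf);
    }

    static String getOutputColumnFamily(Properties conf) {
        return conf.getProperty(BASENAME_KEY);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        setOutputColumnFamily(conf, "Users");
        System.out.println(getOutputColumnFamily(conf)); // prints "Users"
    }
}
```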

          Robbie Strickland made changes -
          Attachment cassandra-1.1-4208-v4.txt [ 12546084 ]
          Robbie Strickland added a comment -

          I've attached a new patch that removes the check for a null output CF on BulkOutputFormat. This allows BOF to use the MultipleOutputs API.

          Michael Kjellman added a comment - - edited

          Both ColumnFamilyOutputFormat and BulkOutputFormat: addNamedOutput never seems to set the column family.

          I would assume:

          ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
          MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
          MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);

          is all that is needed. If I don't set up the job with job.setOutputFormatClass(ColumnFamilyOutputFormat.class), FileOutputFormat throws an exception:

          Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
          at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)

          If I do specify that at the job level, the job name never seems to set the column family name on that job.

          Additionally, using the job name as the column family name is slightly inconvenient, as we use '_' in our column family names, which is not a valid character in MultipleOutputs; it looks like _# is the way they internally keep track of counters if that is enabled.

          I would love to see the patch you are proposing to fix the issue for BulkOutputFormat.

          Robbie Strickland added a comment - - edited

          You mean BulkOutputFormat isn't working, or MO isn't working at all? BulkOutputFormat isn't working because it's still checking to make sure the output CF has been set and throwing an exception otherwise. I'm happy to remove this check but we don't use BOF so I don't have the bandwidth to test. I'll create the patch if you want to do so.

          Michael Kjellman added a comment -

          So I've been working on this for a few days. As far as I can tell, this is not working with 1.1.5 and 1.0.3. I've gone through and svn blamed, and it doesn't look like anything exciting has really changed in the mapreduce code. Robbie, have you tested this on the current GA versions?

          Michael Kjellman added a comment - - edited

          Yes - we have it working as well (and thanks for the patch; it's a really important feature to us), but so far we have been unsuccessful in getting it to work with BulkOutputFormat... I'm going to work on debugging that today.

          Robbie Strickland added a comment -

          The attached patch works and we have it running in production. I'm not sure why I haven't received any response since May on whether this will be included in some future release. I presume everyone is busy on other features.

          Michael Kjellman added a comment -

          any additional updates on this? Robbie – what direction did you decide to pursue?

          Robbie Strickland added a comment -

          I'd like to know if this is going to be included or if another direction is preferred. Any update?

          Robbie Strickland added a comment -

          Any word on this?

          Robbie Strickland made changes -
          Attachment cassandra-1.1-4208-v3.txt [ 12526391 ]
          Robbie Strickland added a comment -

          Here's a new patch that handles the potential NPE on getOutputColumnFamily() and throws a more descriptive exception.
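
The fail-fast behavior described above can be sketched in isolation; the method name follows the comment, but the exception type, message, and surrounding class are assumptions, not the actual patch:

```java
public class OutputCfCheckDemo {
    // Throw a descriptive exception instead of letting callers hit an NPE
    // when no output column family was configured.
    static String getOutputColumnFamily(String configured) {
        if (configured == null)
            throw new UnsupportedOperationException(
                "You must set the output column family, either via "
                + "setOutputColumnFamily() or a MultipleOutputs named output");
        return configured;
    }

    public static void main(String[] args) {
        try {
            getOutputColumnFamily(null);
        } catch (UnsupportedOperationException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```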

          T Jake Luciani added a comment -

          Well, there is always http://tutorials.jenkov.com/java-reflection/private-fields-and-methods.html#methods

          We use something like this in FBUtilities for accessing protected fields.

          I don't know how much worry a NPE should be, you could just add a log message if column family isn't set so people can see it before the NPE and realize they did something wrong.
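
The reflection workaround Jake mentions can be sketched with plain JDK reflection; the nested class and method here are stand-ins, not Hadoop's actual MultipleOutputs:

```java
import java.lang.reflect.Method;

public class PrivateAccessDemo {
    static class Outputs {
        // Stand-in for a private method like MultipleOutputs.getNamedOutputsList()
        private static String getNamedOutputsList() {
            return "cf1,cf2";
        }
    }

    public static void main(String[] args) throws Exception {
        Method m = Outputs.class.getDeclaredMethod("getNamedOutputsList");
        m.setAccessible(true); // bypass the private modifier
        String outputs = (String) m.invoke(null);
        System.out.println(outputs); // prints "cf1,cf2"
    }
}
```

This is the same pattern FBUtilities-style helpers use; it is brittle against upstream renames, which is why a public API would be preferable.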

          Robbie Strickland made changes -
          Attachment cassandra-1.1-4208-v2.txt [ 12526340 ]
          Robbie Strickland added a comment - - edited

          I've attached a patch that adds a setOutputColumnFamily() overload that takes in both keyspace and CF. The one outstanding issue that I've commented on in CFOF is that checkOutputSpecs() cannot currently ensure that a CF has been specified either through setOutputColumnFamily() or MultipleOutputs.

          Unfortunately MultipleOutputs.getNamedOutputsList(), which would be the right way to do this, is currently private. So we either don't do the check and let it throw an NPE at runtime, or we duplicate the code in MultipleOutputs to grab the values from config ourselves. Not sure which is the lesser of two evils.

          T Jake Luciani added a comment - - edited

          I'm ok with this now that it works with MultipleOutputs (nice find), though I'm not sure if it should be in 1.1 since it would break existing scripts. Would you be able to make it backwards compatible by adding the old public static void setOutputColumnFamily(Configuration conf, String keyspace, String columnFamily) back and using the new setColumnFamily() in there?

          Robbie Strickland added a comment -

          Any word on whether this solution is getting the thumbs up? I personally need this functionality and would like to proceed in a manner that will ultimately be accepted by the community.

          Robbie Strickland made changes -
          Attachment cassandra-1.1-4208.txt [ 12525637 ]
          Robbie Strickland added a comment -

          It appears I was mistaken about the MultipleOutputs issue being resolved only in trunk. It's resolved in the mapred package in trunk, but the new version in mapreduce dates at least back to 1.0.1. It still references FileOutputFormat, but the attached patch gets around this by using the same config key. I have attached a new patch based against Cassandra 1.1 and Hadoop 1.0.2. Changes are actually minimal. Let me know your thoughts...

          Robbie Strickland added a comment -

          @Jonathan: Yes that is the patch, although the Hadoop patch is not required as long as you have the latest in trunk. The Hadoop patch just moves the call to set the base name out of FileOutputFormat and into OutputFormat--as a matter of principle and to avoid potential future issues.

          @Jake: Yes it is different. I examined prior branches to see where the changes were made, and it's only in trunk--which is why I didn't see it until checking out trunk to make the changes.

          It probably makes sense to do a patch against Hadoop 1.0.2 and Cassandra 1.1 so people can use a release version. This is definitely doable without significant effort.

          T Jake Luciani added a comment -

          @Robbie is the version in hadoop trunk different than the version included in MAPREDUCE-3607?

          Jonathan Ellis added a comment -

          I am submitting a patch to deal with an inconsistency that could cause future issues with non-file formats

          On MAPREDUCE-4216 or elsewhere?

          Robbie Strickland made changes -
          Attachment trunk-4208-v2.txt [ 12525455 ]
          Robbie Strickland added a comment -

          I've added a patch to allow support for MultipleOutputs. Hadoop trunk now contains a new version of MultipleOutputs that should support this out of the box, although I am submitting a patch to deal with an inconsistency that could cause future issues with non-file formats.

          The basic solution involves changing the config key for output CF to match the "basename" key being written by MultipleOutputs. I had to make related changes to CassandraStorage and TestRingCache, as well as some minor changes to ColumnFamilyInputFormat to account for some interface changes in Hadoop trunk.

          So the bottom line is this will work if people use Hadoop and Cassandra trunk with both patches applied. The original patch can be used as a temporary solution if needed.

          T Jake Luciani added a comment -

          @Robbie can you post your code analysis on the hadoop ticket?

          T Jake Luciani added a comment -

          I would think the Hadoop community would go for it since they already do so much to decouple MR from HDFS.

          Let's ping them and see what they think, otherwise we could go with the less portable solution.

          Robbie Strickland added a comment -

          I spent a good bit of time analyzing the changes needed to make this work using MultipleOutputs, and it would involve:

          1. Removing hard-coded references to WritableComparable and Writable in MultipleOutputs.getNamedOutputKeyClass() and getNamedOutputValueClass().
          2. Removing hard-coded call to FileOutputFormat.setOutputName() in getRecordWriter().
          3. Adding an abstract setOutputName() to OutputFormat so the call in #2 can be made generic. An alternative is a default no-op implementation so it doesn't break existing output formats that don't care about this.
          4. Implementing setOutputName() in ColumnFamilyOutputFormat, which would set the config property for the CF (where the "name" corresponds to CF).
          5. Separating CFOF.setColumnFamily() and setKeyspace(), where setColumnFamily() is just a pass-through to setOutputName() (or vice versa).

          This solution would allow MultipleOutputs support in conformance with the existing API, and it should not break any existing reducer code. I don't personally love the boilerplate it adds to my reducer, and I think it's much less obvious than handling it at the write() call, but I can get over that if I have to. I am willing to do the work on both sides if this is where the consensus is, though I don't know what the response will be in the Hadoop community.

          Thoughts?
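
Point 3 of the list above can be sketched as follows; the class names mimic Hadoop's but are simplified stand-ins, with a Map in place of Configuration and a hypothetical config key:

```java
import java.util.HashMap;
import java.util.Map;

public class OutputNameDemo {
    // Default no-op setOutputName(), so existing output formats are unaffected.
    static abstract class OutputFormat {
        void setOutputName(Map<String, String> conf, String name) {
            // default: do nothing
        }
    }

    // Cassandra's format overrides it to record the name as the output CF.
    static class ColumnFamilyOutputFormat extends OutputFormat {
        @Override
        void setOutputName(Map<String, String> conf, String name) {
            conf.put("cassandra.output.columnfamily", name); // hypothetical key
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        new OutputFormat() {}.setOutputName(conf, "ignored");     // no-op
        new ColumnFamilyOutputFormat().setOutputName(conf, "Users");
        System.out.println(conf.get("cassandra.output.columnfamily")); // prints "Users"
    }
}
```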

          T Jake Luciani added a comment -

          My bad. I didn't notice the linked issue.

          Robbie Strickland made changes -
          Comment [ I created an issue regarding the specificity of MultipleOutputs to FileOutputFormat. Linked here as an FYI. ]
          Robbie Strickland made changes -
          Link This issue relates to MAPREDUCE-4216 [ MAPREDUCE-4216 ]
          Robbie Strickland added a comment -

          @Jake: MultipleOutputs is the class we've been referring to in the above posts, and it was around pre-1.0. Did you mean to refer to something else?

          T Jake Luciani added a comment -

          Could this be accomplished using http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html? It was recently added to Hadoop 1.0.2.
          Robbie Strickland added a comment -

          Looking a bit closer at the MultipleOutputs class, it seems pretty tied to FileOutputFormat. So if we go this route we're probably looking at a separate CassandraMultipleOutputs with little re-use from MultipleOutputs. We could re-use the config keys, but we'd have to duplicate the strings since they're private. Am I missing something that makes this more straightforward?

          Robbie Strickland added a comment -

          We could use MultipleOutputs if you think that's better, though the implementation is certainly less trivial than what I've done here. Upside is of course sticking with the convention. I'm not really sure it gets us any more than that, and personally I think it adds unnecessary complexity to an already convoluted API. Passing in a CF at the call level is more intuitive and will be more familiar to Cassandra users, IMHO. But I'm happy to work on the MultipleOutputs version if that's the consensus.

          Jonathan Ellis added a comment -

          Are you familiar with the Hadoop MultipleOutputs api? Seems like that's the "right" way to do this.

          Robbie Strickland added a comment -

          I should note it would be easy to make this work with previous releases if desired. I think that was your real question...

          Robbie Strickland added a comment -

          There is an API change, so when you do a context.write(), the signature now takes in a Pair<String, ByteBuffer> instead of just a ByteBuffer. I also changed ConfigHelper.setOutputColumnFamily() to setOutputKeyspace() and removed CF-related checks and config keys. It broke my existing reducers, but it's also an easy fix and adds tremendous value IMHO.
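
The proposed call shape can be mocked up with Map.Entry standing in for Cassandra's Pair and strings standing in for mutations - a sketch of the signature change only, not the real ColumnFamilyRecordWriter:

```java
import java.nio.ByteBuffer;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MultiCfWriteDemo {
    static final List<String> log = new ArrayList<>();

    // write() now takes a (columnFamily, rowKey) pair instead of a bare row key,
    // so one reducer can target multiple column families.
    static void write(Map.Entry<String, ByteBuffer> cfAndKey, List<String> mutations) {
        log.add(cfAndKey.getKey() + ":" + mutations.size());
    }

    public static void main(String[] args) {
        ByteBuffer key = ByteBuffer.wrap("row1".getBytes());
        // Same row key written to two column families from a single reducer call site:
        write(new SimpleEntry<>("Users", key), List.of("m1"));
        write(new SimpleEntry<>("UsersByEmail", key), List.of("m1", "m2"));
        System.out.println(log); // prints "[Users:1, UsersByEmail:2]"
    }
}
```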

          Jonathan Ellis made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Jonathan Ellis added a comment -

          Thanks, Robbie.

          What are the backwards-compatibility effects here?

          Robbie Strickland made changes -
          Field Original Value New Value
          Attachment trunk-4208.txt [ 12525187 ]
          Robbie Strickland created issue -

            People

            • Assignee:
              Robbie Strickland
              Reporter:
              Robbie Strickland
              Reviewer:
              T Jake Luciani
            • Votes:
              5
              Watchers:
              4
