Cassandra / CASSANDRA-4912

BulkOutputFormat should support Hadoop MultipleOutput

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: None
    • Component/s: Hadoop
    • Labels: None

      Description

      Much like CASSANDRA-4208, BulkOutputFormat (BOF) should support outputting to multiple column families. The current approach taken in the patch for ColumnFamilyOutputFormat (CFOF) results in only one stream being sent, and in an exception being thrown when Hadoop is run in local mode, due to the call to ConfigHelper when a new BulkRecordWriter is created.

      1. 4912.txt
        3 kB
        Michael Kjellman
      2. App.java
        7 kB
        Michael Kjellman
      3. loaddata.pl
        0.7 kB
        Michael Kjellman
      4. pom.xml
        3 kB
        Michael Kjellman

        Activity

        Brandon Williams added a comment -

        Yep, seems to work. Committed, thanks.

        Michael Kjellman added a comment -

        Brandon Williams, did everything compile okay for you?

        Michael Kjellman added a comment -

        Brandon Williams, I'm also including a pom.xml in case you decide to use Maven for testing this.

        Michael Kjellman added a comment -

        Okay, I attached a script to load data (really simple, but I wanted you to see what kind of data I was using to test that the Example job runs) and App.java, which will allow you to output to multiple column families with BOF.

        Michael Kjellman added a comment -

        Yeah, sorry, it wasn't originally intended as a functional example. I'll create one that does something now.

        Brandon Williams added a comment -

        I still get a slew of errors trying to compile this. An obvious one is in ReducerToCassandra.reduce where 'val' is never defined, but there are many others.

        Michael Kjellman added a comment -

        Updated example with imports.

        Brandon Williams added a comment -

        Do you have an Example.java that contains all the imports?

        Brandon Williams added a comment -

        There is no particular reason that I recall, it was just a convenient place at the time.

        Michael Kjellman added a comment -

        I propose setting outputdir after instantiation to add support for MultipleOutputs with BulkOutputFormat.

        Michael Kjellman added a comment -

        Okay, so it looks like setting outputdir when the object is created is what causes the problem. I moved setting outputdir into prepareWriter(), and it looks like both sstables are now created and streamed.

        Brandon Williams, is there any reason the outputdir is created when the BulkRecordWriter object is created?
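
        The difference between resolving the output directory in the constructor and resolving it in prepareWriter() can be illustrated with a minimal sketch. This is not the code from 4912.txt: the classes below are hypothetical stand-ins for BulkRecordWriter, a plain Map replaces Hadoop's Configuration, and the "/tmp/bulk/" path is invented for illustration. It only shows why deferring the lookup might help if the column family name is not yet in the config when the writer is constructed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for BulkRecordWriter; a plain Map replaces Hadoop's Configuration.
class LazyOutputDirSketch {
    // Property name mentioned elsewhere in this thread.
    static final String BASENAME_KEY = "mapreduce.output.basename";

    // Builds an illustrative directory path; fails if no column family is configured yet.
    static String resolveDir(Map<String, String> conf) {
        String columnFamily = conf.get(BASENAME_KEY);
        if (columnFamily == null) {
            throw new UnsupportedOperationException("output column family is not set");
        }
        return "/tmp/bulk/" + columnFamily;
    }

    // Eager variant: the directory is resolved in the constructor.
    static class EagerWriter {
        final String outputDir;

        EagerWriter(Map<String, String> conf) {
            this.outputDir = resolveDir(conf);
        }
    }

    // Lazy variant: the directory is resolved on first use.
    static class LazyWriter {
        private final Map<String, String> conf;
        private String outputDir;

        LazyWriter(Map<String, String> conf) {
            this.conf = conf;
        }

        String prepareWriter() {
            if (outputDir == null) {
                outputDir = resolveDir(conf);
            }
            return outputDir;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<String, String>();
        LazyWriter lazy = new LazyWriter(conf);   // nothing resolved yet, so no exception
        conf.put(BASENAME_KEY, "cf1");            // column family becomes known later
        System.out.println(lazy.prepareWriter());

        try {
            new EagerWriter(new HashMap<String, String>()); // resolves against an empty config
        } catch (UnsupportedOperationException e) {
            System.out.println("eager construction failed: " + e.getMessage());
        }
    }
}
```

        In this sketch the lazy writer tolerates a config that is filled in after construction, while the eager writer fails immediately. Whether that is exactly what happens inside Hadoop's MultipleOutputs is an assumption here, not something this thread confirms.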

        Michael Kjellman added a comment -

        I think another difference in behavior between CFOF and BOF is that when a new BulkRecordWriter(Configuration conf) is created, it creates the directory for the sstables. It calls ConfigHelper here to get the name of the column family so it can create the directory. In CFOF, the only call to getOutputColumnFamily is in RangeClient.

        Normally, without MultipleOutputs, the job config would include a setOutputColumnFamily() call. I don't understand what calls setOutputColumnFamily when you add a new named MultipleOutput. I presume this is where the problem is.

        Michael Kjellman added a comment -

        It looks like OUTPUT_COLUMNFAMILY_CONFIG never gets set in ConfigHelper when a new BulkRecordWriter is created. It's difficult to figure out exactly what should be setting mapreduce.output.basename in the job config, and where.

        Michael Kjellman added a comment -

        Brandon Williams, if I patch BulkOutputFormat.java in a similar manner to CASSANDRA-4208 (line 40), that is what causes the initial check of the config to pass but then fail when the reducer is created. I'm still not sure why the behavior is different.

        Michael Kjellman added a comment - - edited

        So when ConfigHelper calls checkOutputSpecs() in local mode when the job is set up, we don't throw any exceptions. When a reducer is created, however, org.apache.cassandra.hadoop.ConfigHelper.getOutputColumnFamily throws an UnsupportedOperationException saying that the output column family isn't set up. It looks like mapreduce.output.basename is null.

        See Example.java attached as a stripped down example MR job.
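
        The "passes at job setup, fails in the reducer" pattern described above can be reproduced in miniature. The sketch below is a hypothetical analogue, not Cassandra's ConfigHelper: a Map stands in for the Hadoop Configuration, the "keyspace" key and both method bodies are assumptions, and it presumes the setup-time check validates only the keyspace while the task-time lookup requires mapreduce.output.basename.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical analogue of the two checks discussed in this thread.
class OutputColumnFamilyLookupSketch {
    static final String BASENAME_KEY = "mapreduce.output.basename";
    static final String KEYSPACE_KEY = "keyspace"; // invented key name for illustration

    // Assumed setup-time check: only the keyspace is required.
    static void checkOutputSpecs(Map<String, String> conf) {
        if (conf.get(KEYSPACE_KEY) == null) {
            throw new UnsupportedOperationException("output keyspace is not set");
        }
    }

    // Assumed task-time lookup: the column family must be present.
    static String getOutputColumnFamily(Map<String, String> conf) {
        String columnFamily = conf.get(BASENAME_KEY);
        if (columnFamily == null) {
            throw new UnsupportedOperationException("output column family is not set");
        }
        return columnFamily;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<String, String>();
        conf.put(KEYSPACE_KEY, "ks1");

        checkOutputSpecs(conf); // passes: the keyspace is present
        System.out.println("setup check passed");

        try {
            getOutputColumnFamily(conf); // fails: the basename was never set
        } catch (UnsupportedOperationException e) {
            System.out.println("task lookup failed: " + e.getMessage());
        }
    }
}
```

        If the real checks behave this way, a job can clear validation at submission and still fail once a writer asks for the column family name.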

        Brandon Williams added a comment -

        Hmm, normally I find the opposite: local mode works, and then everything breaks in distributed mode. Can you post everything needed to test?

        Michael Kjellman added a comment - - edited

        So obviously this is due to the handling in the close() function in BulkRecordWriter. So far I've been unable to get BOF to work in local mode through Eclipse with MultipleOutput. ConfigHelper is happy on the first check of the job config, but when the reducer is instantiated the column family output names don't seem to be set. close() is pretty simple in BulkRecordWriter, though: it looks like the sstable is first closed and then streamed to the nodes. I'm guessing that close() is only being called on one of the sstables/named outputs (I do see in a fully distributed cluster that the sstables get created for multiple column families).


          People

          • Assignee: Michael Kjellman
          • Reporter: Michael Kjellman
          • Reviewer: Brandon Williams
          • Votes: 0
          • Watchers: 2