Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.7 beta 2
    • Component/s: None
    • Labels: None

Description

Hadoop Streaming is a framework that allows MapReduce jobs to be written in languages other than Java, by performing simple IPC over stdin/stdout.

Adding output support for Hadoop Streaming to Cassandra would mean that users could write very simple scripts in dynamic languages to load data into Cassandra. Once our Hadoop OutputFormat has stabilized a bit, we might also be able to use this code to provide scalable bulk loading.
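For illustration: a streaming mapper and reducer are just executables that read lines on stdin and write tab-separated key/value pairs on stdout. A minimal word-count sketch in Python (script names and the job invocation are illustrative, not from the patch):

```python
def map_words(lines):
    # Mapper: emit one "word<TAB>1" pair per word (Hadoop Streaming's
    # default key/value separator is a tab).
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_counts(pairs):
    # Reducer: sum counts per word. Hadoop delivers pairs sorted by key,
    # so all occurrences of a word arrive together.
    current, total = None, 0
    for pair in pairs:
        word, count = pair.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

# As mapper.py/reducer.py these generators would be driven by sys.stdin
# and print(), and submitted with something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#     -input ... -output ...
```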

Issue Links

Activity

          Stu Hood added a comment -

          0001 - The Hadoop version available directly from Apache is missing some streaming patches that allow for binary data (HADOOP-1722)

          0002 - Adds implementations of Hadoop Streaming interfaces that parse incoming binary data

          0003 - Applies the deprecated Hadoop 0.18 'mapred' OutputFormat interface (as opposed to 0.20's 'mapreduce') to our OutputFormat, since Streaming has not been ported yet

          0004 - Adds a word count example in Python using Hadoop Streaming to count words in an input text file, and write them to Cassandra

          Stu Hood added a comment -

          The second half of contrib/hadoop_streaming_output/bin/reducer.py is the part that actually interacts with our OutputFormat: users create 'StreamingMutation' objects, and write them to stdout using the Avro API.
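For a sense of what crosses the pipe: Avro's binary encoding frames a record as its fields concatenated in schema order, with `bytes` length-prefixed and longs zigzag-varint encoded. A stdlib-only sketch of that framing (the field names and layout of `StreamingMutation` here are hypothetical; the real schema ships with the patch):

```python
def zigzag(n):
    # Avro maps signed longs onto unsigned ints: 0,-1,1,-2,... -> 0,1,2,3,...
    return (n << 1) ^ (n >> 63)

def encode_long(n):
    # Variable-length encoding: 7 bits per byte, high bit means "more follows".
    n = zigzag(n)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_bytes(b):
    # Avro 'bytes': varint length prefix, then the raw bytes.
    return encode_long(len(b)) + b

def encode_mutation(key, name, value, timestamp):
    # An Avro record is simply its fields concatenated in schema order.
    # These four fields are a hypothetical StreamingMutation-like layout.
    return (encode_bytes(key) + encode_bytes(name) +
            encode_bytes(value) + encode_long(timestamp))
```

In practice the reducer builds these records with the avro library's DatumWriter against the schema published by the OutputFormat, rather than by hand.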

          Stu Hood added a comment -

          Depends on the Avro changes on 1315.

          Hudson added a comment -

          Integrated in Cassandra #514 (See http://hudson.zones.apache.org/hudson/job/Cassandra/514/)
          optimize [Time|Lexical]UUIDType comparison further. patch by Folke Behrens; reviewed by jbellis for CASSANDRA-1368

          Stu Hood added a comment -

          Rebased for trunk: still applies atop CASSANDRA-1315

          Jonathan Ellis added a comment -

          wouldn't json be an even better fit than thrift or avro?

          Jonathan Ellis added a comment -

          or even simpler: allow specifying separator characters for rows and columns (iianm this is what regular hadoop streaming does)

          Stu Hood added a comment -

          > wouldn't json be an even better fit than thrift or avro?
          Thrift and Avro serialization exist because JSON is not a nice way to deal with tons of data (especially binary data).

          > or even simpler: allow specifying separator characters for rows and columns (iianm this is what regular hadoop streaming does)
          I think you are seriously underestimating the can of worms this would be, and it wouldn't even get you timestamp support.

          Jonathan Ellis added a comment -

          > Thrift and Avro serialization exist because JSON is not a nice way to deal with tons of data (especially binary data).

          We've introduced support for annotating data with types, so you can represent a long as a long and a uuid as a pretty string, instead of everything being opaque binary.

          I worry that the Avro cure is worse than the disease.

          > I think you are seriously underestimating the can of worms this would be, and it wouldn't even get you timestamp support.

          Maybe. But ColumnOrSuperColumn isn't a whole lot better, and has the drawback of inflicting Yet Another Serialization Format on people to learn.

          Stu Hood added a comment -

          > But ColumnOrSuperColumn isn't a whole lot better
          This argument applies equally well to our client API.

          Switching to JSON in our client API would arguably have less effect than switching to JSON here, since client interactions are more frequently bottlenecked by network latency, while a streaming API should always be bottlenecked on throughput. Smaller objects are better in both locations, but gain us more benefit here.

          > and has the drawback of inflicting Yet Another Serialization Format
          Considering that the entire interaction with Avro is ~20 lines of code (most of which is simply creating dictionaries, which you would have to do for JSON serialization anyway), I don't think we're inconveniencing folks.

          Jonathan Ellis added a comment -

          can you add an example using streaming on the input side?

          Stu Hood added a comment -

          > can you add an example using streaming on the input side?
          It's not implemented: streaming.AvroOutputReader only implements outputting to Cassandra from streaming.

          Jonathan Ellis added a comment -

          > If the client might be using an alternate Avro schema, they can specify it using the OUTPUT_SCHEMA_KEY

          Is this likely to come up in practice or can we get rid of it?

          Stu Hood added a comment -

          > Is this likely to come up in practice or can we get rid of it?
          Ack... I don't think it is actually implemented in this patch yet. Without adding it, changing the Avro client API will break Hadoop Streaming clients.

          I should fix that before we commit.

          Jonathan Ellis added a comment -

          but we're okay if we change the avro client api in backwards-compatible ways, right?

          i'd say adding OUTPUT_SCHEMA_KEY belongs in the "if/when it's actually a problem" category

          Jonathan Ellis added a comment -

          Made some minor changes (r/m lib/license file in 01, and r/m some unused variables in 02) but it still needs the build tweaked to work after the src/cassandra.avro change (i'm guessing that's the culprit).

          Stu Hood added a comment -

          Fixed the build problem: ant didn't like ivysettings.xml being loaded explicitly. No other changes from your rendition.

          Stu Hood added a comment -

          Ah humbug... nevermind... it only works directly after an 'ant realclean'.

          Jonathan Ellis added a comment -

          which build.xml is correct?

          Stu Hood added a comment -

          I've rebased this, and I can't get the build to fail anymore... really no clue what was going on.

          Jonathan Ellis added a comment -

          rebased & committed

          Hudson added a comment -

          Integrated in Cassandra #533 (See https://hudson.apache.org/hudson/job/Cassandra/533/)


People

    • Assignee: Stu Hood
    • Reporter: Stu Hood
    • Reviewer: Jonathan Ellis
    • Votes: 0
    • Watchers: 1

Dates

    • Created:
    • Updated:
    • Resolved:
