Hadoop Common
HADOOP-234

Hadoop Pipes for writing map/reduce jobs in C++ and Python

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels: None

      Description

      MapReduce C++ support

      Requirements

      1. Allow users to write Map, Reduce, RecordReader, and RecordWriter functions in C++; the rest of the infrastructure already present in Java should be reused.
      2. Avoid users having to write both Java and C++ for this to work.
      3. Avoid users having to work with JNI methods directly by wrapping them in helper functions.
      4. The interface should be SWIG'able (see the sketch below).
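
      As a rough illustration, the kind of C++ interface these requirements imply might look like the following. This is a hypothetical sketch, not the API that was eventually committed; all names are illustrative.

      // Hypothetical sketch only: small, pure-virtual interfaces with plain
      // std::string keys and values and no JNI types in the signatures,
      // which keeps them straightforward to wrap with SWIG.
      #include <string>

      namespace sketch {

      class RecordReader {
      public:
        // Fills key and value with the raw bytes of the next record;
        // returns false at end of input.
        virtual bool next(std::string& key, std::string& value) = 0;
        virtual ~RecordReader() {}
      };

      class RecordWriter {
      public:
        virtual void emit(const std::string& key, const std::string& value) = 0;
        virtual ~RecordWriter() {}
      };

      class Mapper {
      public:
        virtual void map(const std::string& key, const std::string& value,
                         RecordWriter& output) = 0;
        virtual ~Mapper() {}
      };

      class ValueIterator {
      public:
        virtual bool next(std::string& value) = 0;
        virtual ~ValueIterator() {}
      };

      class Reducer {
      public:
        virtual void reduce(const std::string& key, ValueIterator& values,
                            RecordWriter& output) = 0;
        virtual ~Reducer() {}
      };

      }  // namespace sketch

      Because user code only sees these small interfaces, the JNI plumbing can live entirely in framework-provided glue, which is what requirement 3 asks for.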

      Attachments

      1. pipes-2.patch (3.98 MB) - Owen O'Malley
      2. pipes.patch (3.98 MB) - Owen O'Malley
      3. hadoop-pipes.html (11 kB) - Owen O'Malley
      4. Hadoop MaReduce Developer doc.pdf (112 kB) - Sanjay Dahiya


          Activity

          Doug Cutting added a comment -

          It would be best to avoid deserializing & reserializing objects in Java before passing them to C. So you might instead always use BytesWritable for the input and output keys and values in Java, and implement a subclass of SequenceFileInputFormat that uses SequenceFile.next(DataOutputBuffer) to copy the raw binary content into the BytesWritable. Then your mapper and reducer implementations could simply get the bytes from the BytesWritable and pass them to the native method, without re-serializing. Does this make sense?

          Owen O'Malley added a comment -

          I agree with Doug that if you are going to throw away the advantages of actually having meaningful types on the C++ side, then the keys and values on the Java side should be BytesWritable.

          That said, I think it would be much less error-prone for the users, and easier to understand and debug, if you followed the Hadoop API much more closely. Define Writable and WritableComparable interfaces in C++. The Record IO classes will support them with a minor change to the code generator.
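
          A rough sketch of what such interfaces might look like in C++, mirroring the Java Writable/WritableComparable contract; the stream types and method names here are assumptions for illustration, not the actual record IO API:

          // Sketch only: a C++ analogue of org.apache.hadoop.io.Writable and
          // WritableComparable.  InStream/OutStream stand in for whatever
          // byte-stream abstraction record IO exposes.
          namespace sketch {

          class InStream;   // record IO input stream (reads bytes)
          class OutStream;  // record IO output stream (writes bytes)

          class Writable {
          public:
            virtual void serialize(OutStream& out) const = 0;  // Java: write(DataOutput)
            virtual void deserialize(InStream& in) = 0;        // Java: readFields(DataInput)
            virtual ~Writable() {}
          };

          class WritableComparable : public Writable {
          public:
            // Total ordering over keys, analogous to Java's Comparable.compareTo.
            virtual int compareTo(const WritableComparable& other) const = 0;
          };

          }  // namespace sketch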

          Sanjay Dahiya added a comment -

          I agree that keys/values should be BytesWritable in case we are not using typed data (Records). But I am trying to understand how to avoid serializing/deserializing multiple times between Java and C++ and still use Record IO.
          When we define a new record format and generate classes to work with the record, the generated classes contain a Reader and a Writer for the record. These read/write 'Record' objects from streams. One way to implement this would be to modify the class generation and make Reader extend SequenceInputFile and (optionally) return a BytesWritable rather than a Record. The idea is that the Record knows how to read itself, its size, etc., so we let it read from the input, but it reads into a BytesWritable rather than the record type in Java. We pass this BytesWritable to C++ and do the deserialization only once, to get the right type.
          With this in place we should be able to avoid the need for serialization/deserialization, and users will not need to write extra code per type of record.

          Doug Cutting added a comment -

          I think we can mostly achieve what we want by using the SequenceFile.next(DataOutputBuffer) method to read the raw bytes of each entry, regardless of the declared types. Thus the SequenceFile can still be created with the correct Java key and value classes, but when we actually read entries in Java we won't use instances of those classes, but rather just read the raw content of the entry into a byte-array container like BytesWritable that is passed to C. Then the C code can deserialize the Record instance from the byte-array container. So a C-based mapper should specify the real input key and value classes, but its InputFormat implementation will ignore that when entries are read, passing raw bytes to C.

          Milind Bhandarkar added a comment -

          This can be done by providing just one C++ class in record IO. Currently, the C++ version of record IO generates methods for each class that read from and write to the InStream and OutStream interfaces, which contain only read and write methods. Creation of concrete classes that implement these interfaces is outside of the code generation. One could provide a byteStream class in C++ that provides these interfaces. The construction of BytesOutStream happens in C++; after serializing the C++ record, this stream goes to Java via JNI, where it is converted to a BytesWritable and written to the SequenceFile. BytesInStream is created on the Java side, tied to the SequenceFile, and supplies bytes to the C++ record.
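
          A minimal sketch of that byteStream idea, assuming the record IO stream interfaces expose only a read and a write method (the exact names and signatures here are assumptions for illustration):

          // Sketch only: in-memory byte streams that generated C++ records could
          // serialize to / deserialize from; the buffers cross JNI as plain byte
          // arrays and become BytesWritable on the Java side.
          #include <algorithm>
          #include <cstddef>
          #include <cstring>
          #include <string>

          namespace sketch {

          class InStream {
          public:
            virtual size_t read(void* buf, size_t len) = 0;
            virtual ~InStream() {}
          };

          class OutStream {
          public:
            virtual size_t write(const void* buf, size_t len) = 0;
            virtual ~OutStream() {}
          };

          // Collects a serialized C++ record; buffer() is what would be handed to
          // Java over JNI and wrapped in a BytesWritable.
          class BytesOutStream : public OutStream {
          public:
            size_t write(const void* buf, size_t len) {
              data_.append(static_cast<const char*>(buf), len);
              return len;
            }
            const std::string& buffer() const { return data_; }
          private:
            std::string data_;
          };

          // Supplies the raw bytes that Java read out of the SequenceFile, so a
          // generated record can deserialize itself exactly once, on the C++ side.
          class BytesInStream : public InStream {
          public:
            explicit BytesInStream(const std::string& data) : data_(data), pos_(0) {}
            size_t read(void* buf, size_t len) {
              size_t n = std::min(len, data_.size() - pos_);
              std::memcpy(buf, data_.data() + pos_, n);
              pos_ += n;
              return n;
            }
          private:
            std::string data_;
            size_t pos_;
          };

          }  // namespace sketch

          A generated record would then serialize itself into a BytesOutStream before the buffer is passed to Java, and deserialize itself from a BytesInStream built from the bytes received over JNI.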

          Sanjay Dahiya added a comment -

          I am not sure if I am following correctly here; I am new to the codebase, so I took some time to document the MapReduce flow in the system. There are some things I am not clear about (bold italics in the attachment); it would be great if you could clarify. It may help others new to the project.

          In this specific case, RecordReader.next() is called by the framework (MapTask), and the next() method knows the boundary of the next Record in the file. If it is a complex Record IO type, the boundary won't be known to anyone except the generated object, which contains the code for reading and writing the object. The part that is not clear to me is how a generic SequenceFile would know the sizes/boundaries of Record IO objects. If the generated code also contains code for reading the next record into a BytesWritable in addition to a Record, this will be easy to implement.
          I think I am missing something here; can you please clarify?

          Sanjay Dahiya added a comment -

          Updated ...

          Sanjay Dahiya added a comment -

          [[ Old comment, sent by email on Mon, 22 May 2006 17:28:16 +0530 ]]

          Hi Doug -
          I was also looking for a way to avoid serialization and de-serialization, but I am still not clear how we use the existing record IO with this (without modifying the generated classes).
          When we define a new record format and generate classes to work with the record, the generated classes contain a Reader and a Writer for the record. These read/write 'Record' objects from streams. One way to implement this would be to modify the class generation and make Reader extend SequenceInputFile and (optionally) return a BytesWritable rather than a Record. With this in place we should be able to avoid the need for serialization/de-serialization and users will not need to write extra code per type of record. Or am I missing something here?

          ~Sanjay

          Owen O'Malley added a comment -

          This is my current plan of approach.

          Doug Cutting added a comment -

          +1 This proposal looks good. I note that the contexts could, with the addition of a nextKey() method, be turned into iterators, permitting alternate map and reduce control structures, like MapRunnable. This may prove to be a feature.
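
          A hypothetical illustration of that iterator-style context; the method names are made up for this sketch, not taken from the attached proposal:

          // Hypothetical sketch of an iterator-style map context; nextKey() is the
          // proposed addition that lets user code drive the input loop itself,
          // MapRunnable-style.
          #include <string>

          namespace sketch {

          class MapContext {
          public:
            virtual bool nextKey() = 0;  // advance to the next record; false at end of split
            virtual const std::string& getInputKey() = 0;
            virtual const std::string& getInputValue() = 0;
            virtual void emit(const std::string& key, const std::string& value) = 0;
            virtual ~MapContext() {}
          };

          // With nextKey(), a mapper can consume the whole split itself instead of
          // being called back once per record.
          class IdentityMapRunner {
          public:
            void run(MapContext& context) {
              while (context.nextKey()) {
                context.emit(context.getInputKey(), context.getInputValue());
              }
            }
          };

          }  // namespace sketch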

          Owen O'Malley added a comment -

          Here is a first pass of this patch. I still need to add unit tests.

          Doug Cutting added a comment -

          A few quick comments:

          • the pipes package should have a package.html file
          • only Submitter.java should be public, right?
          • one must 'chmod +x' the configure scripts
          • the create-c++-configure task fails for me on Ubuntu w/ 'Can't exec "aclocal"'
          • 'bin/hadoop pipes' should print a helpful message
          • a README.txt file in src/examples/pipes telling how to run things would go a long way

          Hadoop QA added a comment -

          +1 http://issues.apache.org/jira/secure/attachment/12357433/pipes-2.patch applied and successfully tested against trunk revision r538318.
          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/146/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/146/console

          Doug Cutting added a comment -

          I just committed this. Thanks, Owen!

          Hadoop QA added a comment -

          Integrated in Hadoop-Nightly #91 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/91/ )

            People

            • Assignee: Owen O'Malley
            • Reporter: Sanjay Dahiya
            • Votes: 0
            • Watchers: 1
