
Key: HADOOP-234
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Owen O'Malley
Reporter: Sanjay Dahiya
Votes: 0
Watchers: 0
Hadoop Common

Hadoop Pipes for writing map/reduce jobs in C++ and python

Created: 19/May/06 04:28 PM   Updated: 08/Jul/09 04:41 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.14.0

Time Tracking:
Not Specified

File Attachments:
  File | Date | Uploaded by | Size | License
  Hadoop MaReduce Developer doc.pdf | 2006-05-30 06:58 PM | Sanjay Dahiya | 112 kB |
  hadoop-pipes.html | 2007-02-18 08:28 AM | Owen O'Malley | 11 kB | Licensed for inclusion in ASF works
  pipes-2.patch | 2007-05-16 12:21 AM | Owen O'Malley | 3.98 MB | Licensed for inclusion in ASF works
  pipes.patch | 2007-04-18 10:30 PM | Owen O'Malley | 3.98 MB | Licensed for inclusion in ASF works

Resolution Date: 16/May/07 07:23 PM


Description
MapReduce C++ support

Requirements

1. Allow users to write Map, Reduce, RecordReader, and RecordWriter functions in C++; the rest of the infrastructure already present in Java should be reused.
2. Avoid users having to write both Java and C++ for this to work.
3. Avoid users having to work with JNI methods directly by wrapping them in helper functions.
4. The interface should be SWIG'able.



Doug Cutting added a comment - 20/May/06 02:49 AM
It would be best to avoid deserializing & reserializing objects in Java before passing them to C. So you might instead always use BytesWritable for the input and output keys and values in Java, and implement a subclass of SequenceFileInputFormat that uses SequenceFile.next(DataOutputBuffer) to copy the raw binary content into the BytesWritable. Then your mapper and reducer implementations could simply get the bytes from the BytesWritable and pass it to the native method, without re-serializing. Does this make sense?

Owen O'Malley added a comment - 23/May/06 05:03 AM
I agree with Doug that if you are going to throw away the advantages of having meaningful types on the C++ side, then the keys and values on the Java side should be BytesWritable.

That said, I think it would be much less error-prone for users, and easier to understand and debug, if you followed the Hadoop API much more closely. Define Writable and WritableComparable interfaces in C++. The Record IO classes will support them with a minor change to the code generator.
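
To make the suggestion concrete, here is a minimal sketch of what C++ Writable and WritableComparable interfaces might look like. Everything in it is an assumption for illustration: the method signatures, the use of std::string as a byte buffer, and the IntWritable example are hypothetical and are not the API that was eventually committed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical C++ counterpart of Java's Writable: an object that can
// serialize itself to, and restore itself from, a byte buffer.
class Writable {
public:
  virtual ~Writable() = default;
  // Append the serialized form of this object to out.
  virtual void write(std::string& out) const = 0;
  // Restore this object from in, starting at pos; advances pos.
  virtual void readFields(const std::string& in, std::size_t& pos) = 0;
};

// Hypothetical counterpart of WritableComparable: a Writable usable as a key.
class WritableComparable : public Writable {
public:
  // Negative, zero, or positive, like Java's compareTo.
  virtual int compareTo(const WritableComparable& other) const = 0;
};

// Example key type: a 32-bit integer stored as 4 big-endian bytes,
// the same layout Java's DataOutput.writeInt produces.
class IntWritable : public WritableComparable {
public:
  explicit IntWritable(std::int32_t v = 0) : value_(v) {}
  std::int32_t get() const { return value_; }

  void write(std::string& out) const override {
    const std::uint32_t u = static_cast<std::uint32_t>(value_);
    for (int shift = 24; shift >= 0; shift -= 8)
      out.push_back(static_cast<char>((u >> shift) & 0xFF));
  }

  void readFields(const std::string& in, std::size_t& pos) override {
    assert(pos <= in.size() && in.size() - pos >= 4);
    std::uint32_t u = 0;
    for (int i = 0; i < 4; ++i)
      u = (u << 8) | static_cast<unsigned char>(in[pos++]);
    value_ = static_cast<std::int32_t>(u);
  }

  int compareTo(const WritableComparable& other) const override {
    // Sketch only: assumes other is also an IntWritable.
    const IntWritable& o = static_cast<const IntWritable&>(other);
    return (value_ > o.value_) - (value_ < o.value_);
  }

private:
  std::int32_t value_;
};
```

With interfaces like these, the C++ side keeps typed keys and values while the Java side still only moves bytes.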


Sanjay Dahiya added a comment - 23/May/06 01:28 PM
I agree that keys/values should be BytesWritable when we are not using typed data (Records). But I am trying to understand how to avoid serializing and deserializing multiple times between Java and C++ and still use Record IO.
When we define a new record format and generate classes to work with the record, the generated classes contain a Reader and a Writer for the record. These read/write 'Record' objects from streams. One way to implement this would be to modify the class generation and make Reader extend SequenceInputFile and optionally return a BytesWritable rather than a Record. The idea is that the Record knows how to read itself, its size, etc., so we let it read from the input, but in Java it reads into a BytesWritable rather than the record type. We then pass this BytesWritable to C++ and deserialize only once to get the right type.
With this in place we should be able to avoid the repeated serialization/de-serialization, and users will not need to write extra code per type of record.

Doug Cutting added a comment - 23/May/06 11:48 PM
I think we can mostly achieve what we want by using the SequenceFile.next(DataOutputBuffer) method to read the raw bytes of each entry, regardless of the declared types. Thus the SequenceFile can still be created with the correct Java key and value classes, but when we actually read entries in Java we won't use instances of those classes, but rather just read the raw content of the entry into a byte-array container like BytesWritable that is passed to C. Then the C code can deserialize the Record instance from the byte-array container. So a C-based mapper should specify the real input key and value classes, but its InputFormat implementation will ignore that when entries are read, passing raw bytes to C.
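
The idea of reading entries as opaque bytes, regardless of the declared types, can be sketched as follows. This assumes a simplified layout in which every entry is stored as a 4-byte record length, a 4-byte key length, the key bytes, and the value bytes; the real SequenceFile format also has a header, sync markers, and optional compression, none of which are modeled here. The names RawEntry, appendRaw, and nextRaw are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Raw, still-serialized key and value of one entry.
struct RawEntry {
  std::string key;
  std::string value;
};

// Append a 32-bit value as 4 big-endian bytes.
static void writeInt(std::string& buf, std::uint32_t v) {
  for (int shift = 24; shift >= 0; shift -= 8)
    buf.push_back(static_cast<char>((v >> shift) & 0xFF));
}

// Read a 4-byte big-endian value at pos; returns false if fewer than 4 bytes remain.
static bool readInt(const std::string& buf, std::size_t& pos, std::uint32_t& out) {
  if (pos > buf.size() || buf.size() - pos < 4) return false;
  out = 0;
  for (int i = 0; i < 4; ++i)
    out = (out << 8) | static_cast<unsigned char>(buf[pos++]);
  return true;
}

// Assumed layout: [recordLen][keyLen][key bytes][value bytes],
// where recordLen = key length + value length.
void appendRaw(std::string& file, const std::string& key, const std::string& value) {
  assert(key.size() + value.size() <= 0xFFFFFFFFu);
  writeInt(file, static_cast<std::uint32_t>(key.size() + value.size()));
  writeInt(file, static_cast<std::uint32_t>(key.size()));
  file += key;
  file += value;
}

// Copy the next entry's bytes without knowing the key or value types.
// Returns false at end of input or on a malformed entry; pos is unchanged then.
bool nextRaw(const std::string& file, std::size_t& pos, RawEntry& entry) {
  std::size_t p = pos;
  std::uint32_t recLen = 0, keyLen = 0;
  if (!readInt(file, p, recLen) || !readInt(file, p, keyLen)) return false;
  if (keyLen > recLen || file.size() - p < recLen) return false;
  entry.key.assign(file, p, keyLen);
  entry.value.assign(file, p + keyLen, recLen - keyLen);
  pos = p + recLen;
  return true;
}
```

Because the lengths are stored with each entry, the reader can find entry boundaries without any help from the record's own deserialization code; only the C or C++ side needs to understand the bytes.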

Milind Bhandarkar added a comment - 24/May/06 12:32 AM
This can be done by providing just one C++ class in Record IO. Currently, the C++ version of Record IO generates methods for each class that read from and write to the InStream and OutStream interfaces, which contain only read and write methods. Creation of concrete classes that implement these interfaces is outside the code generation. One could provide a byte stream class in C++ that implements these interfaces. The BytesOutStream is constructed in C++, and after the C++ record is serialized into it, the stream goes to Java via JNI, where it is converted to a BytesWritable and written to the SequenceFile. The BytesInStream is created in Java, tied to the SequenceFile, and supplies bytes to the C++ record.
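
A minimal sketch of such byte stream classes is below. The InStream and OutStream interfaces here are stand-ins written from the description above (read and write methods only); their exact signatures in the Record IO library are an assumption, and the JNI hand-off to Java is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for the Record IO input interface (signature assumed).
class InStream {
public:
  virtual ~InStream() = default;
  // Copy up to len bytes into buf; returns the number of bytes copied.
  virtual std::size_t read(void* buf, std::size_t len) = 0;
};

// Stand-in for the Record IO output interface (signature assumed).
class OutStream {
public:
  virtual ~OutStream() = default;
  // Consume len bytes from buf; returns the number of bytes written.
  virtual std::size_t write(const void* buf, std::size_t len) = 0;
};

// Collects serialized record bytes in memory so they can be handed
// to Java as a single byte array.
class BytesOutStream : public OutStream {
public:
  std::size_t write(const void* buf, std::size_t len) override {
    assert(buf != nullptr || len == 0);
    const char* p = static_cast<const char*>(buf);
    data_.insert(data_.end(), p, p + len);
    return len;
  }
  const std::vector<char>& bytes() const { return data_; }

private:
  std::vector<char> data_;
};

// Serves bytes from a caller-owned buffer, such as one received from Java.
// The buffer must outlive the stream.
class BytesInStream : public InStream {
public:
  BytesInStream(const char* data, std::size_t len)
      : data_(data), len_(len), pos_(0) {}

  std::size_t read(void* buf, std::size_t len) override {
    assert(buf != nullptr || len == 0);
    const std::size_t n = std::min(len, len_ - pos_);
    if (n > 0) std::memcpy(buf, data_ + pos_, n);
    pos_ += n;
    return n;
  }

private:
  const char* data_;
  std::size_t len_;
  std::size_t pos_;
};
```

A generated record class that already serializes to OutStream and deserializes from InStream could then be used unchanged with these two classes.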

Sanjay Dahiya added a comment - 26/May/06 07:17 AM
I am not sure I am following correctly here. I am new to the codebase, so I took some time to document the MapReduce flow in the system. There are some things I am not clear about (bold italics in the attachment). It would be great if you could clarify; it may also help others new to the project.

In this specific case, RecordReader.next() is called by the framework (MapTask), and the next() method knows the boundary of the next record in the file. If it is a complex Record IO type, the boundary won't be known to anyone except the generated object, which contains the code for reading and writing the object. What is not clear to me is how a generic SequenceFile would know the size/boundaries of Record IO objects. If the generated code could read the next record into a BytesWritable in addition to a Record, this would be easy to implement.
I think I am missing something here; can you please clarify?


Sanjay Dahiya added a comment - 30/May/06 06:58 PM
Updated ...

Sanjay Dahiya added a comment - 14/Oct/06 02:07 AM

[[ Old comment, sent by email on Mon, 22 May 2006 17:28:16 +0530 ]]

Hi Doug -
I was also looking for a way to avoid serialization and
de-serialization, but I am still not clear how we would use the existing
Record IO with this (without modifying the generated classes).
When we define a new record format and generate classes to work with
the record, the generated classes contain a Reader and a Writer for the
record. These read/write 'Record' objects from streams. One way to
implement this would be to modify the class generation and make
Reader extend SequenceInputFile and optionally return a BytesWritable
rather than a Record. With this in place we should be able to avoid
the need for serialization/de-serialization, and users will not need
to write extra code per type of record. Or am I missing something here?

~Sanjay


Owen O'Malley added a comment - 18/Feb/07 08:28 AM
This is my current plan of approach.

Doug Cutting added a comment - 19/Feb/07 10:44 PM
+1 This proposal looks good. I note that the contexts could, with the addition of a nextKey() method, be turned into iterators, permitting alternate map and reduce control structures, like MapRunnable. This may prove to be a feature.
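
As a rough illustration of the iterator idea, the sketch below shows a pull-style reduce context in which user code drives the loop over keys and values. The class and method names (ReduceContext, nextKey, nextValue, getInputKey, getInputValue) are hypothetical and only loosely modeled on the description here; the sketch reads from an in-memory, key-sorted list rather than from the framework.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pull-style context over key-sorted (key, value) pairs.
class ReduceContext {
public:
  explicit ReduceContext(std::vector<std::pair<std::string, std::string>> sorted)
      : pairs_(std::move(sorted)), next_(0), started_(false) {}

  // Move to the next key group, skipping any unread values of the
  // current one. Returns false when the input is exhausted.
  bool nextKey() {
    while (started_ && next_ < pairs_.size() && pairs_[next_].first == key_)
      ++next_;
    if (next_ >= pairs_.size()) return false;
    key_ = pairs_[next_].first;
    started_ = true;
    return true;
  }

  // Move to the next value of the current key. Returns false when the
  // current key group has no more values.
  bool nextValue() {
    if (!started_ || next_ >= pairs_.size() || pairs_[next_].first != key_)
      return false;
    value_ = pairs_[next_++].second;
    return true;
  }

  const std::string& getInputKey() const {
    assert(started_);
    return key_;
  }
  const std::string& getInputValue() const { return value_; }

private:
  std::vector<std::pair<std::string, std::string>> pairs_;
  std::size_t next_;
  bool started_;
  std::string key_;
  std::string value_;
};

// User-driven control structure: count the values seen for each key.
std::vector<std::pair<std::string, int>> countValues(ReduceContext& ctx) {
  std::vector<std::pair<std::string, int>> result;
  while (ctx.nextKey()) {
    int n = 0;
    while (ctx.nextValue()) ++n;
    result.emplace_back(ctx.getInputKey(), n);
  }
  return result;
}
```

Here the reducer owns both loops, which is what would permit alternate control structures instead of one callback per key.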

Owen O'Malley added a comment - 18/Apr/07 10:30 PM
Here is a first pass of this patch. I still need to add unit tests.

Doug Cutting added a comment - 20/Apr/07 09:25 PM
A few quick comments:
  • the pipes package should have a package.html file
  • only Submitter.java should be public, right?
  • one must 'chmod +x' the configure scripts
  • the create-c++-configure task fails for me on Ubuntu w/ 'Can't exec "aclocal"'
  • 'bin/hadoop pipes' should print a helpful message
  • a README.txt file in src/examples/pipes telling how to run things would go a long way


Doug Cutting added a comment - 16/May/07 07:23 PM
I just committed this. Thanks, Owen!

Hadoop QA added a comment - 17/May/07 11:26 AM