|
[
Permlink
| « Hide
]
Doug Cutting added a comment - 20/May/06 02:49 AM
It would be best to avoid deserializing & reserializing objects in Java before passing them to C. So you might instead always use BytesWritable for the input and output keys and values in Java, and implement a subclass of SequenceFileInputFormat that uses SequenceFile.next(DataOutputBuffer) to copy the raw binary content into the BytesWritable. Then your mapper and reducer implementations could simply get the bytes from the BytesWritable and pass it to the native method, without re-serializing. Does this make sense?
I agree with Doug that if you are going to throw away the advantages of actually having meaningful types on the C++ that the keys and values on the Java side should be BytesWritable.
That said, I think it would be much less error-prone for the users and easier to understand and debug if you followed the Hadoop API much closer. Define a Writable and WritableComparable interfaces in C++. The Record IO classes will support them with a minor change to the code generator. I agree that keys/values should be BytesWritable in case we are not using typed data (Records). But I am trying to understand how to to avoid serialization/de-serialization multiple times between Java and C++ and still use Record IO.
So when we define a new record format and generate classes to work with the record, the generated classes contain Reader and Writer for the record. These read/write 'Record' objects from streams. One way to implement this would be to modify the class generation and make Reader extend SequenceInputFile and return(optionally) a ByetWritable rather than a Record. The idea is that the Record know how to read itself and its size etc. so we let it read from the input but it reads in BytesWritable rather than the record type in Java. we pass this BytesWritable to C++ and do deserialization only once to get the right type. With this in place we should be able to avoid the need for serialization/de-serialization and users will not need to write extra code per type of record. I think we can mostly achieve what we want by using the SequenceFile.next(DataOutputBuffer) method to read the raw bytes of each entry, regardless of the declared types. Thus the SequenceFile can still be created with the correct Java key and value classes, but when we actually read entries in Java we won't use instances of those classes, but rather just read the raw content of the entry into a byte-array container like BytesWritable that is passed to C. Then the C code can deserialize the Record instance from the byte-array container. So a C-based mapper should specify the real input key and value classes, but its InputFormat implementation will ignore that when entries are read, passing raw bytes to C.
This can be done by providing just one C++ class in record IO. Currently, C++ version for record IO generates methods for each class that read from and write to InStream and OutStream interfaces, that contain only read and write methods. Creation of concrete classes that implement these interfaces is outside of the code generation. One could proovide a byteStream class in C++ that provides these interfaces. The construction of BytesOutStream happens in C++ and after serializing C++ record, this stream goes to Java via JNI, which is then converted to BytesWritable and written to the sequencefile. BytesInStream is created in Java tied with the sequencefile, and supplies bytes to the C++ record.
I am not sure if I am following correctly here, I am new to the codebase and so I took some time to document the MapReduce flow in the system. There are some things I am not clear about (bold italics in attachment). would be great if you could clarify. It may help others new to the project.
In this specific case, RecordReader.next() is called by the framework (MapTask) and next() method knows the boundry of the next Record in the File. In case it is a complex RecordIO type the boundry wont be known to anyone except the generated object, which contains code for reading and writing the object. This is the part which is not clear how a generic SequenceFile would know size/boundries of RecordIO objects. If the generated code contains code for reading the next record in a ByteWritable in addition to Record, this will be easy to implement.
Sanjay Dahiya made changes - 26/May/06 07:17 AM
Sanjay Dahiya made changes - 30/May/06 06:58 PM
Sameer Paranjpye made changes - 31/May/06 07:25 AM
Doug Cutting made changes - 06/Jun/06 06:16 AM
Doug Cutting made changes - 07/Jun/06 04:38 AM
Doug Cutting made changes - 29/Jun/06 04:11 AM
Doug Cutting made changes - 03/Aug/06 05:46 PM
Doug Cutting made changes - 04/Aug/06 08:05 PM
Doug Cutting made changes - 08/Sep/06 08:23 PM
[[ Old comment, sent by email on Mon, 22 May 2006 17:28:16 +0530 ]] Hi Doug - ~Sanjay
Doug Cutting made changes - 15/Dec/06 09:40 PM
Owen O'Malley made changes - 18/Feb/07 08:27 AM
This is my current plan of approach.
Owen O'Malley made changes - 18/Feb/07 08:28 AM
Owen O'Malley made changes - 18/Feb/07 08:30 AM
Owen O'Malley made changes - 18/Feb/07 08:31 AM
+1 This proposal looks good. I note that the contexts could, with the addition of a nextKey() method, be turned into iterators, permitting alternate map and reduce control structures, like MapRunnable. This may prove to be a feature.
Doug Cutting made changes - 02/Mar/07 10:17 PM
Owen O'Malley made changes - 12/Apr/07 05:22 PM
Owen O'Malley made changes - 12/Apr/07 05:22 PM
Here is a first pass of this patch. I still need to add unit tests.
Owen O'Malley made changes - 18/Apr/07 10:30 PM
Owen O'Malley made changes - 20/Apr/07 08:50 PM
A few quick comments:
Owen O'Malley made changes - 16/May/07 12:17 AM
Owen O'Malley made changes - 16/May/07 12:21 AM
Owen O'Malley made changes - 16/May/07 12:22 AM
+1
http://issues.apache.org/jira/secure/attachment/12357433/pipes-2.patch Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/146/testReport/ I just committed this. Thanks, Owen!
Doug Cutting made changes - 16/May/07 07:23 PM
Integrated in Hadoop-Nightly #91 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/91/
Doug Cutting made changes - 20/Aug/07 06:11 PM
Owen O'Malley made changes - 08/Jul/09 04:41 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||