Description
Extending Hama Pipes to use C++ templates
Currently all messages are converted to strings before they are transferred over a socket communication between C++ and Java and vice versa.
To take advantage of the binary socket communication we will serialize and deserialize basic types like int, long, float, double directly without converting to strings. This will minimize the risk of type conversation errors. Other types (except these basic types) are transferred as strings.
It's also possible to create custom Writables and serialize and deserialize the object to string by overriding the following methods. (e.g., PipesVectorWritable and PipesKeyValueWritable)
@Override public void readFields(DataInput in) throws IOException @Override public void write(DataOutput out) throws IOException
Hama Streaming, which depends on Hama Pipes, is still using strings.
The following methods change
virtual void sendMessage(const string& peerName, const string& msg)
virtual const string& getCurrentMessage()
virtual void write(const string& key, const string& value)
virtual bool readNext(string& key, string& value)
to support C++ templates:
virtual void sendMessage(const string& peer_name, const M& msg)
virtual M getCurrentMessage()
virtual void write(const K2& key, const V2& value)
virtual bool readNext(K1& key, V1& value)
Also SequenceFile functions uses templates:
bool sequenceFileReadNext(int32_t file_id, K& key, V& value)
bool sequenceFileAppend(int32_t file_id, const K& key, const V& value)
And the native Partitioner supports it:
template<class K1, class V1, class K2, class V2, class M> class Partitioner { public: virtual int partition(const K1& key, const V1& value, int32_t num_tasks) = 0; virtual ~Partitioner() {} };
This will minimize type conversation errors and change the compilation procedure. Because of the nature of C++ templates, static libraries are not possible anymore. The compiler will substitute all templates at compile time.
The compile command will look like:
g++ -m64 -Ic++/src/main/native/utils/api \ -Ic++/src/main/native/pipes/api \ -Lc++/target/native \ -lhadooputils -lpthread \ PROGRAM.cc \ -o PROGRAM \ -g -Wall -O2
Finally the job configuration supports the following properties:
<property> <name>bsp.input.format.class</name> <value>org.apache.hama.bsp.KeyValueTextInputFormat</value> </property> <property> <name>bsp.input.key.class</name> <value>org.apache.hadoop.io.Text</value> </property> <property> <name>bsp.input.value.class</name> <value>org.apache.hadoop.io.Text</value> </property> <property> <name>bsp.output.format.class</name> <value>org.apache.hama.bsp.SequenceFileOutputFormat</value> </property> <property> <name>bsp.output.key.class</name> <value>org.apache.hadoop.io.Text</value> </property> <property> <name>bsp.output.value.class</name> <value>org.apache.hadoop.io.DoubleWritable</value> </property> <property> <name>bsp.message.class</name> <value>org.apache.hadoop.io.DoubleWritable</value> </property>