Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.1
    • Fix Version/s: None
    • Component/s: task
    • Labels:
      None
    • Environment:

      hadoop linux

    • Hadoop Flags:
      Incompatible change
    • Tags:
      PIPES C++

      Description

      Hadoop C++ Extension is an internal project at Baidu. We started it for these reasons:
      1 To provide a C++ API. We mostly used Streaming before, and we also tried PIPES, but we did not find PIPES to be more efficient than Streaming, so we think a new C++ extension is needed for us.
      2 Even when using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce Child JVM.
      3 It costs a great deal to read/write/sort TB/PB-scale data in Java, and when using PIPES or Streaming, a pipe or socket is not efficient for carrying such huge volumes of data.

      What we want to do:
      1 We do not use the map/reduce Child JVM for any data processing; it just prepares the environment, starts the C++ mapper, tells the mapper which split to deal with, and reads reports from the mapper until it finishes. The mapper reads records, invokes the user-defined map, does the partitioning, writes spills, combines, and merges into file.out. We think these operations can be done in C++ code.
      2 The reducer is similar to the mapper; it is started after the sort finishes, reads from the sorted files, invokes the user-defined reduce, and writes to the user-defined record writer.
      3 We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
      At first, 1 and 2, then 3.

      What's the difference from PIPES:
      1 Yes, we will reuse most of the PIPES code.
      2 But we intend to do it more completely: nothing changes in scheduling and management, but everything changes in execution.

      UPDATE:

      Now you can get a test version of HCE from this link http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
      This is a full package with all the Hadoop source code.
      Following the attached document "HCE InstallMenu.pdf", you can build and deploy it in your cluster.

      The attachment "HCE Tutorial.pdf" will lead you through writing your first HCE program and gives further specifications of the interface.

      The attachment "HCE Performance Report.pdf" gives a performance report of HCE compared to Java MapRed and Pipes.

      Any comments are welcome.

      1. HADOOP-HCE-1.0.0.patch
        7.98 MB
        Dong Yang
      2. HCE InstallMenu.pdf
        207 kB
        Fusheng Han
      3. HCE Performance Report.pdf
        349 kB
        Fusheng Han
      4. HCE Tutorial.pdf
        173 kB
        Fusheng Han
      5. Overall Design of Hadoop C++ Extension.doc
        613 kB
        Dong Yang

        Issue Links

          Activity

          Todd Lipcon added a comment -

          This is pretty interesting. How are you implementing TaskUmbilicalProtocol?

          Dong Yang added a comment -

          1. The Child JVM process is reserved; it sets up the runtime environment, starts the C++ process, and is in charge of communicating with Hadoop, excluding the data R/W logic.
          2. The Child JVM process communicates with the C++ process via stdin, stderr and stdout.
          3. The C++ process only accepts commands, processes data, and reports state; it is not concerned with scheduling or exception handling.

          Zheng Shao added a comment -

          Any progress on this?

          He Yongqiang added a comment -

          Hi Dong / Shouyan,
          Are you going to open-source this? If yes, can you post an update on the recent work? That would help others understand it better.

          Fusheng Han added a comment -

          This project is ongoing inside Baidu. The basic functions are complete. We have HCE (Hadoop C++ Extension) running smoothly with text input and without any compression. About a 20 percent improvement has been achieved compared to Streaming, using 40 GB of input and 5 nodes in this experiment; the MapReduce application is a word counter.

          The interfaces exposed to users are similar to PIPES. The Mapper interface is:

          class Mapper {
          public:
            virtual int64_t setup() { return 0; }
            virtual int64_t cleanup(bool isSuccessful) { return 0; }
            virtual int64_t map(MapInput &input) = 0;

          protected:
            virtual void emit(const void* key, const int64_t keyLength,
                              const void* value, const int64_t valueLength) {
              getContext()->emit(key, keyLength, value, valueLength);
            }
            virtual TaskContext* getContext() { return context; }
          };

          Modeled after the new Hadoop MapReduce interface, setup() and cleanup() functions are added here. MapInput is a newly defined type for the map input; the key and value can be retrieved from this object. An emit() function is provided, which can be invoked directly in the map() function. Keys and values are raw memory pointers with corresponding lengths, which is better suited to non-text data.
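
          As an illustration, a word-count mapper built on this interface might look roughly like the sketch below. It is only a sketch: the MapInput accessors value() and valueLength() are assumed names for the record accessors described above, not taken from the actual header.

          class WordCountMapper : public Mapper {
          public:
            virtual int64_t map(MapInput &input) {
              // Assumed accessors: the value of the current record as a raw
              // pointer plus its length.
              const char* line = static_cast<const char*>(input.value());
              const int64_t len = input.valueLength();
              int64_t one = 1;
              int64_t start = 0;
              for (int64_t i = 0; i <= len; ++i) {
                if (i == len || line[i] == ' ') {
                  if (i > start) {
                    emit(line + start, i - start, &one, sizeof(one));  // <word, 1>
                  }
                  start = i + 1;
                }
              }
              return 0;
            }
          };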

          The Reducer is similar to the Mapper:

          class Reducer {
          public:
            virtual int64_t setup() { return 0; }
            virtual int64_t cleanup(bool isSuccessful) { return 0; }
            virtual int64_t reduce(ReduceInput &input) = 0;

          protected:
            virtual void emit(const void* key, const int64_t keyLength,
                              const void* value, const int64_t valueLength) {
              getContext()->emit(key, keyLength, value, valueLength);
            }
            virtual TaskContext* getContext() { return context; }
          };

          A slight difference is that ReduceInput provides iteration over the values via a next() function.
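
          For illustration, a matching word-count reducer might look like this sketch. Only next() is documented above; the key()/keyLength()/value() accessors are assumed names.

          class WordCountReducer : public Reducer {
          public:
            virtual int64_t reduce(ReduceInput &input) {
              int64_t sum = 0;
              // Iterate over all values of the current key via next().
              while (input.next()) {
                sum += *static_cast<const int64_t*>(input.value());
              }
              emit(input.key(), input.keyLength(), &sum, sizeof(sum));
              return 0;
            }
          };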

          In Hadoop MapReduce, the Combiner interface is no different from the Reducer. Here we make a small change: the Combiner can only emit values (there is no key parameter in its emit function). The key is omitted from the combine emit because a mistaken key could corrupt the order of the map output; the output key of emit() is determined by the input.

          class Combiner {
          public:
            virtual int64_t setup() { return 0; }
            virtual int64_t cleanup(bool isSuccessful) { return 0; }
            virtual int64_t combine(ReduceInput &input) = 0;

          protected:
            virtual void emit(const void* value, const int64_t valueLength) {
              getContext()->emit(getCombineKey(), getCombineKeyLength(), value, valueLength);
            }
            virtual TaskContext* getContext() { return context; }
            virtual const void* getCombineKey() { return combineKey; }
            virtual int64_t getCombineKeyLength() { return combineKeyLength; }
          };
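
          For illustration, a word-count combiner under this restricted interface only sums and emits the value; the key is supplied by the framework through getCombineKey()/getCombineKeyLength(). Again a sketch: value() is an assumed ReduceInput accessor.

          class WordCountCombiner : public Combiner {
          public:
            virtual int64_t combine(ReduceInput &input) {
              int64_t sum = 0;
              while (input.next()) {
                sum += *static_cast<const int64_t*>(input.value());
              }
              emit(&sum, sizeof(sum));  // value only; the key comes from the input
              return 0;
            }
          };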

          The Partitioner also gets setup() and cleanup() functions:

          class Partitioner {
          public:
            virtual int64_t setup() { return 0; }
            virtual int64_t cleanup() { return 0; }

            virtual int partition(const void* key, const int64_t keyLength, int numOfReduces) = 0;
          };
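
          A trivial hash partitioner against this interface could look like the following sketch (the hash itself is just an example, not part of the attached code):

          class HashPartitioner : public Partitioner {
          public:
            virtual int partition(const void* key, const int64_t keyLength, int numOfReduces) {
              // Simple byte-wise hash of the raw key, mapped onto a reduce number.
              const unsigned char* bytes = static_cast<const unsigned char*>(key);
              uint64_t hash = 0;
              for (int64_t i = 0; i < keyLength; ++i) {
                hash = hash * 31 + bytes[i];
              }
              return static_cast<int>(hash % numOfReduces);
            }
          };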

          Following Pipes, we add a new entry with the name "HCE" to the hadoop command. Users run a command like "hadoop hce XXX" to invoke HCE MapReduce.

          We'd like to hear your comments.

          Arun C Murthy added a comment -

          Fusheng, this is interesting.

          Could you please put up a design document? There are several pieces I'm interested in understanding better:

          1. Changes to the framework JobTracker/TaskTracker, e.g. changes to TaskRunner
          2. Implications for job submission, serialization of the job-conf etc. from a C++ job-client
          3. I do not understand why you are changing the semantics of the Combiner; this is incompatible with Java Map-Reduce.
          4. I'd expect one to implement a C++ 'context object' for mappers, reducers etc. I don't see this in your API at all?

          I'm sure I'll have more comments once I see more details.

          Arun C Murthy added a comment -

          Fusheng, thinking about this a bit more, I have a suggestion to help push this through the Hadoop framework in a more straightforward manner and help this get committed:

          I'd propose you guys take the existing Hadoop Pipes, keep all of its APIs, and implement the map-side sort, shuffle and reduce-side merge within Pipes itself, i.e. enhance Hadoop Pipes to have all of the 'data-path'. This way we can mark the 'C++ data-path' as experimental and have it co-exist with the current functionality, thus it will be far easier to get more experience with this.

          Currently Pipes allows one to implement a C++ RecordReader for the map and a C++ RecordWriter for the reduce. We can enhance Pipes to collect the map output, sort it in C++ and write out the IFile and index for the map output. The reduces would do the shuffle, merge and 'reduce' call in C++ and use the existing infrastructure for the C++ RecordWriter to write the outputs.

          A note of caution: you will need to worry about TaskCompletionEvents, i.e. the events which let the reduces know the identity and location of completed maps; currently the reduces talk to the TaskTracker via TaskUmbilicalProtocol for this information, and this might be a sticky bit. As an intermediate step, one possible way around is to change ReduceTask.java to relay the TaskCompletionEvents from the Java Child to the C++ reducer.

          In terms of development, you could start developing on a svn branch of hadoop pipes.

          Thoughts?

          Fusheng Han added a comment -

          Arun, I appreciate your comments.

          The bad news is that our design document is written in Chinese. My team members and I will put some design details step by step in the next few days.

          For Q3, we did indeed change the interface of the Combiner, but its semantics are the same as in Java Map-Reduce; the change prevents mistaken use of the Combiner. Consider the situation where two spills with sorted records are merged into file.out (the output of the map phase). The data flow is:
          -> the two spills are read in a merged (sorted) way
          -> the Combiner receives sorted <key, value> pairs
          -> after manipulation, the Combiner emits output <key, value> pairs
          -> the output is written directly to file.out
          If the Combiner emitted unrelated keys, the records in file.out would not be fully sorted. In our interface, the Combiner is not allowed to emit a key; the output key is determined by the input, so the order of records in file.out is guaranteed.

          to be continued...

          Luke Lu added a comment -

          Fusheng, feel free to attach the design doc if there is nothing confidential in it and Shouyan approves. There are plenty of people on this thread who understand Chinese. It'd help me explain some details to Arun, now that I work next to him.

          On the combiner interface, I think it'd be better to add a convenient emitValue method instead of changing the interface, as there are quite a few legitimate uses.

          Arun C Murthy added a comment -

          The bad news is that our design document is written in Chinese. My team members and I will put some design details step by step in the next few days.

          Thanks!

          For Q3, we indeed change the interface of Combiner, while the semantics for Combiner is the same with Java Map-Reduce. It prevents mistaken use of Combiner.

          It's a reasonable argument, but I'd recommend we stay compatible with both Java Map-Reduce and Pipes by having the same interface. FYI: both Java and Pipes explicitly disallow changing the key in the combiner in the 'contract'. If the user does go ahead and change the key, the application is not guaranteed to work.


          In terms of APIs, as I previously mentioned, I strongly recommend you start from the Hadoop Pipes APIs and enhance them - this will ensure compatibility between Hadoop Pipes and HCE - and again, please consider moving the sort/shuffle/merge into Hadoop Pipes as I recommended previously.

          Hong Tang added a comment -

          The bad news is that our design document is written in Chinese. My team members and I will put some design details step by step in the next few days.

          There are many hadoop devs fluent in Chinese, so it might still be a good idea to share the original design doc.

          Wang Shouyan added a comment -

          "In terms of apis, as I previously mentioned I stronly recommend you start using the Hadoop Pipes apis and enhance it - this will ensure compatibility between Hadoop Pipes and HCE - again, please consider moving the sort/shuffle/merge to Hadoop Pipes as I recommended previously."

          I do not agree with this opinion. If we need to establish a standard C++ API, I don't think we need to be completely compatible with the Pipes API, because I don't think the Pipes API was carefully considered; it may exist for compatibility with some other code, but it has never been discussed adequately.

          If we do need a C++ API, we should consider usability and extensibility more than compatibility, because I don't see that such a compatibility problem is a problem for most users.

          If it is for usability and extensibility, any suggestion is welcome.

          Owen O'Malley added a comment -

          I don't think we need to be completely compatible with the Pipes API

          I don't think there is enough motivation to have two different C++ APIs, so you should use the same interface. That does not mean that you can't change the API to be better. You can and should help make the APIs more usable and extensible.

          If we do need a C++ API, we should consider usability and extensibility more than compatibility, because I don't see that such a compatibility problem is a problem for most users.

          There is a requirement to provide backwards compatibility of all of Hadoop's public APIs with the previous version. APIs and interfaces can be deprecated and then removed in a later version, but compatibility is not optional.

          Owen O'Malley added a comment -

          By the way, here is an archive of the message that I sent back in Nov 07 comparing the performance of Java, pipes, and streaming.

          http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg02961.html

          Especially by reimplementing the sort and shuffle, you should be able to get much faster than Java. :)

          Dong Yang added a comment -

          Hadoop C++ Extension (HCE for short) is a framework for making MapReduce more stable and faster.
          Here is the overall design of HCE; you are welcome to give your viewpoints on its practical implementation.

          zhang.pengfei added a comment -

          Woo! Sounds so cool!

          So now you want to open-source it?

          Come on!

          Owen O'Malley added a comment -

          Posting entire tarballs isn't very useful. Can you include your changes as a patch?

          Wang Shouyan added a comment -

          Posting the entire tarball is just for trial; we will deploy it in our production environment first, and provide a patch for trunk later.

          Dong Yang added a comment -

          Here is HADOOP-HCE-1.0.0.patch for the mapreduce trunk (revision 963075), which includes the Hadoop C++ Extension (HCE for short) changes against mapreduce-963075.

          The steps for using this patch are as follows:
          1. Download HADOOP-HCE-1.0.0.patch
          2. svn co -r 963075 http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk trunk-963075
          3. cd trunk-963075
          4. patch -p0 < HADOOP-HCE-1.0.0.patch
          5. sh build.sh (requires Java, Forrest and Ant)

          HCE includes Java and C++ code and depends on libhdfs, so build.sh first checks out the HDFS trunk and builds it.

          Dong Yang added a comment -

          HCE-1.0.0.patch for mapreduce trunk (revision 963075)

          Allen Wittenauer added a comment -

          This patch appears to contain code from the C++ Boost library. Someone needs to do the legwork to determine the legality of the patch.

          Doug Cutting added a comment -

          Looks like BSD:

          http://www.boost.org/LICENSE_1_0.txt

          So we'd just need to append it to LICENSE.txt, noting there which files are under this license.

          koth chen added a comment -

          I don't think a pipes-based map/reduce task will perform better than a JNI-based one! Why do you guys think socket communication will be better than a JNI method call?
          I've written a JNI-based framework for C++ Map/Reduce tasks, and ported HBase's HFile to my framework as the input/output format. It works great!

          Binglin Chang added a comment -

          Koth, in HCE the socket is only used for passing control messages (unlike C++ Pipes), which has little impact on performance. For data processing (input/map/intermediate output/reduce/output), everything is implemented in C++, so JNI is not needed, except for reading input from HDFS and writing output to HDFS, where HCE uses libhdfs, which is JNI based.
          I think a JNI-based C++ extension for MR has the advantage of being non-intrusive and has better compatibility. In the current HCE design, we need to reimplement many features that already exist in Java; some of them bring performance benefits (sort, spill), and some are purely duplicated work.
          In the current HCE design, if you want the performance benefits of HCE, the only way is to use the HCE interface. My thought is to extract the high-performance parts (sort, spill, and compression in MapOutputCollector), wrap them with JNI as a native lib like the compression codecs, and use a jobconf item to enable/disable the native optimization, so the code stays compatible and Java-based jobs can also get the performance benefits.

          Arun C Murthy added a comment -

          Can someone please help me understand the relationship between this jira and MAPREDUCE-2446?

          Arun C Murthy added a comment -

          With MAPREDUCE-279, we can now support alternate runtimes for MapReduce - do you guys want to take a look and see if we can integrate more closely? The Java layer might be completely unnecessary now...

          Binglin Chang added a comment -

          Hi Arun,
          HCE 2.0 is mainly focused on stability (bug fixes) and usability.
          Bug fixes: HCE is not very stable right now; although we have fixed a lot of bugs, the current codebase is a mess and a lot of work still needs to be done, but there is currently no time (other projects).
          Usability: (bi)streaming over HCE is now released, as well as PyHCE, since (bi)streaming and Python are much more popular than the Java API at Baidu; also C++ versions of partitioners such as KeyFieldBasedPartitioner; Input/OutputFormats such as SequenceFile, CombineInput.., multiple output; and compression codecs such as lzma, lzo and quicklz.
          As for performance, SSE optimizations (memcmp, memchr) are used (crc32c not added yet); we gain another 10-20%, both in Hadoop and in upper-level applications.

          About MR-v2:
          We keep watching your progress and have already read your design doc and some code; we look forward to further discussion on this very interesting topic.

          linyihang added a comment -

          Hello Mr. Yang,
          I am new to HCE. When I downloaded HCE 1.0 and tried to build it with "./build.sh", it failed with these errors:
          "../hadoop_hce_v1/hadoop-0.20.3/../java6/jre/bin/java: 1: Syntax error: "(" unexpected
          ../hadoop_hce_v1/hadoop-0.20.3/../java6/jre/bin/java: 1: Syntax error: "(" unexpected
          "
          Then I modified build.sh by commenting out the lines below, since I already have jdk1.6.0_21 and ant 1.8.2 installed:
          " # prepare the java and ant ENV
          #export JAVA_HOME=${workdir}/../java6
          #export ANT_HOME=${workdir}/../ant
          #export PATH=${JAVA_HOME}/bin:${ANT_HOME}/bin:$PATH
          "
          But then there seem to be some errors like
          "
          [exec] /usr/include/linux/tcp.h:77: error: ‘__u32 __fswab32(__u32)’ cannot appear in a constant-expression
          "
          and I never get "BUILD SUCCESSFUL" as the InstallMenu.pdf shows. My OS is Ubuntu 10.10, and GCC is version 4.4.5.

          Is there any mistake I have made?

          Going further, I then tried to fix the error, as someone on Google suggested, by replacing "#include <linux/tcp.h>" with "#include <netinet/tcp.h>" in "../hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/Common/Type.hh", and also adding "#include <stdint.h>" to Type.hh. But then there seem to be a lot of mistakes (as I see them), like "printf("lld")" where "printf("%lld")" was intended, and one serious error, shown below, which worries me a lot.
          The serious error is:
          "
          [exec] then mv -f ".deps/CompressionFactory.Tpo" ".deps/CompressionFactory.Po"; else rm -f ".deps/CompressionFactory.Tpo"; exit 1; fi
          [exec] In file included from /usr/include/limits.h:153,
          [exec] from /usr/lib/gcc/i686-linux-gnu/4.4.5/include-fixed/limits.h:122,
          [exec] from /usr/lib/gcc/i686-linux-gnu/4.4.5/include-fixed/syslimits.h:7,
          [exec] from /usr/lib/gcc/i686-linux-gnu/4.4.5/include-fixed/limits.h:11,
          [exec] from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/../../../../nativelib/lzo/lzo/lzoconf.h:52,
          [exec] from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/../../../../nativelib/lzo/lzo/lzo1.h:45,
          [exec] from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/Compress/LzoCompressor.hh:23,
          [exec] from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/Compress/LzoCodec.hh:27,
          [exec] from /home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/Compress/CompressionFactory.cc:23:
          [exec] /usr/include/bits/xopen_lim.h:95: error: missing binary operator before token "("
          [exec] /usr/include/bits/xopen_lim.h:98: error: missing binary operator before token "("
          [exec] /usr/include/bits/xopen_lim.h:122: error: missing binary operator before token "("
          [exec] make[1]: Leaving directory `/home/had/文档/HCE/bak/hadoop_hce_v1/hadoop-0.20.3/build/c++-build/Linux-i386-32/hce/impl/Compress'
          [exec] make[1]: *** [CompressionFactory.o] Error 1
          [exec] make: *** [install-recursive] Error 1
          "

          How can I fix this error?

          Mikhail Bautin added a comment -

          Hello HCE Developers,

          Would it be possible to post the most recent / stable version of HCE for download? It would be even better if you could continuously push your HCE code changes to e.g. a github repository.

          Thanks,
          Mikhail

          Dong Yang added a comment -

          Hi, Mikhail, Yihang

          I am so sorry that I can't post the most recent / stable version of HCE for download; some limitations prevent me from doing so.

          We have now redirected HCE to MAPREDUCE-2841 (Task level native optimization), which is the new implementation based on HCE and provides a greater performance improvement.

          We will contribute to MAPREDUCE-2841 continuously, so please watch that jira.

          Thanks,
          Dong


            People

            • Assignee:
              Unassigned
              Reporter:
              Wang Shouyan
            • Votes:
               7
              Watchers:
               55

              Dates

              • Created:
                Updated:

                Development