First, Arun and Todd, thank you both for your honest opinions! You are both respected!
I believe our differences will narrow once we are looking at the same facts, so I'd like to state the facts and clear up some confusion:
1. How many lines of code are there, exactly?
Here is a breakdown for branch https://github.com/intel-hadoop/nativetask/tree/native_output_collector:
Java code (*.java): 122 files, 8057 lines
  - nativetask: 62 files, 4080 lines
  - nativetask unit tests: 14 files, 1222 lines
  - other platforms (Pig/Mahout/HBase/Hive): 25 files, 477 lines
  - scenario tests: 21 files, 2278 lines
Native code (*.h, *.cc): 128 files, 47048 lines
  - nativetask: 85 files, 11713 lines
  - nativetask unit tests: 33 files, 4911 lines
  - other platforms (Pig/Mahout/HBase/Hive): 2 files, 1083 lines
  - third-party gtest lib header files: 3 files, 28699 lines
  - third-party lz4/snappy/cityhash: 5 files, 642 lines
(Note: license header lines are excluded from these counts; blank lines and other comments are included.)
If we measure LOC as a proxy for code complexity, then:
  - Third-party code such as the Google Test header files should not be counted; the gtest header alone is 28699 lines.
  - Pig/Mahout/HBase/Hive code will eventually be removed from the repository, so it should not be counted.
  - Scenario test code may also be excluded, since new scenario tests can always be written.
So after these deductions, the effective code is:
  - NativeTask source code (Java + native C++): 15793 lines
  - NativeTask unit test code (Java + native C++): 6133 lines
2. Is this patch an alternate implementation of the MapReduce runtime, like TEZ?
No. The whole purpose of this submission is to act as a map output collector that transparently improves MapReduce performance, NOT to be a new MR engine.
The code is posted at branch https://github.com/intel-hadoop/nativetask/tree/native_output_collector; it only includes the code for the map output collector.
3. Why is there Pig/Mahout/HBase/Hive code in the NativeTask source?
As I commented above, we are working on removing the platform (Hive/Pig/HBase/Mahout) code from the NativeTask source and providing it as standalone jars.
We rushed to post the link without a full cleanup so that we could get some early feedback from the community.
4. Is the full native runtime included?
No, the full native runtime is not included in this patch; the related code has been stripped out. The repo https://github.com/intel-hadoop/nativetask/tree/native_output_collector only contains code for the transparent collector.
5. Is there an intention to contribute the full native runtime mode to Hadoop, or will it become a separate project?
It is not the purpose of this patch to support a full native runtime mode; the goal is to make existing MR jobs run better on modern CPUs via a native map output collector.
Full native runtime mode is another topic. There is a long way to go before it would be ready for submission, and we don't want to consider it now.
6. Are there interface compatibility issues?
This patch is not about the full native runtime mode, which supports native mappers and reducers.
This patch is only about a custom map output collector running in transparent mode. We use the existing Java interfaces, and people keep running their Java mappers/reducers without recompilation. A user makes a small configuration change to enable the NativeTask collector, and when there is a case NativeTask doesn't support, it simply falls back to the default MR implementation.
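For reference, the small configuration change is along these lines; the property and class names below reflect the current patch as I understand it and may change before final submission:

```xml
<!-- Enable the NativeTask map output collector for a job.
     Property and class names are illustrative of the current patch. -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
</property>
```

If the delegator hits a case it doesn't support, the job uses the default Java collector, so existing mappers and reducers run unchanged.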
8. Are there C++ ABI issues?
The concern makes sense.
Regarding the ABI: if a user doesn't need a custom key comparator, he never needs to implement a native comparator against nativetask.so, so there is no ABI issue.
If the user does want to write a native comparator, the NativeTask native interface involved is very limited, only:
typedef int (*ComparatorPtr)(const char * src, uint32_t srcLength, const char * dest, uint32_t destLength);
However, the current code requires the user to include the whole "NativeTask.h", which contains more than the typedef above.
We will work on making "NativeTask.h" expose only the minimum necessary API. Once we do that, there should be no significant ABI issue.
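To illustrate how small that surface is, here is a hypothetical user-supplied comparator matching the ComparatorPtr signature above; the function name and the byte-wise ordering are my own choices for the example, not part of NativeTask:

```cpp
#include <cstdint>
#include <cstring>
#include <algorithm>

// The only NativeTask interface a custom comparator needs to match.
typedef int (*ComparatorPtr)(const char * src, uint32_t srcLength,
                             const char * dest, uint32_t destLength);

// Example implementation: lexicographic comparison of raw key bytes,
// comparing the common prefix first, then breaking ties by length.
int BytesComparator(const char * src, uint32_t srcLength,
                    const char * dest, uint32_t destLength) {
  uint32_t minLen = std::min(srcLength, destLength);
  int cmp = std::memcmp(src, dest, minLen);
  if (cmp != 0) {
    return cmp;  // common prefix already decides the order
  }
  // Equal prefixes: the shorter key sorts first.
  return static_cast<int>(srcLength) - static_cast<int>(destLength);
}

// Usage sketch: hand the function to NativeTask through the typedef,
// e.g.  ComparatorPtr cmp = BytesComparator;
```

Because the contract is a single C-style function pointer over raw bytes and lengths, there is no C++ name mangling or class layout crossing the library boundary, which is why the ABI exposure stays small.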
8. How can you ensure the quality of this code?
The code has been under active development for more than a year. It has been used and tested in production for a long time, and there is also a full set of unit tests and scenario tests for coverage.
9. Could this be done in TEZ instead?
We believe it is good for MapReduce. We know people are still using MapReduce, and that justifies submitting the patch.
Whether TEZ can benefit from this feature should not affect the decision made here, as they are two separate projects.
If the value of this patch is appreciated, we are totally open to ALSO porting it to TEZ. We may want your help with that.
10. How can you ensure the code will be well maintained in the future, given that this bug was opened so long ago and not well maintained?
If you check the commit history at https://github.com/intel-hadoop/nativetask/commits/native_output_collector, you will find that the code has been actively and continuously developed over the years, in public. To date there are 319 commits. I personally have worked on this project since the end of 2012.
We post it here now because we believe that, after all these commits, the code quality has reached a level worth contributing back formally to Hadoop. The code is stable and has been well tested in production environments. We are very serious about this and will definitely take responsibility for maintaining it. Besides, we have Todd's help on this.
11. Why do you still work on MapReduce, when new investment is no longer going into MapReduce?
We believe that as long as there is still a big user base running MapReduce in their daily jobs, no matter how slowly MR features evolve or where new investment goes, there is value in contributing to it. Intel has a history of supporting decades-old projects.
We are also open to applying the "nativetask" ideas to other projects, but that is not directly related to this JIRA.
12. How is encrypted shuffle supported in NativeTask?
First, there is a flag to enable/disable the native output collector feature, under the end user's full control.
Second, if there is a case NativeTask doesn't support, it simply falls back to the default MapReduce implementation.
For encrypted shuffle, we fall back to the default MR implementation. However, we are totally open to supporting further cases like this as long as there are still performance benefits; otherwise we just fall back.
13. Why is the code on github at https://github.com/intel-hadoop/nativetask/tree/native_output_collector?
It is just a place for early code review and early feedback. We will submit it as a standard patch after we clean out the Hive/Pig platform code.
14. Why not make it an incubator project instead of putting it in the MR code base?
a) It is good for MapReduce; it helps MapReduce run better.
b) More MapReduce users can use it. It is generally easier for a user to upgrade Hadoop and use this feature if it is provided in the Hadoop package.
c) The performance benefit for MapReduce is big. We believe low-level CPU optimization is still important, even knowing that there are other optimizations of the computation flow such as TEZ, Spark, Impala, etc.
d) This map output collector is transparent, so it is not unreasonable to provide it as an alternative collector in the Hadoop package. There is precedent for this; for example, the fair scheduler is bundled in a similar way.
e) The risk is low. NativeTask falls back to the default MR implementation if a case is not supported or something goes wrong.
f) The change is small. Only about 10 lines of code change are required in Hadoop, so we expect no possibility of regression. The patch for Hadoop is provided at:
https://github.com/intel-hadoop/nativetask/blob/native_output_collector/patch/hadoop-2.patch The rest of the NativeTask code is self-contained and won't pollute MapReduce.
At runtime, we only require deploying the NativeTask jars into the Hadoop directory at installation time.
These considerations lead us to feel that it is better to do this in Hadoop rather than as a new project.
I interpreted the questions above from the comments in a rather subjective manner, so I suspect some points are missing or wrongly interpreted and need further elaboration or correction. Which are they?
I am sure there are still concerns about this patch, but before we say no and reject it, maybe we should pause and think:
Is there still a big user base using MapReduce? Is this patch valuable for those users? Is the risk introduced by this patch controlled? Can these concerns be addressed by improving the patch?
In the Hadoop ecosystem there is a lot of high-level innovation, such as TEZ, Impala, Spark, etc., but far less innovation or optimization at the micro-architecture level. This patch is an attempt in that direction.