Ioannis Koltsidas added a comment - 17/Jun/08 09:02 PM
Initial Design Description
One challenging problem is to detect slow links/nodes. A description of the package and the implementation
The code for running FailMon outside Hadoop (as a stand-alone monitoring tool)
Patch for running FailMon on every NameNode and DataNode.
We have uploaded an initial version of our tool. Using the patch for the trunk code, one can run FailMon on every DataNode and NameNode. All data gathered are uploaded into HDFS.
Also provided is an OfflineAnonymizer that anonymizes system and Hadoop log files, so that they can be easily distributed. Details can be found in the attached FailMon_Package_Descrip.html. Our greatest concern now is to be able to identify real hardware failures from the gathered data. To that end, we need to gather as much data from real clusters as possible, so that we can see how all kinds of errors and failures are actually logged by the system and Hadoop. By correlating them, we will be able to systematically identify actual failures. So, you are very welcome to use our patch and share the collected data, and/or anonymize and share any log files you may already have.
Some comments from a quick look at the code
This is interesting, but I'd like to see it deployable as a standalone Service under the service code I'm putting together, rather than hidden under every kind of Hadoop service that can be brought up, and the polling worries me. Others may have different opinions.
Cool stuff!
1. It would be really nice to be able to deploy this without changing the namenode/datanode. One option would be to manually start your scheduler (which looks for the next report to be collected) and then run a map-reduce job to collect the statistics. Is this possible using your current code?
2. Regarding the format of the serialized logs, can we use an existing serialization format rather than inventing another one? One option would be to store them as Java properties (name, value pairs) and then serialize them using Java serialization. Another option would be to use Hadoop recordio (org.apache.hadoop.record.*).
3. Instead of calling it the failmon package, a better name could be logcollector or something more general. The logs could be used to detect failures, analyze performance of specific machines, correlate events of one machine with another, etc. In the same vein, it might make sense to rename all configurable property names to the form "logcollector.nic.list", "logcollector.sensors.interval", etc.
4. What happens when the framework tries to upload a file into HDFS but the HDFS file already exists?
Thanks very much for your input!
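The second suggestion above (store each record as Java properties, then use Java serialization) can be sketched roughly as follows. This is only an illustration, not FailMon's actual format: the class name `PropertiesRecordSketch` and the field names `hostname`, `source`, and `message` are hypothetical, and only standard `java.util` and `java.io` classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: one log record held as name/value pairs and
// round-tripped through standard Java serialization.
public class PropertiesRecordSketch {

    // Serialize a record to bytes using ObjectOutputStream.
    static byte[] serialize(Properties record) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read a record back from bytes produced by serialize().
    static Properties deserialize(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Properties) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Properties record = new Properties();
        // Field names are made up for illustration only.
        record.setProperty("hostname", "node-01.example.org");
        record.setProperty("source", "syslog");
        record.setProperty("message", "disk error");

        Properties copy = deserialize(serialize(record));
        System.out.println(copy.getProperty("hostname"));
    }
}
```

Java serialization keeps the code short, but the resulting bytes are only readable from Java, which is one reason the recordio alternative mentioned in the same point may be preferable.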
Regarding Steve's comments:
Regarding Dhruba's comments:
1. I agree with your idea, but I'm not sure how feasible it is. Some concerns about this approach:
2. I believe we can do that. I'll look into it.
3. I would be happy to change the name to whatever people think is more representative of the contents of the package. Maybe we can have a logcollector package and a failure monitoring subpackage (to capture that system utilities are also read, and for the failure identification code).
4. The filename of the uploaded HDFS file has the form failmon<hostname><timestamp>.zip, so filenames are expected to be unique. In the same context, the best thing to do, in my opinion, would be to append all locally gathered records to an HDFS file, provided that the upload can be in a compressed form. I'm not very familiar with the append API yet, and I am also not sure whether the communication can be compressed, but if it is feasible I think it would be the best way to go. In the current approach, if very small files are uploaded, a lot of space will be wasted (since the block size is large).
Looks like this will be complementary to the Chukwa project:
https://issues.apache.org/jira/browse/HADOOP-3719
Chukwa is an HDFS-based storage system for collecting and mining log data. Chukwa will provide simple APIs for applications to push log data (and metrics data, or any kind of semi-structured data) to the storage. Once the data gets to the storage, one can run map/reduce jobs or Pig jobs to mine the data. Currently, we are planning to implement a local agent that will collect the log files of Hadoop service processes (DataNodes, NameNodes, TaskTrackers, etc.) and push the data to Chukwa storage. This agent will be running on a machine outside of Hadoop processes. This agent may also be used for collecting system and other application metrics.
There seem to be two possible ways the failmon proposed in this Jira can work with Chukwa.
This effort seems independent of providing a distributed file system (as evidenced by the availability of a standalone version). Could the implementation be decoupled from the DataNode/NameNode daemons? Many users will find this sort of hardware failure detection useful for their entire set of hosts (including nodes that are not otherwise running any Hadoop daemons). Conversely, many Hadoop users will already be running software with similar functionality, and will not need or want Hadoop to provide it bundled with the DataNodes.
In that light, it seems like this would make more sense as a piece that can evolve independently of the Hadoop core releases, either as a sub-project or incubator project (I don't know what the Apache rules regarding those are) or as a contrib module (though that has the disadvantage of coupling the release cycles).
Rick: Chukwa opted to go for the separate-process design for reasons along the lines you lay out. I haven't studied the failmon code very closely, but it looks like most of what it does could be done pretty easily from a separate process.
Runping: My sense is that having the failmon data collection code wrapped in a Chukwa adaptor [the first option you mentioned] is more convenient. That approach avoids the complications of failmon log rotation, and removes some unneeded components. Failmon was written with a fairly similar programming model, so the work involved in merging the two efforts should be quite modest.
I think the capture stuff is independent of the nodes deployed, and so shouldn't be automatically started. When you run a whole cluster in a VM during testing, you'd be deploying many duplicate monitors. Better to have some switch on the command line like -failmon to turn failure monitoring on for that process; that switch could start a failure monitor service alongside the rest of the system.
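The opt-in switch proposed above could be handled along these lines. This is a minimal sketch under stated assumptions: the class `FailmonSwitchSketch` and its methods are hypothetical, the monitor is a placeholder thread, and nothing here reflects how the Hadoop daemons actually parse their arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an opt-in "-failmon" command-line switch.
public class FailmonSwitchSketch {

    static final String FAILMON_FLAG = "-failmon";

    // True if the caller asked for failure monitoring in this process.
    static boolean failmonRequested(String[] args) {
        return Arrays.asList(args).contains(FAILMON_FLAG);
    }

    // Returns the arguments without the flag, so the daemon's own
    // argument parser never sees an option it does not know.
    static String[] stripFlag(String[] args) {
        List<String> rest = new ArrayList<>();
        for (String arg : args) {
            if (!FAILMON_FLAG.equals(arg)) {
                rest.add(arg);
            }
        }
        return rest.toArray(new String[0]);
    }

    public static void main(String[] args) {
        if (failmonRequested(args)) {
            // Placeholder: a real deployment would start the monitor
            // service here instead of printing a message.
            Thread monitor = new Thread(
                () -> System.out.println("failure monitor started"),
                "failmon");
            monitor.start();
        }
        System.out.println("daemon args: "
            + Arrays.toString(stripFlag(args)));
    }
}
```

Because monitoring is off unless the flag is given, a test cluster that runs many daemons in one VM would start no monitors by default.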
Thanks for your comments. I think that integrating failmon with Chukwa and making it independent of TaskTrackers/DataNodes is the right direction. Having the failmon data collection code wrapped in a Chukwa adaptor seems a very good idea to me. We can discuss it in more detail when the initial version of Chukwa is posted...
Steve: Thanks for the suggestion, I'm working on this...
Failmon description & usage manual
It would be nice if we could do the following:
1. Remove the code changes from Namenode.java and DataNode.java. Instead, run this app from a bunch of shell scripts.
2. Move failmon.properties from conf to src/contrib/failmon/conf/ or something like that.
3. Make the code reside in src/contrib/failmon. Let it be a contrib project.
4. Write a junit test to test some amount of functionality. It could be based on standalone class testing.
5. Integrate with the overall build process so that "ant compile-contrib" builds FailMon too. Similarly, "ant test" should run the FailMon junit test(s).
6. Maybe some people from the Chukwa project should browse this code and give a +1.
Once these are done, we should check this in as a contrib project.
Release of FailMon as a contrib project, with some additional features and many bug fixes. Please refer to the user manual (failmon2.pdf) for a complete description and instructions for deployment and execution of FailMon, especially Section 4. File FailMon_QuickStart.html provides a guide to quickly set up and run FailMon. Here is the summary of changes we have made since the previous patch:
Failmon Description and User Manual
Curious observer's comment to the following statement:
"FailMon is now a contrib project and its code is decoupled from the Hadoop core."
Would it then make sense to package and publish this separately? Publishing it in Hadoop's contrib may hide it from those who could use failure monitoring outside Hadoop, but do not know to look for this gem in Hadoop's contrib.
Thanks for your comment, Otis. By "decoupled" I mean that it is not started directly by a Hadoop component, as it was in the initial version (then, it was started by NameNode.java, DataNode.java). However, since FailMon not only uses Hadoop, but also is tailored for Hadoop log collection, we believe it is a good idea for it to be part of the project (this will make it more visible to people running large clusters, since most of them use Hadoop).
In order to make it more visible (and more usable in the first place
Attached are a couple of threads of email conversation pertinent to this issue. In summary, there is a strong interest in committing both the FailMon and Chukwa projects and awaiting user feedback.
Ariel Rabkin <asrabkin@EECS.Berkeley.EDU> wrote on 08/04/2008 03:23:04 PM:
> As near as I could gather from the failmon code – and
Prasenjit Sarkar/Almaden/IBM wrote on 08/04/2008 03:19:45 PM:
> Jerome,
Comment from Mac Yang:
Mac Yang <macyang@yahoo-inc.com> wrote on 08/05/2008 09:05:28 AM: > +1 for review
Please add copyright notices on top of source files.
Chukwa-FailMon integration:
Regards,
Ok, we got lots of comments from many people and everybody seems to agree that FailMon should go into contrib. If HadoopQA tests are successful, I will check them in.
Ok, the changes have to be submitted as an "svn diff" file, not a zip file.
Please use
Submit patch for HadoopQA tests
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387782/HADOOP-3585.2.patch against trunk revision 683671.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to cause Findbugs to fail.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3035/testReport/
This message is automatically generated.
I get a compilation error:
init-contrib:
compile:
jar:
BUILD FAILED
Total time: 44 seconds
The unit tests have failed for some other reason (I think): http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3035/testReport/
Fixed findbugs errors and unit tests
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387992/HADOOP-3585.3.patch against trunk revision 685425.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated 1 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 279 release audit warnings (more than the trunk's current 274 warnings).
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3051/testReport/
This message is automatically generated.
As pointed out by https://issues.apache.org/jira/browse/HADOOP-3949
As pointed out by https://issues.apache.org/jira/browse/HADOOP-3950
It is not clear to me yet where the release audit warnings come from. Does anyone know? They are all for non-Java configuration files under /src/contrib/failmon/conf. Thanks
The audit warnings are due to the fact that the patch has added new files without the Apache License. Please fix them, thanks!
Then why don't I get audit warnings for the README file?
Do I need to add the Apache License to configuration files? Other contrib projects (e.g. Chukwa) have configuration files without the license (as does the Hadoop core itself). Thanks
I just committed this. Thanks Ioannis!
I added the license to the beginning of the log4j and other properties files.
It seems that the committed patch causes some javadoc warnings. See
Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)